Using Regular expressions in Python
1. Reference
http://docs.python.org/library/re.html
2. Usage
NOTE: Each of the re.xxx(pattern, ...) functions below also has a compiled regex object equivalent, which is useful if you use the same pattern multiple times.
For example, these two are equivalent:
## 1. Using the re.xxx function
>>> import re
>>> re.findall(r'\w+', 'Mary had a little lamb')
['Mary', 'had', 'a', 'little', 'lamb']
## 2. regex object equivalent
# First compile the regular expression
>>> r = re.compile(r'\w+')
# Then use it similarly to the first method, without the pattern this time.
>>> r.findall('Mary had a little lamb')
['Mary', 'had', 'a', 'little', 'lamb']
Note that the performance difference becomes noticeable only when the re.xxx functions are called a large number of times, because the re module caches recently compiled patterns.
Here is a comparison in which each of the two alternatives is called a million times (the default repeat count of timeit.timeit).
>>> import timeit
>>> timeit.timeit(r"re.findall(r'\w+', 'Mary had a little lamb')", "import re")
3.5729238986968994
>>> timeit.timeit(r"r.findall('Mary had a little lamb')", r"import re;r = re.compile(r'\w+')")
1.9833419322967529
>>> 3.5729238986968994 / 1.9833419322967529
1.8014664241779914
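The same comparison can be scripted. The sketch below is only an illustration: `compare_timings` is a hypothetical helper (not part of `re`), it uses a smaller repeat count than the session above, and it shows no expected numbers because timings depend on the machine and Python version.

```python
import re
import timeit

TEXT = 'Mary had a little lamb'
WORD = re.compile(r'\w+')  # compiled once, reused on every call


def compare_timings(number=100000):
    """Return (module_level_seconds, compiled_seconds) for `number` calls each."""
    module_level = timeit.timeit(lambda: re.findall(r'\w+', TEXT), number=number)
    compiled = timeit.timeit(lambda: WORD.findall(TEXT), number=number)
    return module_level, compiled


if __name__ == '__main__':
    slow, fast = compare_timings()
    print('re.findall: %.3fs, compiled.findall: %.3fs' % (slow, fast))
```

Both forms return the same list; only the per-call overhead differs.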
2.1. Searching
2.1.1. re.search()
Scans the entire text and returns a match object for the first match only, or None if nothing matches.
>>> import re
# Searching for the first word that is five or more characters long
>>> m = re.search( r'(\w{5,})' , 'Mary had a little lamb.' )
>>> m
<_sre.SRE_Match object at 0xb77fb8a0>
>>> m.re.pattern
'(\\w{5,})'
>>> m.string
'Mary had a little lamb.'
>>> m.start(), m.end()
(11, 17)
>>> m.span()  # same as (m.start(), m.end())
(11, 17)
>>> m.string[m.start():m.end()]
'little'
>>> m.groups()
('little',)
>>> m.group(0)
'little'
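Because re.search() returns None when nothing matches, calling code usually tests the result before using it. A minimal sketch, where `first_long_word` is a hypothetical helper name chosen for illustration:

```python
import re


def first_long_word(text, min_len=5):
    """Return the first run of at least `min_len` word characters, or None."""
    m = re.search(r'\w{%d,}' % min_len, text)
    # re.search returns None when nothing matches, so test before using m.
    return m.group(0) if m else None


print(first_long_word('Mary had a little lamb.'))  # little
print(first_long_word('a bb ccc'))                 # None
```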
2.1.2. re.match()
Tries to match only at the beginning of the text. Returns None if the pattern does not match there, even if it would match further along.
>>> import re
>>> m = re.match( r'(\w{5,})' , 'Mary had a little lamb.' )
>>> m
>>> m is None
True
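The difference from re.search() is easiest to see side by side. The following sketch reuses the sample sentence with a few illustrative patterns that are not from the text above:

```python
import re

text = 'Mary had a little lamb.'
pattern = r'\w{4,}'

at_start = re.match(pattern, text)      # only tries position 0
anywhere = re.search(pattern, text)     # scans forward from position 0
no_match = re.match(r'little', text)    # 'little' is not at the start

print(at_start.group(0))   # Mary
print(anywhere.group(0))   # Mary
print(no_match)            # None
print(re.search(r'^little', text))  # None: '^' anchors search the same way
```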
2.1.3. re.findall()
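Searches throughout the text and returns all non-overlapping matches.
Returns a list of strings. If the pattern contains one capturing group, the list holds that group's text; if it contains several groups, the list holds one tuple of groups per match.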
>>> import re
>>> re.findall( r'(\w{4,})' , 'Mary had a little lamb.' )
['Mary', 'little', 'lamb']
>>> re.findall( r'(\d+) (plus|minus) (\d+)' , '20 plus 40 equals 60' )
[('20', 'plus', '40')]
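How the return value depends on the number of capturing groups can be sketched as follows. The patterns are illustrative variations on the example above:

```python
import re

text = '20 plus 40 equals 60'

no_group = re.findall(r'\d+', text)                    # whole matches
one_group = re.findall(r'(\d+) plus', text)            # contents of the single group
two_groups = re.findall(r'(\d+) (plus|equals)', text)  # one tuple per match

print(no_group)    # ['20', '40', '60']
print(one_group)   # ['20']
print(two_groups)  # [('20', 'plus'), ('40', 'equals')]
```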
2.1.4. re.finditer()
Search throughout the text, as many times as possible. Same purpose as re.findall but gives you more detail by returning match objects instead of raw matched strings.
import re

text = '40 plus 20 equals 60 and 40 minus 20 equals 20'

# Named groups (?P<name>...) are exposed through m.groupdict()
for m in re.finditer(r'(?P<left>\d+) (?P<op>plus|minus) (?P<right>\d+)', text):
    print("Text matched = %s" % m.group(0))
    print("Operands = %(left)s , %(right)s" % m.groupdict())
    print("Operator = %(op)s" % m.groupdict())
    print("")

$ ./re_test.py
Text matched = 40 plus 20
Operands = 40 , 20
Operator = plus

Text matched = 40 minus 20
Operands = 40 , 20
Operator = minus
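Since each item is a match object, positions and named groups are available for further processing. A minimal sketch, assuming a hypothetical helper `evaluate_all` that computes each expression's value:

```python
import re

EXPR = re.compile(r'(?P<left>\d+) (?P<op>plus|minus) (?P<right>\d+)')


def evaluate_all(text):
    """Return (matched_text, start_offset, value) for each expression found."""
    results = []
    for m in EXPR.finditer(text):
        left, right = int(m.group('left')), int(m.group('right'))
        value = left + right if m.group('op') == 'plus' else left - right
        results.append((m.group(0), m.start(), value))
    return results


print(evaluate_all('40 plus 20 equals 60 and 40 minus 20 equals 20'))
# [('40 plus 20', 0, 60), ('40 minus 20', 25, 20)]
```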
2.2. Splitting
2.2.1. re.split()
Split text into substrings based on regex boundaries.
# Normal use
# Notice how having the split char at the beginning and end of the text
# adds empty strings.
>>> re.split(r'\W+', '<a href="www.yahoo.com">www.yahoo.com</a>')
['', 'a', 'href', 'www', 'yahoo', 'com', 'www', 'yahoo', 'com', 'a', '']
# Using capturing groups, returns the boundary text also in the result.
>>> re.split(r'(\W+)', '<a href="www.yahoo.com">www.yahoo.com</a>')
['', '<', 'a', ' ', 'href', '="', 'www', '.', 'yahoo', '.', 'com', '">', 'www', '.', 'yahoo', '.', 'com', '</', 'a', '>', '']
# Restrict the number of splits to N with maxsplit. After N splits, the rest of the text is returned unsplit.
# So you get (N+1) members in the returned list
>>> re.split(r'\W+', '<a href="www.yahoo.com">www.yahoo.com</a>', maxsplit=6)
['', 'a', 'href', 'www', 'yahoo', 'com', 'www.yahoo.com</a>']
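When the empty strings at the edges are not wanted, they can be filtered out after the split. A small sketch, with `words` as a hypothetical helper name:

```python
import re


def words(text):
    """Split on runs of non-word characters and drop empty edge strings."""
    return [piece for piece in re.split(r'\W+', text) if piece]


print(words('<a href="www.yahoo.com">www.yahoo.com</a>'))
# ['a', 'href', 'www', 'yahoo', 'com', 'www', 'yahoo', 'com', 'a']
```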
2.3. Search & Replace
2.3.1. re.sub()
Search for a pattern in the text and replace each match with simple text, with a template that refers back to captured groups (\1, \2, ...), or with the return value of a user-defined function.
>>> import re
# pattern substituted with pattern
#
# Pseudo HTML-to-bbcode conversion
>>> text = '<a href="www.yahoo.com">www.yahoo.com</a>'
>>> search_pat = r'<(/?)(\w+)(.*?)>'
>>> replace_pat = r'[\1\2\3]'
>>> print(re.sub(search_pat, replace_pat, text))
[a href="www.yahoo.com"]www.yahoo.com[/a]
# pattern substituted with function
#
# Upcase all HTML tags
# The function must return the whole replacement, including the angle brackets
>>> replace_func = lambda m: '<' + m.group(1) + m.group(2).upper() + m.group(3) + '>'
>>> print(re.sub(search_pat, replace_func, text))
<A href="www.yahoo.com">www.yahoo.com</A>
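Both replacement styles also work with named groups and with a compiled pattern. The sketch below uses made-up date strings and a hypothetical `next_year` function purely for illustration:

```python
import re

DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
text = 'Released 2012-03-04, patched 2012-11-30.'

# Replacement template: \g<name> refers to a named group.
print(DATE.sub(r'\g<day>/\g<month>/\g<year>', text))
# Released 04/03/2012, patched 30/11/2012.


def next_year(m):
    """Replacement function: receives the match object, returns the new text."""
    return '%d-%s-%s' % (int(m.group('year')) + 1, m.group('month'), m.group('day'))


# count=1 replaces only the first match
print(DATE.sub(next_year, text, count=1))
# Released 2013-03-04, patched 2012-11-30.
```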
3. Flags
Each of the re.xxx() functions also takes an optional flags parameter, which controls how the regex pattern is interpreted.
>>> import re
>>> text = 'All that glitters is not gold. Is it?'
>>> pattern = '[aeiou]+'
>>> re.findall(pattern, text)
['a', 'i', 'e', 'i', 'o', 'o', 'i']
# Now with the flag re.I, which makes the matching case-insensitive
>>> re.findall(pattern, text, re.I)
['A', 'a', 'i', 'e', 'i', 'o', 'o', 'I', 'i']
# You can also embed the flag within the regex pattern
>>> pattern_i = '(?i)[aeiou]+'
>>> re.findall(pattern_i, text)
['A', 'a', 'i', 'e', 'i', 'o', 'o', 'I', 'i']
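Flags can also be passed to re.compile(), in which case they apply to every call made through the compiled object. A brief sketch using the same text:

```python
import re

text = 'All that glitters is not gold. Is it?'

vowels = re.compile(r'[aeiou]+', re.IGNORECASE)  # flag fixed at compile time
print(vowels.findall(text))
# ['A', 'a', 'i', 'e', 'i', 'o', 'o', 'I', 'i']

# The compiled object records the flags it was built with.
print(bool(vowels.flags & re.IGNORECASE))  # True
```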
The flags are:
re.I or re.IGNORECASE : m/ /i in Perl. Ignore case while matching text.
re.M or re.MULTILINE : m/ /m in Perl. '^' and '$' match at the beginning and end of each newline (\n) separated line in the text, not just at the beginning and end of the whole text.
>>> print(re.sub('^', '* ', 'First line\nSecond line'))
* First line
Second line
>>> print(re.sub('(?m)^', '* ', 'First line\nSecond line'))
* First line
* Second line
re.S or re.DOTALL : m/ /s in Perl. '.' matches even embedded newlines.
>>> text = '<a href="http://yahoo.com">This is the\nhome page\nof yahoo</a>'
>>> pattern=r'<[aA].*?>(.*?)</[aA]>'
>>> re.findall(pattern,text)
[]
>>> re.findall(pattern,text, re.S)
['This is the\nhome page\nof yahoo']
re.X or re.VERBOSE : m/ /x in Perl. Ignore whitespace and '#' comments in the pattern unless escaped with a '\'.
>>> text = '<a href="http://yahoo.com">This is the\nhome page\nof yahoo</a>'
>>> pattern = r'''<[aA].*?>  # opening tag
... (.*?)                    # enclosed text
... </[aA]>                  # closing tag
... '''
>>> re.findall(pattern, text, re.S | re.X)  # flags are combined with |
['This is the\nhome page\nof yahoo']
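Verbose mode combines well with named groups and a compiled pattern. The sketch below is one possible way to write the link example; the group names `url` and `label` are illustrative choices, and the pattern is not meant as a general HTML parser:

```python
import re

LINK = re.compile(r'''
    <a\s+href="(?P<url>[^"]*)">   # opening tag; capture the href value
    (?P<label>.*?)                # link text, as short as possible
    </a>                          # closing tag
    ''', re.VERBOSE | re.DOTALL | re.IGNORECASE)

text = '<a href="http://yahoo.com">This is the\nhome page\nof yahoo</a>'
m = LINK.search(text)
print(m.group('url'))          # http://yahoo.com
print(repr(m.group('label')))  # 'This is the\nhome page\nof yahoo'
```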
