Using Regular Expressions in Python

1. Reference

http://docs.python.org/library/re.html

2. Usage

NOTE: Each of the re.xxx(pattern, ...) functions below also has a compiled regex object equivalent, which is useful when you use the same pattern many times.

For example, these two are equivalent.

## 1. Using the re.xxx function
>>> import re
>>> re.findall(r'\w+', 'Mary had a little lamb')
['Mary', 'had', 'a', 'little', 'lamb']



## 2. regex object equivalent 

# First compile the regex
>>> r = re.compile(r'\w+')

# Then use it similarly to the first method, without the pattern this time.
>>> r.findall('Mary had a little lamb')
['Mary', 'had', 'a', 'little', 'lamb']

Remember, though, that the performance difference becomes noticeable only when the re.xxx functions are called a large number of times.

Here is a comparison in which each of the two alternatives is called a million times (the default number of runs for timeit.timeit).

>>> import timeit
>>> timeit.timeit(r"re.findall(r'\w+', 'Mary had a little lamb')", "import re")
3.5729238986968994

>>> timeit.timeit("r.findall('Mary had a little lamb')", r"import re; r = re.compile(r'\w+')")
1.9833419322967529

# Ratio of the two timings: the compiled version is about 1.8 times faster here
>>> 3.5729238986968994 / 1.9833419322967529
1.8014664241779914
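As a minimal sketch of where compiling pays off, the example below compiles the pattern once and reuses it inside a loop. The names WORD_RE and count_words, and the second sample line, are illustrative only and not part of the re module or the examples above.

```python
import re

# Compile once, outside the loop (WORD_RE is an illustrative name).
WORD_RE = re.compile(r'\w+')

def count_words(lines):
    """Return the total number of word matches across all lines."""
    total = 0
    for line in lines:
        # Reuse the compiled regex object for every line.
        total += len(WORD_RE.findall(line))
    return total

lines = ['Mary had a little lamb', 'Its fleece was white as snow']
total = count_words(lines)
```

For a handful of calls the plain re.xxx functions are just as good; the compiled form mainly helps in loops like this one.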

2.1. Searching

2.1.1. re.search()

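Scans the entire text and stops at the first match, wherever it occurs. Returns a match object, or None if the pattern does not match anywhere.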

>>> import re

# Search for the first word that is at least 5 characters long
>>> m = re.search(r'(\w{5,})', 'Mary had a little lamb.')

>>> m
<_sre.SRE_Match object at 0xb77fb8a0>

>>> m.re.pattern
'(\\w{5,})'

>>> m.string
'Mary had a little lamb.'

>>> m.start(), m.end()
(11, 17)

>>> m.span()  # same as (m.start(), m.end())
(11, 17)

>>> m.string[m.start():m.end()]
'little'

>>> m.groups()  # all captured groups, as a tuple
('little',)

>>> m.group(0)  # the whole match; m.group(1) is the first captured group
'little'
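Because re.search() returns None when nothing matches, code that uses the match object usually checks for that first. A minimal sketch, where first_long_word is a hypothetical helper written for this page:

```python
import re

def first_long_word(text, min_len=5):
    """Return the first word of at least min_len characters, or None."""
    m = re.search(r'\w{%d,}' % min_len, text)
    # re.search returns None when the pattern matches nowhere,
    # so test the result before calling methods on it.
    if m is None:
        return None
    return m.group(0)

found = first_long_word('Mary had a little lamb.')   # 'little'
missing = first_long_word('Mary had a lamb.')        # None
```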

2.1.2. re.match()

Matches only at the beginning of the text. Returns None if the pattern does not match there, even if it would match further in.

>>> import re
>>> m = re.match( r'(\w{5,})' , 'Mary had a little lamb.' )
>>> m  # prints nothing, because m is None
>>> m is None
True
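The difference between re.match() and re.search() is easiest to see side by side on the same text. The variable names below are illustrative only.

```python
import re

text = 'Mary had a little lamb.'

# re.match succeeds only if the pattern matches starting at index 0.
at_start = re.match(r'\w{4,}', text)       # 'Mary' is at the start
not_at_start = re.match(r'little', text)   # 'little' is further in: None

# re.search scans forward until it finds a match.
anywhere = re.search(r'little', text)
```

Here at_start.group(0) is 'Mary', not_at_start is None, and anywhere.span() is (11, 17).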

2.1.3. re.findall()

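Finds all non-overlapping matches in the text.

Returns a list of strings. If the pattern contains a single group, the list holds the text of that group for each match. If the pattern contains several groups, the list holds one tuple of groups per match.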

>>> import re

>>> re.findall( r'(\w{4,})' , 'Mary had a little lamb.' )
['Mary', 'little', 'lamb']

>>> re.findall( r'(\d+) (plus|minus) (\d+)' , '20 plus 40 equals 60' )
[('20', 'plus', '40')]
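How the number of groups in the pattern changes the result can be sketched with three variations of the same pattern (the variable names are illustrative only):

```python
import re

text = '20 plus 40 equals 60'

# No groups: each item is the whole match.
no_group = re.findall(r'\d+ plus \d+', text)

# One group: each item is the text of that group only.
one_group = re.findall(r'(\d+) plus \d+', text)

# Several groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\d+) plus (\d+)', text)
```

The three results are ['20 plus 40'], ['20'] and [('20', '40')] respectively.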

2.1.4. re.finditer()

Finds all non-overlapping matches in the text, like re.findall(), but returns an iterator of match objects instead of raw matched strings, which gives you more detail about each match.

import re

text = '40 plus 20 equals 60 and 40 minus 20 equals 20'

# Named groups, (?P<name>...), make each part of the match available by name
for m in re.finditer(r'(?P<left>\d+) (?P<op>plus|minus) (?P<right>\d+)', text):
    print("Text matched = %s" % m.group(0))
    print("Operands = %(left)s , %(right)s" % m.groupdict())
    print("Operator = %(op)s" % m.groupdict())
    print("")

$ ./re_test.py
Text matched = 40 plus 20
Operands = 40 , 20
Operator = plus

Text matched = 40 minus 20
Operands = 40 , 20
Operator = minus
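Since each item is a match object, the position and the named groups can be used together. The sketch below evaluates each expression it finds; evaluate is a hypothetical helper, not part of the re module.

```python
import re

text = '40 plus 20 equals 60 and 40 minus 20 equals 20'
pattern = re.compile(r'(?P<left>\d+) (?P<op>plus|minus) (?P<right>\d+)')

def evaluate(text):
    """Return (start index, matched text, computed value) for each match."""
    results = []
    for m in pattern.finditer(text):
        left = int(m.group('left'))
        right = int(m.group('right'))
        value = left + right if m.group('op') == 'plus' else left - right
        results.append((m.start(), m.group(0), value))
    return results

results = evaluate(text)
```

With re.findall() the start index would not be available, because only strings are returned.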

2.2. Splitting

2.2.1. re.split()

Split text into substrings based on regex boundaries.

# Normal use.
# Notice how a separator at the beginning and end of the text
# adds empty strings to the result.
>>> re.split(r'\W+', '<a href="www.yahoo.com">www.yahoo.com</a>')
['', 'a', 'href', 'www', 'yahoo', 'com', 'www', 'yahoo', 'com', 'a', '']


# With a capturing group in the pattern, the separator text is also returned in the result.
>>> re.split(r'(\W+)', '<a href="www.yahoo.com">www.yahoo.com</a>')
['', '<', 'a', ' ', 'href', '="', 'www', '.', 'yahoo', '.', 'com', '">', 'www', '.', 'yahoo', '.', 'com', '</', 'a', '>', '']



# Limit the number of splits with maxsplit=N. After N splits, the rest of the
# text is returned unsplit, so the list has at most (N+1) items.
>>> re.split(r'\W+', '<a href="www.yahoo.com">www.yahoo.com</a>', maxsplit=6)
['', 'a', 'href', 'www', 'yahoo', 'com', 'www.yahoo.com</a>']
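The empty strings at either end are often unwanted. One way to drop them, shown here only as a sketch, is to filter the result; for this particular pattern, matching the words with re.findall() gives the same list.

```python
import re

html = '<a href="www.yahoo.com">www.yahoo.com</a>'

# Keep only the non-empty pieces of the split.
words = [w for w in re.split(r'\W+', html) if w]

# Alternative: match the words directly instead of splitting on the gaps.
same_words = re.findall(r'\w+', html)
```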

2.3. Search & Replace

2.3.1. re.sub()

Search for a pattern in the text and replace each match with plain text, with a replacement template that refers to the matched groups (\1, \2, ...), or with the return value of a user-defined function.

>>> import re

# Pattern replaced with a template that refers to the matched groups
#
# Pseudo HTML-to-bbcode conversion

>>> text = '<a href="www.yahoo.com">www.yahoo.com</a>'

>>> search_pat = r'<(/?)(\w+)(.*?)>'

>>> replace_pat = r'[\1\2\3]'

>>> print(re.sub(search_pat, replace_pat, text))
[a href="www.yahoo.com"]www.yahoo.com[/a]


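# Pattern replaced with the return value of a function,
# which is called with the match object of each match
#
# Upcase all HTML tags

# The function must return the full replacement, including the < and >
>>> replace_func = lambda m: '<' + m.group(1) + m.group(2).upper() + m.group(3) + '>'
>>> print(re.sub(search_pat, replace_func, text))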
<A href="www.yahoo.com">www.yahoo.com</A>
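Two related details of the re module are worth a short sketch: the count argument of re.sub() limits how many matches are replaced, and re.subn() also reports how many replacements were made. The variable names are illustrative only.

```python
import re

text = '<a href="www.yahoo.com">www.yahoo.com</a>'
search_pat = r'<(/?)(\w+)(.*?)>'
replace_pat = r'[\1\2\3]'

# count=1 replaces only the first match and leaves the rest untouched.
first_only = re.sub(search_pat, replace_pat, text, count=1)

# re.subn returns a tuple: (new text, number of replacements made).
new_text, n = re.subn(search_pat, replace_pat, text)
```

Here first_only still ends in </a>, and n is 2 because both tags were replaced.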

3. Flags

Each of the re.xxx() functions also takes an optional flags parameter, which changes how the regex pattern is interpreted.

>>> import re

>>> text = 'All that glitters is not gold. Is it?'
>>> pattern = '[aeiou]+'
>>> re.findall(pattern, text)
['a', 'i', 'e', 'i', 'o', 'o', 'i']

# Now with the flag re.I (re.IGNORECASE), which makes the matching case-insensitive
>>> re.findall(pattern, text, re.I)
['A', 'a', 'i', 'e', 'i', 'o', 'o', 'I', 'i']

# You can also embed the flag within the regex pattern itself
>>> pattern_i = '(?i)[aeiou]+'
>>> re.findall(pattern_i, text)
['A', 'a', 'i', 'e', 'i', 'o', 'o', 'I', 'i']
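Flags can be combined with the | operator and can also be passed to re.compile(). The sketch below adds re.M (re.MULTILINE), which makes ^ match at the start of every line; the multi-line variant of the text and the variable names are made up for this example.

```python
import re

text = 'All that glitters\nis not gold.\nIs it?'

# re.I ignores case; re.M lets ^ match at the start of each line.
line_starts = re.findall(r'^[aeiou]\w*', text, re.I | re.M)

# The same flags given once, when compiling the pattern.
vowel_word = re.compile(r'^[aeiou]\w*', re.I | re.M)
compiled_result = vowel_word.findall(text)
```

Both calls return the first word of each line that starts with a vowel: ['All', 'is', 'Is'].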

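The most commonly used flags are:

re.I (re.IGNORECASE)   case-insensitive matching
re.M (re.MULTILINE)    ^ and $ also match at the start and end of each line
re.S (re.DOTALL)       . also matches a newline
re.X (re.VERBOSE)      whitespace and # comments in the pattern are ignored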

Python/Regex (last edited 2010-09-07 15:07:00 by SandipBhattacharya)