Regular expressions provide a flexible way to search or match string patterns
in text. A single expression, commonly called a regex, is a string formed
according to the regular expression language. Python’s built-in re module is
responsible for applying regular expressions to strings.
In this blog, I’ll first introduce regular expression syntax, and then apply it in some examples.
Syntax
Special characters
Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.
.
(dot) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
^
(caret) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
$
Matches the end of the string or just before the newline at the end of the string.
*
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
+
Causes the resulting RE to match 1 or more repetitions of the preceding RE.
?
Causes the resulting RE to match 0 or 1 repetition of the preceding RE.
{m,n}
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 ‘a’ characters.
[]
Used to indicate a set of characters (a character class).
|
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
()
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.
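As a rough illustration of a few of these special characters, here is a minimal sketch. The sample strings are made up for this illustration and are not taken from the original examples.

```python
import re

text = "cat\ncot"  # hypothetical two-line sample

# '.' skips newlines by default, but matches them with DOTALL
dot_default = re.findall(r"t.c", text)                 # []
dot_all = re.findall(r"t.c", text, flags=re.DOTALL)    # ['t\nc']

# '^' anchors at the start of the string, or of every line with MULTILINE
caret_default = re.findall(r"^c.t", text)                      # ['cat']
caret_multi = re.findall(r"^c.t", text, flags=re.MULTILINE)    # ['cat', 'cot']

# 'A|B' matches either alternative
either = re.findall(r"cat|cot", text)                  # ['cat', 'cot']

# '{m,n}' is greedy: it takes as many repetitions as it can, up to n
runs = re.findall(r"a{3,5}", "aa aaa aaaaaa")          # ['aaa', 'aaaaa']
```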
Special sequences
The following sequences can also be included inside a character class. The equivalences given here hold for ASCII text; with Python 3 str patterns these sequences also match the corresponding Unicode characters unless the re.ASCII flag is set.
\d
Matches any decimal digit; this is equivalent to the class [0-9].
\D
Matches any non-digit character; this is equivalent to the class [^0-9].
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any word character (a letter, a digit or an underscore); this is equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-word character; this is equivalent to the class [^a-zA-Z0-9_].
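The sequences above can be tried out with a short sketch like the following. The sample string is an assumption chosen only to contain digits, an underscore and a tab.

```python
import re

sample = "Room 42, floor_3\tEast"  # hypothetical sample; "\t" is a real tab

digits = re.findall(r"\d", sample)           # ['4', '2', '3']
non_space_runs = re.findall(r"\S+", sample)  # ['Room', '42,', 'floor_3', 'East']
words = re.findall(r"\w+", sample)           # ['Room', '42', 'floor_3', 'East']

# A special sequence used inside a character class: digits or underscores
digits_or_underscores = re.findall(r"[\d_]+", sample)  # ['42', '_3']
```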
White space characters
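\n
new line
\s
any whitespace character, including a plain space
\t
tab
\e
escape (recognised by Perl-style engines, but not by Python’s re module, where \x1b can be used instead)
\f
form feed
\r
carriage return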
Use ‘\’ to escape special characters
. ^ $ * + ? [ ] ( ) { } | \
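To see why escaping matters, here is a small sketch with made-up strings. re.escape() is a standard helper in the re module that escapes every special character in a piece of literal text for us.

```python
import re

price = "Total: $3.50 (approx.)"  # hypothetical sample

# Unescaped, '.' matches any character, so '3x50' is found as well
loose = re.findall(r"3.50", "3x50 3.50")     # ['3x50', '3.50']
# Escaped, '\.' matches only a literal dot
strict = re.findall(r"3\.50", "3x50 3.50")   # ['3.50']

# '$' and '.' both need a backslash to be matched literally
amounts = re.findall(r"\$\d+\.\d+", price)   # ['$3.50']

# re.escape builds the escaped pattern from literal text
literal = re.escape("(approx.)")
found = re.findall(literal, price)           # ['(approx.)']
```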
Examples
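The re module provides regular expression matching operations similar to those found in Perl. Regular expressions use the backslash character (‘\’) to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s use of the same character for the same purpose in string literals. The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with ‘r’. So r"\n" is a two-character string containing ‘\’ and ‘n’, while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.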
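The difference between the two notations can be checked directly. In this sketch the Windows-style path is a made-up sample.

```python
import re

raw = r"\n"     # two characters: a backslash and 'n'
cooked = "\n"   # one character: a newline

raw_length = len(raw)        # 2
cooked_length = len(cooked)  # 1

# Matching one literal backslash: the raw pattern needs two backslashes,
# the ordinary string literal needs four
path = r"C:\temp"  # hypothetical sample containing a single backslash
with_raw = re.findall(r"\\", path)
without_raw = re.findall("\\\\", path)
```

Both patterns find the single backslash in path, but the raw version is easier to read.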
Match digit
\d matches a decimal digit, and {1,3} means the digit is repeated 1 to 3 times. In the string “value”, we find 1, 10 and 100.
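A minimal sketch of this match. The original contents of value are not shown, so the string below is an assumption chosen to contain 1, 10 and 100.

```python
import re

value = "1 apple, 10 pears and 100 plums"  # assumed sample text

numbers = re.findall(r"\d{1,3}", value)    # ['1', '10', '100']

# A longer run of digits is split, because each match takes at most 3 digits
split_run = re.findall(r"\d{1,3}", "1000") # ['100', '0']
```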
Match word
\w+ matches 1 or more word characters (letters, digits and underscores), so we find all the words and skip spaces and punctuation.
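For instance, with a made-up sentence (not the original input), the match could look like this:

```python
import re

sentence = "Hello, world! It's 2024."  # assumed sample text

words = re.findall(r"\w+", sentence)
# ['Hello', 'world', 'It', 's', '2024']
```

Note that the apostrophe is not a word character, so “It's” is split into two matches.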
Match 0 or 1 or more
\d* matches 0 or more digits, so we find a ‘1’, and the other matches are empty strings ('').
\d+ matches at least 1 digit, so we find only ‘1’.
\d? means 0 or 1 digit, so we get ‘’, ‘1’, ‘’.
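The three quantifiers can be compared on one input. The string "a1" is an assumption, picked because it reproduces the results described above.

```python
import re

text = "a1"  # assumed input: one letter followed by one digit

# An empty match before 'a', then '1', then an empty match at the end
zero_or_more = re.findall(r"\d*", text)  # ['', '1', '']
one_or_more = re.findall(r"\d+", text)   # ['1']
zero_or_one = re.findall(r"\d?", text)   # ['', '1', '']
```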
Match starting and ending
Inside square brackets, ^ negates the set: [^1] means any character except 1. So ^[^1]{2}[^3]$ matches a 3-character string whose first two characters are not 1 and whose last character is not 3.
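A small sketch of this pattern on a few made-up candidates:

```python
import re

pattern = r"^[^1]{2}[^3]$"

# Hypothetical candidate strings
candidates = ["224", "123", "213", "233", "2345"]
results = {text: bool(re.match(pattern, text)) for text in candidates}
# '224'  -> True
# '123'  -> False (first character is 1)
# '213'  -> False (second character is 1)
# '233'  -> False (last character is 3)
# '2345' -> False (four characters)
```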
Match character class
A pattern such as [A-Z]+[a-z]+ matches at least 2 characters: 1 or more upper-case letters followed by 1 or more lower-case letters.
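The original pattern is not shown here, so the sketch below assumes it was [A-Z]+[a-z]+, which fits the description. The sample text is also made up.

```python
import re

pattern = r"[A-Z]+[a-z]+"  # assumed pattern matching the description

text = "Hello WORLD NASAcenter abc X"  # hypothetical sample
matches = re.findall(pattern, text)    # ['Hello', 'NASAcenter']
```

“WORLD” and “X” have no lower-case letters after the capitals, and “abc” has no leading capital, so none of them match.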
Match group
In this example, I want to separate an e-mail address into 3 parts: the first part is the email account (before “@”), the second is the second-level domain (after “@”) and the third is the top-level domain. Since I specify that the first part contains only lower-case letters or digits, in the second example the regular expression doesn’t match the first 3 upper-case letters.
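A sketch of such a grouping could look as follows. The exact pattern and addresses from the original are not shown, so both are assumptions here; re.search() is used so that the match may start after the upper-case prefix.

```python
import re

# Assumed pattern: account, second-level domain, top-level domain
pattern = r"([a-z0-9]+)@([a-z0-9]+)\.([a-z]{2,4})"

# Hypothetical addresses
first = re.search(pattern, "john123@example.com")
second = re.search(pattern, "ABCjohn@example.com")

first_parts = first.groups()    # ('john123', 'example', 'com')
second_parts = second.groups()  # ('john', 'example', 'com')
```

In the second address the match starts at “john”, because the upper-case “ABC” is not allowed by [a-z0-9]+.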
Find in dataframe
Moreover, we can also search a dataframe with a regex pattern. In this example, I created a dataframe which contains 3 columns: “Name”, “Birthday” and “Email”. Now I want to find the e-mail addresses whose account and second-level domain contain only lower-case letters, and whose top-level domain contains 2 or 3 lower-case letters. Among the 4 e-mail addresses, only the second one satisfies our pattern.
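A sketch of this search with pandas. The rows below are invented for illustration (the original data is not shown), and Series.str.match() is one possible way to apply the pattern to a whole column.

```python
import pandas as pd

# Hypothetical data with the three columns described above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dave"],
    "Birthday": ["15/01/1990", "03/11/1985", "27/06/1992", "09/09/1979"],
    "Email": ["Alice01@mail.com", "bob@example.org",
              "carol@site.c", "dave@Example.com"],
})

# Lower-case account and second-level domain, 2 or 3 lower-case letters at the end
pattern = r"^[a-z]+@[a-z]+\.[a-z]{2,3}$"

mask = df["Email"].str.match(pattern)       # boolean Series, one value per row
matching = df.loc[mask, "Email"].tolist()   # ['bob@example.org']
```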
Use regex to replace character in dataframe
E-mail addresses use “@” to separate the account from the domain. Suppose we want to
replace it with “[at]” in the dataframe. We can use re.sub() to do this.
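One way to do this is to apply re.sub() to each value of the column. The dataframe below is a made-up stand-in for the original one.

```python
import re
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"Email": ["bob@example.org", "amy@mail.com"]})

# Replace every '@' with '[at]' in each address
df["Email"] = df["Email"].apply(lambda address: re.sub(r"@", "[at]", address))

replaced = df["Email"].tolist()  # ['bob[at]example.org', 'amy[at]mail.com']
```

pandas also offers Series.str.replace(), which can do the same without an explicit lambda.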
Use regex to modify values in some columns
When we need to change the order of the values within one column, we can first use
re.match() and Match.groups() to separate each value into
multiple groups, then put the groups back together in the new order.
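As a sketch, suppose the “Birthday” column holds dates written as day/month/year and we want year-month-day instead. The date format, the sample values and the helper name to_iso are all assumptions made for this illustration.

```python
import re
import pandas as pd

# Hypothetical data in day/month/year order
df = pd.DataFrame({"Birthday": ["15/01/1990", "03/11/1985"]})

def to_iso(date_text):
    """Reorder 'dd/mm/yyyy' into 'yyyy-mm-dd'; leave other values unchanged."""
    match = re.match(r"(\d{2})/(\d{2})/(\d{4})$", date_text)
    if match is None:
        return date_text
    day, month, year = match.groups()  # one string per parenthesised group
    return f"{year}-{month}-{day}"

df["Birthday"] = df["Birthday"].apply(to_iso)
reordered = df["Birthday"].tolist()  # ['1990-01-15', '1985-11-03']
```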
There is much more to regular expressions in Python; we can find most of it here.
Reference
- Python 3 Programming Tutorial - Regular Expressions / Regex with re
- stackoverflow - applying regex to a pandas dataframe
- Wes McKinney. 2017. “Chapter 7: Data Cleaning and Preparation.” Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, pp. 213-216