Introduction to REGEX

REGEX IN PYTHON

 

Regular expressions (regex) are the go-to tool for finding and parsing patterns in text, and they are quite powerful. They are crucial to data science because they let us analyze and measure text-based datasets. To work with regular expressions in Python, we import the “re” module. It ships with Python’s standard library, so there is no need to run pip install. Once we import “re”, we need a string to search and a pattern to search for. It is best to write patterns as raw strings (regular string: "abcd", raw string: r"abcd"; notice the lowercase r in front of the pattern), because a raw string leaves backslashes untouched, so sequences like \d reach the regex engine exactly as written. Let us look at a code example.

import re

test_string = "falconbirdabcd123bhavABCD123"
pattern = re.compile(r"123")

You can see we have a string to parse, and then we define a pattern with the re.compile() function (a module-level function, not a method). This creates a pattern object for us, on which we can call match-finding methods.
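To see why the raw string matters, here is a minimal sketch. The sample text and variable names are made up for illustration; the point is only that "\b" means a backspace character in a regular string but a word boundary once it reaches the regex engine through a raw string.

```python
import re

# In a regular string, "\b" is a single backspace character;
# in a raw string, r"\b" is a backslash followed by the letter b.
regular = "\bcat\b"
raw = r"\bcat\b"

print(len(regular))  # 5
print(len(raw))      # 7

# Hypothetical sample text, chosen only for this illustration.
sample_text = "cat concat cat"

# Only the raw version reaches the regex engine as the word-boundary token \b.
print(re.findall(raw, sample_text))      # ['cat', 'cat']
print(re.findall(regular, sample_text))  # []
```

The regular-string pattern silently finds nothing, which is why raw strings are the safer habit for any pattern that contains a backslash.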

Main Methods

Main match methods that we can use on our pattern object:

import re

test_string = "falconbirdabcd123bhavABCD123"
pattern = re.compile(r"123")

# iterator of match objects, each holding its starting and ending index (aka span);
# printing it shows the iterator object itself, not the matches
print(pattern.finditer(test_string))

# list with every instance of the pattern (each as its own element)
print(pattern.findall(test_string))

# searches the whole string and returns the first match (or None)
print(pattern.search(test_string))

# only matches if the pattern is found at the very beginning of the string
print(pattern.match(test_string))

# the entire string must match the pattern
print(pattern.fullmatch(test_string))
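As a rough illustration of what these calls return for the string and pattern above, the sketch below stores each result in a variable. The variable names are arbitrary, and the last line uses a second pattern, "falcon", that is not part of the original example.

```python
import re

test_string = "falconbirdabcd123bhavABCD123"
pattern = re.compile(r"123")

all_matches = pattern.findall(test_string)
first_match = pattern.search(test_string)
at_start = pattern.match(test_string)
whole_string = pattern.fullmatch(test_string)

print(all_matches)         # ['123', '123']
print(first_match.span())  # (14, 17)
print(at_start)            # None, because the string starts with "falcon"
print(whole_string)        # None, because the string is more than just "123"

# A pattern that does describe the start of the string.
start_match = re.match(r"falcon", test_string)
print(start_match.group())  # falcon
```

Because search(), match(), and fullmatch() return None when nothing matches, it is worth checking the result before calling methods such as span() or group() on it.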

finditer() method

The finditer() method gives us the most information: it returns an iterator that yields a match object for each match, and each match object has further methods we can call.

# finditer() returns an iterator of match objects, each with its starting and ending index (aka span)
matches = pattern.finditer(test_string)

for match in matches:  # we iterate through our finditer() object
    print(match.span())  # gives us the start and the end index as a tuple
    print(match.start())  # gives the start index
    print(match.end())  # gives the end index
    print(match.group())  # gives us the matched text as a string
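A small sketch of how the spans can be collected and used, reusing the same string and pattern. The names spans, first_pass, and second_pass are only illustrative.

```python
import re

test_string = "falconbirdabcd123bhavABCD123"
pattern = re.compile(r"123")

# Collect the (start, end) span of every match.
spans = [match.span() for match in pattern.finditer(test_string)]
print(spans)  # [(14, 17), (25, 28)]

# Slicing the original string with a span recovers the matched text.
start, end = spans[0]
print(test_string[start:end])  # 123

# The iterator can only be consumed once.
matches = pattern.finditer(test_string)
first_pass = list(matches)
second_pass = list(matches)
print(len(first_pass), len(second_pass))  # 2 0
```

If the matches are needed more than once, storing them in a list as above avoids calling finditer() again.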

Meta characters

In regular expressions, metacharacters are characters with a special meaning that add functionality to our search patterns. The most popular ones are:

  • . matches any character except a newline
  • ^ matches at the start of the string: “^123”
  • $ matches at the very end of the string: “123$”
  • * zero or more occurrences of the preceding element: “123*” matches “12” followed by any number of 3s
  • + one or more occurrences of the preceding element: “123+”
  • {} exactly the specified number of occurrences of the preceding element: “123{4}” matches “123333”
  • \ escapes a metacharacter so it is matched literally (“\.”), and introduces special sequences such as \d
  • \d matches any decimal digit
  • \D matches any character that is not a decimal digit
  • \s matches whitespace characters (\n, \t, “ “)
  • \S matches non-whitespace characters
re.sub() Method

re.sub() lets us search for a regular expression and substitute each match with something else.

import re

test_string = """This is a string about John Doe or JD for short. JD is a simple man
he likes to eat, sleep, and work. On the weekends JD fishes with his family and afterwards
JD treats his family to ice cream"""

# we pass the regex, then the replacement, and then the string we want to perform sub() on
print(re.sub(r"JD", "John Doe", test_string))

Output:

This is a string about John Doe or John Doe for short. John Doe is a simple man
he likes to eat, sleep, and work. On the weekends John Doe fishes with his family and afterwards
John Doe treats his family to ice cream

We can also use functions with re.sub().

test_string = """the barrel contains 1 apple and 2 pears"""
print(re.sub(r"\d", lambda x: str(int(x.group()) ** 2), test_string))

Here x is the match object that re.sub() passes to our function for each match. We call group() to get the matched text as a string, convert it to an integer, raise it to the power of 2, and convert the result back to a string, which becomes the replacement. This is a bit nuanced, so it's not mandatory that you know it (it's just useful in some circumstances).
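The same idea may be easier to read with a named function instead of a lambda. In this sketch, square_digit is a hypothetical helper name, and the "12 eggs" string is an invented example that shows how \d and \d+ behave differently.

```python
import re

def square_digit(match):
    """Return the square of the matched number as a string."""
    return str(int(match.group()) ** 2)

test_string = "the barrel contains 1 apple and 2 pears"
result = re.sub(r"\d", square_digit, test_string)
print(result)  # the barrel contains 1 apple and 4 pears

# \d matches one digit at a time, so "12" is treated as "1" and "2".
per_digit = re.sub(r"\d", square_digit, "12 eggs")
# \d+ matches the whole run of digits, so "12" is squared as one number.
per_number = re.sub(r"\d+", square_digit, "12 eggs")
print(per_digit)   # 14 eggs
print(per_number)  # 144 eggs
```

The replacement function must return a string, which is why the result is wrapped in str().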

re.split() is another useful function. You may already be familiar with the built-in str.split() method, which comes standard in Python; we often use it to split text into a list by telling Python what the split point should be. However, re.split() is much more powerful than the vanilla .split() method because we can pass regular expressions in.

test_string = """i had fun. i ate food. i went to bed"""
# splits the string at each period and removes the split point (period)
print(re.split(r"\.", test_string))

# splits the string at each period and keeps the split point (period) as its own element
print(re.split(r"(\.)", test_string))
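For reference, here is a sketch of what those two calls return, plus a third pattern that is not in the original example: r"\.\s*" also swallows the whitespace after each period. The variable names are only illustrative.

```python
import re

test_string = "i had fun. i ate food. i went to bed"

without_periods = re.split(r"\.", test_string)
with_periods = re.split(r"(\.)", test_string)
# Assumed variation: split on a period plus any whitespace that follows it.
trimmed = re.split(r"\.\s*", test_string)

print(without_periods)  # ['i had fun', ' i ate food', ' i went to bed']
print(with_periods)     # ['i had fun', '.', ' i ate food', '.', ' i went to bed']
print(trimmed)          # ['i had fun', 'i ate food', 'i went to bed']
```

Wrapping the pattern in parentheses makes it a capturing group, and re.split() includes the text of capturing groups in the resulting list.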

Regex in Pandas

Let’s build a quick data frame.

import pandas as pd

our_data_frame = pd.DataFrame({"time": ["200", "2009", "11234"],
                               "user": ["dan", "amy", "bob"],
                               "video": ["intro.html", "intro.html", "intro.html"]})

Let’s check out the output.

    time user       video
0    200  dan  intro.html
1   2009  amy  intro.html
2  11234  bob  intro.html

Now we want to replace “intro.html” so that it’s just “intro” instead. This is called “cleaning” the data.

print(our_data_frame.replace(to_replace="[.]html$", value="", regex=True))

Let’s check the new output.

    time user  video
0    200  dan  intro
1   2009  amy  intro
2  11234  bob  intro
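DataFrame.replace() applies the pattern to every column. When only one column should be cleaned, one option is the Series.str.replace() method on that column. The sketch below is an alternative, not the approach used above; cleaned is just an illustrative variable name, and the pattern r"\.html$" is equivalent to "[.]html$".

```python
import pandas as pd

our_data_frame = pd.DataFrame({"time": ["200", "2009", "11234"],
                               "user": ["dan", "amy", "bob"],
                               "video": ["intro.html", "intro.html", "intro.html"]})

# Work on a copy so the original data frame stays unchanged.
cleaned = our_data_frame.copy()

# Remove a trailing ".html" from the video column only.
cleaned["video"] = cleaned["video"].str.replace(r"\.html$", "", regex=True)

print(cleaned["video"].tolist())         # ['intro', 'intro', 'intro']
print(our_data_frame["video"].tolist())  # ['intro.html', 'intro.html', 'intro.html']
```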

For our second regex exercise, we will convert game character strengths into numerical values. Why would we do this? In many cases, such as AI or machine learning, we need to provide numerical values so that they can be used in calculations. Even in a video game, a character’s strength is better stored as a number than as a string literal. So let’s create a data frame that represents each character by ID along with their “strength” as a string literal.

game_characters = {"id": [1, 2, 3, 4, 5, 6],
                   "strength": ["weak", "average", "strong", "strong", "average", "average"]}

our_data_frame = pd.DataFrame(game_characters)
print(our_data_frame)

Output:

   id strength
0   1     weak
1   2  average
2   3   strong
3   4   strong
4   5  average
5   6  average

Now we must convert each “strength” to a numerical value.

our_data_frame = our_data_frame.replace(to_replace="weak", value="1", regex=True)
our_data_frame = our_data_frame.replace(to_replace="average", value="2", regex=True)
our_data_frame = our_data_frame.replace(to_replace="strong", value="3", regex=True)
print(our_data_frame)

Now let’s take a look at our new data frame below:

   id strength
0   1        1
1   2        2
2   3        3
3   4        3
4   5        2
5   6        2
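One detail worth noting: because the replacement values are the strings "1", "2", and "3", the strength column still holds text at this point. If true integers are needed for calculations, one possible follow-up is a cast with astype(int). The sketch below applies the same replacements to the strength column only and then casts it; the intermediate variable strength is an illustrative name, and the cast is an added step that is not part of the code above.

```python
import pandas as pd

game_characters = {"id": [1, 2, 3, 4, 5, 6],
                   "strength": ["weak", "average", "strong", "strong", "average", "average"]}
our_data_frame = pd.DataFrame(game_characters)

# Same regex replacements as above, applied to the strength column only.
strength = our_data_frame["strength"]
strength = strength.replace(to_replace="weak", value="1", regex=True)
strength = strength.replace(to_replace="average", value="2", regex=True)
strength = strength.replace(to_replace="strong", value="3", regex=True)

# Cast the digit strings to integers so they can be used in arithmetic.
our_data_frame["strength"] = strength.astype(int)

print(our_data_frame["strength"].tolist())  # [1, 2, 3, 3, 2, 2]
print(our_data_frame["strength"].sum())     # 13
```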

Regular expressions are very powerful and something we should be familiar with as developers or data scientists. The best way to learn them is by practicing with real exercises and projects.
