The Use of Regular Expressions in Python
Regular expressions (regex) are a powerful tool for matching patterns in text. They can be used in Python to search, replace, and manipulate strings. In this blog post, we will explore the basics of regular expressions in Python, and provide some examples of how to use them.
USING REGULAR EXPRESSIONS IN PYTHON
In Python, the re module provides support for regular expressions. The basic steps for using regular expressions in Python are:
- Import the re module
- Define the pattern you want to match using regular expression syntax
- Use the re module to search, replace, or manipulate strings based on the pattern
Here is an example of how to use regular expressions to search for a pattern in a string:
    import re

    text = "My Wi-Fi is awesome!"
    pattern = r"Wi-Fi"
    matches = re.findall(pattern, text)
    print(matches)
In this example, we import the re module, define the pattern we want to match as “Wi-Fi”, and then use the re.findall() function to find all occurrences of the pattern in the text. Because re.findall() returns a list of every match, the output of this code is ['Wi-Fi'].
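The re module also covers the "search" and "replace" steps listed earlier. Here is a minimal sketch using re.search() and re.sub() on the same text; the replacement string "wireless network" is made up for illustration:

```python
import re

text = "My Wi-Fi is awesome!"
pattern = r"Wi-Fi"

# re.search() returns a match object for the first occurrence, or None
first = re.search(pattern, text)
if first:
    print(first.group(), first.start())

# re.sub() returns a new string with every occurrence replaced
# ("wireless network" is an arbitrary replacement chosen for this sketch)
replaced = re.sub(pattern, "wireless network", text)
print(replaced)
```

Unlike re.findall(), re.search() stops at the first match and gives access to its position in the string.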
REGULAR EXPRESSION SYNTAX
Regular expressions use a syntax of special characters to represent patterns. Here are some of the most commonly used characters:
- . : Matches any character except newline
- ^ : Matches the start of a string
- $ : Matches the end of a string
- * : Matches zero or more occurrences of the preceding character
- + : Matches one or more occurrences of the preceding character
- ? : Matches zero or one occurrence of the preceding character
- {} : Matches the specified number of occurrences of the preceding character
- [] : Matches any character within the brackets
- () : Groups characters together
Here is an example of how to use some of these characters to create a pattern:
    import re

    text = "My Wi-Fi is awesome!"
    pattern = r"W.+i"
    matches = re.findall(pattern, text)
    print(matches)
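In this example, we define the pattern as “W.+i”, which means a “W”, followed by one or more of any character, followed by an “i”. Because + is greedy, it matches as much text as it can, so the match runs up to the last “i” in the string (the one in “is”). The output of this code is therefore ['Wi-Fi i'], not ['Wi-Fi'].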
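One detail worth checking by hand is how far .+ reaches. The sketch below compares the pattern above with its non-greedy variant W.+?i (the ? after + is standard re syntax that makes the quantifier match as little as possible):

```python
import re

text = "My Wi-Fi is awesome!"

# Greedy: .+ grabs as much as possible, then backs off to the last "i"
greedy = re.findall(r"W.+i", text)

# Non-greedy: .+? stops at the first "i" that completes the match
lazy = re.findall(r"W.+?i", text)

print(greedy)  # ['Wi-Fi i']
print(lazy)    # ['Wi-Fi']
```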
MORE ON REGULAR EXPRESSION SYNTAX
To dive deeper into the syntax of regular expressions, let’s take a closer look at some of the characters we introduced earlier.
The . character matches any single character except a newline. For example, the pattern . on its own matches each character in a string one at a time, and the pattern W. matches a “W” followed by any one character.
The ^ character matches the start of a string. This means that it can be used to match a pattern only if it appears at the start of a string. For example, the pattern ^My will match the string “My Wi-Fi is awesome!” because it starts with the word “My”.
The $ character matches the end of a string. This means that it can be used to match a pattern only if it appears at the end of a string. For example, the pattern awesome!$ will match the string “My Wi-Fi is awesome!” because it ends with “awesome!”.
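As a quick illustration of the two anchors, here is a small sketch using re.search(), which returns None when the pattern is not found. The variable names are only for this example:

```python
import re

text = "My Wi-Fi is awesome!"

# ^ anchors the pattern to the start of the string
starts_with_my = re.search(r"^My", text) is not None

# $ anchors the pattern to the end of the string
ends_with_awesome = re.search(r"awesome!$", text) is not None

# "Wi-Fi" is in the text, but not at the start, so this one fails
starts_with_wifi = re.search(r"^Wi-Fi", text) is not None

print(starts_with_my, ends_with_awesome, starts_with_wifi)  # True True False
```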
The * character matches zero or more occurrences of the preceding character. This means that it can be used for a part of the pattern that may be absent or repeated. For example, the pattern o* matches “”, “o”, “oo”, and so on. Because zero occurrences count, it can match the empty string, so on its own it succeeds against any text.
The + character matches one or more occurrences of the preceding character. This means that it can be used for a part of the pattern that must appear at least once. For example, the pattern o+ matches “o”, “oo”, and so on, but never the empty string, so it is only found in strings that contain at least one “o”.
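To see the difference between the two quantifiers, here is a small sketch with made-up sample words. It uses re.fullmatch(), which only succeeds when the whole string fits the pattern:

```python
import re

# Hypothetical sample words, chosen only for this illustration
words = ["bk", "bok", "book"]

# o* allows zero "o"s, so "bk" is accepted
star = [w for w in words if re.fullmatch(r"bo*k", w)]

# o+ requires at least one "o", so "bk" is rejected
plus = [w for w in words if re.fullmatch(r"bo+k", w)]

print(star)  # ['bk', 'bok', 'book']
print(plus)  # ['bok', 'book']
```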
The ? character matches zero or one occurrence of the preceding character. This means that it can be used for a part of the pattern that is optional: it may be absent, but if it does appear, it appears only once. For example, in the pattern the? only the “e” is optional, so the pattern matches the strings “th” and “the”. In “thee” it matches just the first three characters.
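A short check of what the? does and does not accept, again using re.fullmatch() so that the whole string has to fit. The word list is made up for the example:

```python
import re

words = ["th", "the", "thee"]

# Only the "e" is optional, so "th" and "the" fit the pattern entirely
optional_e = [w for w in words if re.fullmatch(r"the?", w)]
print(optional_e)  # ['th', 'the']

# Searching inside "thee" finds "the" and leaves the second "e" unmatched
partial = re.findall(r"the?", "thee")
print(partial)  # ['the']
```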
The {} quantifier matches the specified number of occurrences of the preceding character. This means that it can be used for a part of the pattern that must repeat a specific number of times. For example, the pattern o{2} matches two consecutive occurrences of the letter “o”, as in “book”. A range such as o{2,3} matches two or three in a row.
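Here is a small sketch of an exact count versus a range, using a made-up sample string:

```python
import re

# Hypothetical sample text for this illustration
text = "book, boook, bk"

# Exactly two "o"s between "b" and "k"
exactly_two = re.findall(r"bo{2}k", text)

# Two or three "o"s between "b" and "k"
two_to_three = re.findall(r"bo{2,3}k", text)

print(exactly_two)   # ['book']
print(two_to_three)  # ['book', 'boook']
```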
The [] construct matches any single character listed within the brackets. It can also hold a range of characters, such as [a-z] or [0-9]. For example, the pattern [aeiou] matches any one lowercase vowel.
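Applied to the sample sentence used earlier in this post, a character set and a character range behave like this (a quick sketch):

```python
import re

text = "My Wi-Fi is awesome!"

# Each match is a single character taken from the set
vowels = re.findall(r"[aeiou]", text)

# A range inside the brackets: any one uppercase letter
capitals = re.findall(r"[A-Z]", text)

print(vowels)    # ['i', 'i', 'i', 'a', 'e', 'o', 'e']
print(capitals)  # ['M', 'W', 'F']
```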
The () construct groups characters together. This means that it can be used to treat part of a pattern as a single unit, which is useful with the | character (“or”) to match one of several alternatives. For example, the pattern (My|is) matches either “My” or “is”. A group also captures the text it matched, so that it can be retrieved separately afterwards.
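The sketch below shows both uses of a group: alternation with |, and capturing part of a match. The second pattern, Wi-(Fi|Max), is made up for the example:

```python
import re

text = "My Wi-Fi is awesome!"

# With a single group, re.findall() returns what the group captured
words = re.findall(r"(My|is)", text)
print(words)  # ['My', 'is']

# The group limits the alternation to the part after "Wi-"
match = re.search(r"Wi-(Fi|Max)", text)
whole = match.group(0)   # the full match
suffix = match.group(1)  # only the text captured by the group
print(whole, suffix)  # Wi-Fi Fi
```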
REGULAR EXPRESSION PATTERN CREATION
My favourite tool to create regular expression patterns is https://regexr.com/.
The website allows you to create regular expressions using a variety of syntax options and provides a live preview of the matches as you build your expression. You can also test your regular expressions against sample text and view the results in real-time. The interface is intuitive and easy to use, with options to customize the expression’s flags and input settings.
Additionally, regexr.com provides a handy reference guide to the regular expression syntax and various operators you can use to create complex patterns. There are also community-contributed expressions and helpful articles on best practices and common use cases for regular expressions.
EXAMPLE
In this example, we are going to try to build a regular expression that will help us to parse AP names and retrieve two pieces of information:
- Site Name
- AP Number
Here is the list of access point names we are dealing with:
- HQ-AP001
- HQ-AP101
- HQ-AP201
- B1-AP01
- B2-AP21
In order to be able to extract the site name and AP number, we will use groups (using parentheses). Here is what the regular expression looks like:
([A-Z0-9]+)-AP(\d{2,3})
Here is what the Python code looks like when we use this pattern:
    import re

    # Define the regular expression pattern
    pattern = r"([A-Z0-9]+)-AP(\d{2,3})"

    # Define the list of access point names to parse
    ap_names = ["HQ-AP001", "HQ-AP101", "HQ-AP201", "B1-AP01", "B2-AP21"]

    # Loop through each access point name and extract the site name and AP number using the regexp
    for ap_name in ap_names:
        matches = re.findall(pattern, ap_name)

        # Extract the site name and AP number from the matches
        site_name = matches[0][0]
        ap_number = matches[0][1]

        # Print the results
        print(f"Access point name: {ap_name}")
        print(f"Site name: {site_name}")
        print(f"AP number: {ap_number}")
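This script first defines the regular expression pattern as ([A-Z0-9]+)-AP(\d{2,3}) to match access point names in the format “SITE-APNUMBER”. The first group captures the site name (one or more uppercase letters or digits) and the second group captures the two- or three-digit AP number.
It then defines a list of access point names to parse, and loops through each name to extract the site name and AP number using the re.findall() function. Because the pattern contains two groups, re.findall() returns a list of tuples, so matches[0][0] is the site name and matches[0][1] is the AP number. For each access point name, the script prints three lines: “Access point name: <AP_NAME>”, “Site name: <SITE_NAME>”, and “AP number: <AP_NUMBER>”.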
When run, this script will output the following results:
    Access point name: HQ-AP001
    Site name: HQ
    AP number: 001
    Access point name: HQ-AP101
    Site name: HQ
    AP number: 101
    Access point name: HQ-AP201
    Site name: HQ
    AP number: 201
    Access point name: B1-AP01
    Site name: B1
    AP number: 01
    Access point name: B2-AP21
    Site name: B2
    AP number: 21
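The script above assumes every name matches the pattern. If one does not, matches is an empty list and matches[0] raises an IndexError. As a possible variation (not the only way to do it), the sketch below uses re.fullmatch() with named groups and returns None for names that do not fit. The names AP_NAME_RE and parse_ap_name, and the sample name "lobby-ap1", are hypothetical and exist only for this example:

```python
import re

# Same pattern as above, with named groups (?P<name>...) for readability
AP_NAME_RE = re.compile(r"(?P<site>[A-Z0-9]+)-AP(?P<number>\d{2,3})")


def parse_ap_name(ap_name):
    """Return (site_name, ap_number), or None if the name does not fit."""
    # fullmatch() requires the entire name to match the pattern
    match = AP_NAME_RE.fullmatch(ap_name)
    if match is None:
        return None
    return match.group("site"), match.group("number")


print(parse_ap_name("HQ-AP001"))   # ('HQ', '001')
print(parse_ap_name("lobby-ap1"))  # None
```

Returning None leaves it to the calling code to decide whether to skip, log, or reject a malformed name.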
CONCLUSION
Regular expressions are a powerful tool for working with strings in Python. By using regular expression syntax, you can search, replace, and manipulate strings based on patterns. With practice, you can become proficient at using regular expressions to solve a wide range of string manipulation tasks. They are quite scary at first, but trust me, they can become very useful!