Tutorial 1: Regular Expressions in Python - Finding a word in a given text
WHAT:
Regex stands for ‘Regular Expression’. It is a pattern that describes a substring to be found in a given string. In layman’s terms, it is a generalized description of a piece of text (a word, a number, any other character, or a combination of these) to be searched for in a given text.
WHY:
Let me try to explain it using an example.
Suppose you are tasked with finding all the employee IDs present in a given text. You know that every employee ID starts with two uppercase letters followed by five digits. Using this knowledge, your snippet of Python code will look like:
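def employee_id(text):
    employee_id_list = []
    for x in range(len(text)):
        letters, digits = text[x:x+2], text[x+2:x+7]
        # isalpha() rules out slices like 'A1'; len() rules out short slices at the end of the text
        if letters.isalpha() and letters.isupper() and len(digits) == 5 and digits.isdigit():
            employee_id_list.append(text[x:x+7])
    return employee_id_list
print(employee_id('Employees with IDs AC23455 and HB45968 are to be promoted.'))
output:
['AC23455', 'HB45968']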
Corresponding code using regex:
import re
text = 'Employees with ids AC23455 and HB45968 are to be promoted.'
print(re.findall(r'[A-Z]{2}\d{5}', text))
output:
['AC23455', 'HB45968']
Some of you might say that this can be done more neatly with a list comprehension, and there is no denying that. But even then, you still need ‘for’ and ‘if’ along with the isupper() and isdigit() methods. Keeping that in mind, I want you to appreciate the simplicity and comfort that regex provides with just one line:
r'[A-Z]{2}\d{5}'
At first, it may look like cryptic code. But bear with me for a while, and gradually everything will make sense.
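As a small preview, the pattern can be read piece by piece. The sketch below is only one way to annotate it: it uses the re.VERBOSE flag so that comments can sit inside the pattern, and the names employee_id_pattern and sample_text are made up for this illustration.

```python
import re

# The same employee-ID pattern, spread over several lines with comments.
# re.VERBOSE makes the regex engine ignore whitespace and '#' comments
# written inside the pattern.
employee_id_pattern = re.compile(r'''
    [A-Z]{2}   # exactly two uppercase letters from A to Z
    \d{5}      # exactly five digits
''', re.VERBOSE)

sample_text = 'Employees with ids AC23455 and HB45968 are to be promoted.'
matches = employee_id_pattern.findall(sample_text)
print(matches)  # ['AC23455', 'HB45968']
```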
HOW:
Let’s learn how to use it. We will start with a simple example: finding all instances of a word or substring in a given string.
text = ''' Many people still conflate Google with the internet. They don't know that Google is actually a search engine like Bing, Baidu, Yahoo. However aforementioned fact surely reflects the prevalent use of Google. '''
import re
regex_object = re.compile(r'Google')
print(regex_object.findall(text))
output:
['Google', 'Google', 'Google']
It’s time to decipher the code.
text
It is the string variable holding the text in which we want to find every occurrence of the substring ‘Google’.
import re
‘re’ is a module. Simply put, a module is a Python file containing functions, classes and variables. ‘re’ contains the functions we need for working with regex. It ships with Python’s standard library, but unlike the built-in functions it is not loaded automatically, so we have to import it whenever we want to use regex.
regex_object = re.compile(r'Google')
re.compile() compiles a regular expression pattern into a regular expression object, whose methods can then be used for matching. Here r'Google' is the regex pattern and ‘Google’ is the substring to be searched for in the given string. The ‘r’ prefix marks a raw string, that is, a string in which backslashes are kept as literal characters instead of starting escape sequences.
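To make the two ideas above concrete, here is a minimal sketch. It assumes a made-up list called headlines, which is not part of the original example, and shows a compiled regex object being reused on several strings, followed by a quick check of what the ‘r’ prefix does to a backslash.

```python
import re

# Compile once, then reuse the same regex object on several strings.
regex_object = re.compile(r'Google')

# 'headlines' is a hypothetical list used only for this illustration.
headlines = [
    'Google is a search engine.',
    'Bing and Baidu are search engines too.',
]
results = [regex_object.findall(line) for line in headlines]
print(results)  # [['Google'], []]

# In a raw string the backslash is kept as an ordinary character,
# so r'\n' has two characters while '\n' is a single newline character.
raw_length = len(r'\n')
normal_length = len('\n')
print(raw_length, normal_length)  # 2 1
```

For a plain word like ‘Google’ the ‘r’ prefix makes no difference, but it becomes important as soon as a pattern contains backslashes such as \d.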
findall() method
This is a method that returns a list of all the non-overlapping matches of the pattern in the given string. An empty list means no matches were found.
A Few Examples:
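Both outcomes of findall() can be seen side by side in the short sketch below. The sample sentence and the names found and missing are invented for this illustration.

```python
import re

text = 'Google, Bing, Baidu and Yahoo are search engines. Many people use Google.'

found = re.compile(r'Google').findall(text)        # pattern occurs twice
missing = re.compile(r'DuckDuckGo').findall(text)  # pattern never occurs

print(found)    # ['Google', 'Google']
print(missing)  # []

# An empty list is falsy, so it can be used directly in a condition.
if not missing:
    print('No match for DuckDuckGo')
```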
text = " Is nature a creation of God or God itself ? "
import re
regex_code = re.compile(r'god')
print(regex_code.findall(text))
output:
[]
Here an empty list is returned because we passed ‘god’ as the regex pattern, which is not the same as ‘God’. When an exact substring is used as a regex pattern, note that matching is case-sensitive.
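If a case-insensitive search is what you actually want, one option is the re.IGNORECASE flag from the ‘re’ module. The sketch below reuses the same sentence; the variable names case_sensitive and case_insensitive are chosen only for this illustration.

```python
import re

text = " Is nature a creation of God or God itself ? "

# Default behaviour: 'god' does not match 'God'.
case_sensitive = re.compile(r'god').findall(text)

# With re.IGNORECASE, letters match regardless of case.
# The matches are returned exactly as they appear in the text.
case_insensitive = re.compile(r'god', re.IGNORECASE).findall(text)

print(case_sensitive)    # []
print(case_insensitive)  # ['God', 'God']
```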
text = " P2P stands for peer-to-peer network. In a peer-to-peer network, peers are computer systems connected to each other via internet connection. "
import re
regex_code = re.compile(r'P2P')
print(regex_code.findall(text))
output:
['P2P']
I hope you now have a basic idea of regex. We will dig deeper in the upcoming tutorials. Stay tuned.
Take a deep breath and move on to the next tutorial.
Happy learning!!!