Examples of Regular Expressions in Python

by Alex

Regular expressions, also called regex, are a syntax or, rather, a small language for finding, extracting, and manipulating specific text patterns within a larger text. They are widely used in projects that involve text validation, NLP (Natural Language Processing), and intelligent text processing.

Introduction to Regular Expressions

Regular expressions, also called regex, are available in almost all programming languages. In Python they are implemented in the standard re module. They are widely used in natural language processing, in web applications that validate text input (e.g., email addresses), and in almost any data analysis project that involves intelligent text processing. This article is divided into two parts. Before getting into regular expression syntax, you should first understand how the re module works. So, first you will get acquainted with the five basic functions of the re module, and then you will see how to build regular expressions in Python. You'll learn how to construct almost any text pattern you're likely to need when working on text extraction projects.
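As a quick preview, here is a minimal sketch (using a made-up sample string) of some of the re functions covered below: re.compile, re.split, re.findall, re.search and re.sub.

>>> import re
>>> sample = 'order 66, code RED'  # made-up sample string, for illustration only
>>> regex = re.compile(r'\d+')  # compile a pattern once so it can be reused
>>> re.split(r'\s+', sample)  # split on whitespace
['order', '66,', 'code', 'RED']
>>> regex.findall(sample)  # find every match
['66']
>>> regex.search(sample).group()  # find the first match
'66'
>>> re.sub(r'\d+', 'N', sample)  # replace matches
'order N, code RED'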

What is a regular expression pattern and how do I compile it?

A regular expression pattern is a special language used to represent generic text, numbers or symbols so that the pieces of text matching that pattern can be extracted. A basic example is \s+. Here \s matches any whitespace character. Adding the + quantifier at the end means the pattern matches one or more whitespace characters, so it will also match tab characters (\t). At the end of this article you'll find a larger list of regular expression patterns, but before we get to that, let's see how to compile and work with regular expressions.

>>> import re
>>> regex = re.compile('\s+')

The above code imports the re module and compiles a regular expression pattern that matches one or more whitespace characters.

How do I split a string using a regular expression?

Consider the following text fragment.

>>> text = """100 INF Informatics
213 MAT Mathematics  
156 ANG English"""

I have three courses in the format “[Course Number] [Course Code] [Course Title]”, and the spacing between the words varies. The task is to split these three course entries into their individual parts, numbers and words. How do I do this? There are two ways to split them:

  • Using the re.split method.
  • Calling the split method for the regex object.
# Split the text by 1 or more spaces  
>>> re.split('\s+', text) 
# or
>>> regex.split(text)  
['100', 'INF', 'Informatics', '213', 'MAT', 'Mathematics', '156', 'ANG', 'English']

Both of these methods work. But which one should you use in practice? If you intend to use a particular pattern more than once, you are better off compiling a regular expression object rather than calling re.split repeatedly.
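For instance, the same compiled object can be reused across several strings and across different methods; a minimal sketch with made-up strings:

>>> regex = re.compile(r'\s+')  # compiled once
>>> regex.split('100 INF   Informatics')  # reused for splitting
['100', 'INF', 'Informatics']
>>> regex.sub(' ', '213  MAT   Mathematics')  # and reused for substitution
'213 MAT Mathematics'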

Finding matches using findall, search and match

Suppose you want to extract all the course numbers, i.e. 100, 213, and 156, from the text above. How do you do that?

What does re.findall() do?

# find all the numbers in the text
>>> print(text)  
100 INF Informatics
213 MAT Mathematics  
156 ANG English
>>> regex_num = re.compile('\d+') 
>>> regex_num.findall(text) 
['100', '213', '156']

In the code above, the special sequence \d is a regular expression that matches any digit. You will learn more about such patterns later in this article. Adding the + quantifier means there must be at least one digit. Similar to +, there is a * quantifier, which requires zero or more digits, so the presence of a digit is not necessary for a match. More on this later. Finally, the findall method extracts all occurrences of one or more digits from the text and returns them as a list.
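To see the difference between + and *, compare how they behave when digits are missing; a small illustrative sketch:

>>> re.findall(r'\d+', 'INF Informatics')  # + needs at least one digit, so nothing matches
[]
>>> re.findall(r'A\d*', 'A1 A ANG')  # * makes the digits optional, so a bare 'A' still matches
['A1', 'A', 'A']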

re.search() vs. re.match()

As the name implies, regex.search() searches for the pattern in a given text. But unlike findall, which returns the matched pieces of text as a list, regex.search() returns a match object, which contains the start and end indices of the first occurrence of the pattern. Similarly, regex.match() also returns a match object, but the difference is that it requires the pattern to be present at the very beginning of the text.

>>> # create a variable with the text
>>> text2 = """INF Informatics
213 MAT Mathematics 156"""
>>> # compile the regex and find the pattern
>>> regex_num = re.compile('\d+') 
>>> s = regex_num.search(text2) 
>>> print('First index:', s.start()) 
>>> print('Last index:', s.end()) 
>>> print(text2[s.start():s.end()]) 

First index: 16
Last index: 19
213

Alternatively, you can get the same result using the group() method for the match object.

>>> print(s.group()) 
213
>>> m = regex_num.match(text2) 
>>> print(m)  
None
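For comparison, match() does return a match object when the pattern sits at the very beginning of the string; the original text variable from above starts with a course number, so:

>>> m = regex_num.match(text)  # text starts with '100', so match() succeeds
>>> print(m.group())
100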

How do I replace one text with another using regular expressions?

To change the text, use regex.sub(). Consider the following modified version of the course text. Here we’ve added a tab after each course code.

>>> # create a variable with the text
>>> text = """100 INF \t Informatics
213 MAT \t Mathematics
156 ANG \t English"""
>>> print(text)
100 INF 	 Informatics
213 MAT 	 Mathematics
156 ANG 	 English

From the text above, I want to remove all the extra whitespace and put all the words on one line. To do this, just use regex.sub to replace the pattern \s+ with a single space ' '.

# replace one or more spaces with a single space
>>> regex = re.compile('\s+') 
>>> print(regex.sub(' ', text)) 

or

>>> print(re.sub('\s+', ' ', text)) 

100 INF Informatics 213 MAT Mathematics 156 ANG English

Suppose you want to get rid of the extra whitespace but keep each course entry on its own line. To do that, use a regular expression that skips the newline character but matches all other whitespace. You can do this with a negative lookahead, (?!\n). The pattern checks for the newline character, which in Python is \n, and skips it.

# remove all whitespace except the newline character  
>>> regex = re.compile('((?!\n)\s+)') 
>>> print(regex.sub(' ', text)) 
100 INF Informatics
213 MAT Mathematics  
156 ANG English

Regular Expression Groups

Regular expression groups are a feature that lets you extract the desired parts of a match as separate items. Suppose I want to extract the course number, code, and name as separate items. Without groups, I would have to write something like this.

>>> text = """100 INF Informatics
213 MAT Math  
156 ANG English""" 
# retrieve all course numbers  
 >>> re.findall('[0-9]+', text) 
# extract all course codes (for Latin [A-Z])
>>> re.findall('[A-YO]{3}', text) 
# extract all course names
>>> > re.findall('[a-ya-Ya-Yo]{4,}', text) 
['100', '213', '156'] 
['INF', 'MAT', 'ANG'] 
['computer science', 'math', 'English']

Let’s see what happens. I compiled three separate regular expressions, one at a time, to match the course number, code, and name. For the course number, the pattern [0-9]+ matches any digit from 0 to 9, and adding + at the end requires at least one such digit. If you are sure the course number will always have exactly three digits, the pattern could be [0-9]{3}. For the course code, as you might have guessed, [A-Z]{3} matches three consecutive uppercase letters. For the course names, [a-zA-Z]{4,} matches upper- and lowercase letters, assuming every course name has at least four characters. Can you guess what the pattern would be if the course name were also limited to, say, 20 characters? So far I have had to run three separate patterns to pull the courses apart. But there is a better way: regular expression groups. Since all the records follow the same template, you can build a single pattern for the whole course record and wrap each piece of data you want to extract in a pair of parentheses ().

# create a group pattern for the course entries and extract the parts
>>> course_pattern = '([0-9]+)\s*([A-Z]{3})\s*([a-zA-Z]{4,})' 
>>> re.findall(course_pattern, text)  
[('100', 'INF', 'Informatics'), ('213', 'MAT', 'Mathematics'), ('156', 'ANG', 'English')]

Note that the course number pattern [0-9]+, the code pattern [A-Z]{3} and the title pattern [a-zA-Z]{4,} are each placed in parentheses () to form a group.
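As a side note, the same groups can also be given names using the (?P<name>...) syntax, which makes the results self-describing. A minimal sketch built on the pattern above (the group names number, code and title are just illustrative labels):

>>> named_pattern = r'(?P<number>[0-9]+)\s*(?P<code>[A-Z]{3})\s*(?P<title>[a-zA-Z]{4,})'
>>> m = re.search(named_pattern, text)
>>> m.group('number'), m.group('code'), m.group('title')
('100', 'INF', 'Informatics')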

What is “greedy” matching in regular expressions?

Regular expressions are greedy by default. This means they try to match as much text as possible, even when less would suffice. Let’s look at an example HTML snippet where we want to extract an HTML tag.

>>> text = "<body>Example greedy regular expression matching</body> "  
>>> > re.findall('<.*>', text) 
['<body>Example of greedy regular expression matching</body>']

Instead of matching up to the first ‘>’, which would have ended at the close of the opening body tag, it extracted the whole line. This is the default ‘greedy’ matching behaviour of regular expressions. Lazy matching, on the other hand, ‘takes as little as possible’. It can be enabled by adding ? to the end of the quantifier.

>>> re.findall('<.*?>', text) 
['<body>', '</body>']

If you only want to get the first match, use the search method instead.

>>> re.search('<.*?>', text).group() 
'<body>'

The most common regular expression syntax and patterns

Now that you know how to use the re module, let’s look at the most commonly used regular expression patterns.

Basic syntax

.   Any single character except the newline
\.  A literal period; the backslash removes the special meaning of special characters
\d  One digit
\D  One character other than a digit
\w  One word character: a letter, a digit or the underscore
\W  One character other than a letter, digit or underscore
\s  One whitespace character (including tab and newline)
\S  One non-whitespace character
\b  Word boundary
\n  Newline
\t  Tab

Modifiers

$       End of the string
^       Beginning of the string
ab|cd   Matches ab or cd
[ab-d]  One character: a, b, c or d
[^ab-d] Any character except a, b, c or d
()      Captures the elements in parentheses
(a(bc)) Captures elements in nested parentheses (second level)

Repeats

[ab]{2}   Exactly 2 consecutive occurrences of a or b
[ab]{2,5} 2 to 5 consecutive occurrences of a or b
[ab]{2,}  2 or more consecutive occurrences of a or b
+         1 or more occurrences
*         0 or more occurrences
?         0 or 1 occurrence

Examples of regular expressions

Any character except newline

>>> text = 'python.org'
>>> print(re.findall('.', text))  # any character except the newline
['p', 'y', 't', 'h', 'o', 'n', '.', 'o', 'r', 'g']
>>> print(re.findall('...', text))
['pyt', 'hon', '.or']

Dots in a string

>>> text = 'python.org'
>>> print(re.findall('\.', text))  # matches the literal dot
['.']
>>> print(re.findall('[^.]', text))  # matches everything except the dot
['p', 'y', 't', 'h', 'o', 'n', 'o', 'r', 'g']

Any digit

>>> text = '01, Jan 2018'
>>> print(re.findall('\d+', text))  # any number (1 or more digits in a row)
['01', '2018']

Anything but a digit

>>> text = '01, Jan 2018'
>>> print(re.findall('\D+', text))  # any sequence of non-digit characters
[', Jan ']

Any letter or number

>>> text = '01, Jan 2018'
>>> print(re.findall('\w+', text))  # any sequence of word characters (1 or more in a row)
['01', 'Jan', '2018']

Anything but letters or numbers

>>> text = '01, Jan 2018'
>>> print(re.findall('\W+', text))  # everything except letters, digits and underscores
[', ', ' ']

Only letters

>>> text = '01, Jan 2018'
>>> print(re.findall('[a-zA-Z]+', text))  # a sequence of letters
['Jan']

Match a specified number of times

>>> text = '01, Jan 2018'
>>> print(re.findall('\d{4}', text))  # any 4 digits in a row
['2018'] 
>>> print(re.findall('\d{2,4}', text))  # 2 to 4 digits in a row
['01', '2018']

1 or more occurrences

>>> print(re.findall(r'Co+l', 'So Cooool')) # 1 or more 'o' in the string
['Cooool']

Any number of occurrences (0 or more times)

>>> print(re.findall(r'Pi*lani', 'Pilani')) 
['Pilani']

0 or 1 occurrence

>>> print(re.findall(r'colou?r', 'color')) 
['color']

Word boundary

Word boundaries, \b, are used to detect and match the beginning or end of a word: one side of the position is a word character and the other side is whitespace (or the start/end of the string), and vice versa. For example, the regular expression \btoy matches ‘toy’ in ‘toy cat’, but not in ‘tolstoy’. To match the ‘toy’ in ‘tolstoy’, use toy\b. Can you come up with a regular expression that matches only the first ‘toy’ in ‘play toy broke toys’? (Hint: \b on both sides.) Similarly, \B matches any position that is not a word boundary. For example, \Btoy\B matches ‘toy’ surrounded by word characters on both sides, as in ‘antoynet’.

>>> re.findall(r'\btoy\b', 'play toy broke toys')  # match 'toy' with a word boundary on both sides
['toy']

Practical exercises

Let’s practice a little. It’s time to open your console. (The answers are given at the end of the article.)

1. Extract the username, domain name, and suffix from the given email addresses.

emails = """[email protected]  
[email protected]  
[email protected]"" "  

# required output
[('zuck26', 'facebook', 'com'), ('page33', 'google', 'com'), ('jeff42', 'amazon', 'com')

2. Extract all words beginning with ‘b’ or ‘B’ from the given text.

text = """"Betty bought a bit of butter, But the butter was so bitter, So she bought some better butter, To make the bitter butter better."""

# required output
['Betty', 'bought', 'bit', 'butter', 'But', 'butter', 'bitter', 'bought', 'better', 'butter', 'bitter', 'butter', 'better'] 

3. Remove all punctuation marks from the sentence

sentence = """A, very very; irregular_sentence""" 

# required output
A very very irregular sentence

4. Clean up the following tweet so that it contains only one user message. That is, remove all URLs, hashtags, mentions, punctuation, RT, and CC.

tweet = '''Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today https://t.co/lbwej0pxOd cc: @garybernhardt #rstats'''  

# required output
'Good advice What I would do differently if I was learning to code today'
5. Extract all the text snippets between the tags from the HTML page https://raw.githubusercontent.com/selva86/datasets/master/sample.html. Code to download the HTML page:

import requests  
r = requests.get("https://raw.githubusercontent.com/selva86/datasets/master/sample.html")  
r.text  # this is where the HTML is stored

# required output
['Your Title Here', 'Link Name', 'This is a Header', 'This is a Medium Header', 'This is a new paragraph! ', 'This is a another paragraph!', 'This is a new sentence without a paragraph break, in bold italics.']

Answers

# Task 1
>>> pattern = r'(\w+)@([A-Z0-9]+)\.([A-Z]{2,4})' 
>>> re.findall(pattern, emails, flags=re.IGNORECASE) 
[('zuck26', 'facebook', 'com'), ('page33', 'google', 'com'), ('jeff42', 'amazon', 'com')]

There are other patterns for extracting the domain and suffix; this is just one of them.
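For example, since \w already covers lowercase letters and digits, the following variant (just one more illustrative option) also works, without the IGNORECASE flag:

>>> re.findall(r'(\w+)@(\w+)\.(\w+)', emails)
[('zuck26', 'facebook', 'com'), ('page33', 'google', 'com'), ('jeff42', 'amazon', 'com')]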


# Task 2
>>> import re  
>>> re.findall(r'\bB\w+', text, flags=re.IGNORECASE) 
['Betty', 'bought', 'bit', 'butter', 'But', 'butter', 'bitter', 'bought', 'better', 'butter', 'bitter', 'butter', 'better'] 

\b is to the left of ‘B’, so the matched word must start with ‘B’. Adding flags=re.IGNORECASE makes the pattern case insensitive.


# Task 3
>>> import re  
>>> " ".join(re.split('[;,\s_]+', sentence)) 
'A very very irregular sentence' 
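One possible alternative (shown only as an illustration) is to substitute the same character class with a single space using re.sub:

>>> re.sub('[;,\s_]+', ' ', sentence)
'A very very irregular sentence'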

# Task 4
>>> import re  
>>> def clean_tweet(tweet):
		tweet = re.sub('http\S+\s*', '', tweet)  # remove URLs
		tweet = re.sub('RT|cc', '', tweet)  # remove RT and cc
		tweet = re.sub('#\S+', '', tweet)  # remove hashtags
		tweet = re.sub('@\S+', '', tweet)  # remove mentions
		tweet = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), '', tweet)  # remove punctuation
		tweet = re.sub('\s+', ' ', tweet)  # replace runs of whitespace with a single space
		return tweet  
	
>>> print(clean_tweet(tweet)) 
Good advice What I would do differently if I was learning to code today

# Task 5
>>> re.findall('<.*?>(.*)</.*?>', r.text) 
['Your Title Here', 'Link Name', 'This is a Header', 'This is a Medium Header', 'This is a new paragraph! ', 'This is a another paragraph!', 'This is a new sentence without a paragraph break, in bold italics.']

We hope this information has been useful to you. The goal was to give you examples of regular expressions in a way that is easy to remember.
