How to Find the Number of Times a Word or Phrase Occurs in a Text in Python using Regular Expressions

In this article, we show how to search text for a word or phrase in Python using regular expressions and then count the number of occurrences of this word or phrase.

Regular expressions are a way of describing patterns in text so that we can match the things we are looking for, such as a word or phrase within a text.

Python's re module allows us to use regular expressions in order to match text.

So we're going to show now how to use regular expressions in Python to match a particular word or phrase, put all occurrences in a list, and then count the number of items in the list. This way, we can know how many times a word or phrase appears within a text.

In the code below, we have a text stored in the variable, phrase.

We find the number of occurrences of the word, 'beautiful'.

import re

# The text to search
phrase = "Today I went to the gym. I saw this really beautiful girl. She was really pretty. After I came from the gym, I went to the mall. I saw this other beautiful, beautiful girl. She was like a 10/10. I then went to the supermarket and saw this other beautiful girl. I mean she was gorgeous. My gosh, there are so many beautiful girls out there"

# A list of regular expression patterns to look for
patterns = [r'beautiful']

for p in patterns:
    # findall returns a list of every match of p in phrase
    match = re.findall(p, phrase)
    print(match)

# The number of items in the list is the number of occurrences
length = len(match)
print(length)

So let's now go over this code.

re is the module in Python that allows us to use regular expressions. So we first have to import re in our code, in order to use regular expressions.

After this, we have a variable, phrase, which contains the string that we want to search using regular expressions.

We want to count how many times the word, beautiful, appears within the text.

We create a variable named patterns and set it equal to [r'beautiful'], a list containing a single raw-string pattern.

This pattern matches the literal text beautiful.

Optionally, you can also write patterns = [r'(beautiful)'], which gives the same result here. In regular expressions, parentheses group the text inside them so that it is treated as a single unit. Don't use square brackets, [], because they define a character class, which matches any one of the listed characters. If you write [beautiful], the regular expression matches any single letter b, e, a, u, t, i, f, or l in the text. This is not what you want. You want to search for the word beautiful, not for the individual letters that make it up. Therefore, either wrap the word in parentheses or simply write the word beautiful with nothing around it.
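The difference between a plain word, a parenthesized group, and a bracketed character class can be checked with a short sketch. The sample string "beautiful day" and the variable names here are made up for illustration and are not part of the article's code.

```python
import re

text = "beautiful day"

# A plain literal and a parenthesized group return the same list with findall.
plain = re.findall(r'beautiful', text)
grouped = re.findall(r'(beautiful)', text)

# Square brackets form a character class: each match is ONE character
# taken from the set b, e, a, u, t, i, f, l.
char_class = re.findall(r'[beautiful]', text)

print(plain)       # ['beautiful']
print(grouped)     # ['beautiful']
print(char_class)  # ten single letters: the nine in "beautiful" plus the "a" in "day"
```

The bracketed version returns individual letters, so counting its results would say nothing about how often the word itself appears.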

Inside the for loop, which goes through each pattern in the patterns variable, we create a variable named match and set it equal to re.findall(p, phrase). The re.findall() function returns a list of every non-overlapping match of the pattern p in the text stored in phrase.

We then print out match, which is a list containing 'beautiful' once for each time it occurs.

To count how many times beautiful appears, we just use the Python len() function, which in this case returns 5.

We print out length, which is 5.

Thus we know how many times the word beautiful appears in the text, in an automated fashion.
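The find-then-count steps can also be wrapped in a small reusable function. This is only a sketch: the name count_occurrences and the sample sentence are hypothetical and do not come from the code above.

```python
import re

def count_occurrences(pattern, text):
    """Return how many non-overlapping matches of pattern occur in text."""
    # re.findall returns a list of all matches; its length is the count.
    return len(re.findall(pattern, text))

sample = "A beautiful day, a beautiful view."
print(count_occurrences(r'beautiful', sample))  # 2
print(count_occurrences(r'ugly', sample))       # 0
```

A pattern that never matches gives an empty list, so the function returns 0 rather than raising an error.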

It's similar to the search feature in Notepad or Microsoft Word, in which you look for a word or phrase within the document.

Now let's take this one step further.

Before, we were looking for beautiful anywhere within the text.

Suppose we want only the word beautiful by itself, not words such as 'beautifully'.

How can we make sure we count only the word beautiful by itself?

Well, if beautiful appears by itself, then what comes after it cannot be an alphabetical character, such as in beautifully.

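What comes after it is typically a space, a period, or a comma. In a regular expression, \W matches any such non-word character (anything that is not a letter, digit, or underscore), and + means one or more of them.

We can count how many times beautiful occurs by itself with the code shown below.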

>>> import re
>>> phrase = "Today I saw this beautiful girl. She was dressed beautifully in this nice red dress."
>>> patterns = [r'beautiful\W+']
>>> for p in patterns:
...     match = re.findall(p, phrase)
...     print(match)
...
['beautiful ']
>>> length = len(match)
>>> print(length)
1

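With this pattern, only beautiful by itself is counted. A word such as beautifully won't register. Note that each match also includes the non-word characters that follow the word, which is why the list contains 'beautiful ' with a trailing space.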

But even this is not perfect.

Because \W accepts any non-word character, beautiful followed by something like a hyphen or an @ sign is still counted.
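As a quick check of what the \W+ pattern does and does not match, here is a small sketch. The sample sentence is invented for illustration.

```python
import re

text = "A beautiful-looking car, simply beautiful"

# \W+ needs at least one non-word character right after "beautiful".
matches = re.findall(r'beautiful\W+', text)

print(matches)       # ['beautiful-']
print(len(matches))  # 1
```

The hyphenated "beautiful-looking" is counted because a hyphen is a non-word character, while the final "beautiful" is missed because nothing follows it at the end of the text.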

We can do even better.

We will now explicitly specify which characters can follow beautiful. This can be either a space, a comma, a semicolon, a period, or a question mark.

This is shown in the code below.

>>> import re
>>> phrase = "Today I saw this beautiful girl. She was dressed beautifully in this nice red dress."
>>> patterns = [r'beautiful[.,?; ]+']
>>> for p in patterns:
...     match = re.findall(p, phrase)
...     print(match)
...
['beautiful ']
>>> length = len(match)
>>> print(length)
1

So now with this regular expression we match beautiful only when it is followed by one or more of these characters: a space (' '), period (.), comma (,), question mark (?), or semicolon (;).

Any other variations such as beautifully, beautiful1, beautiful10 are rejected.

This way, we match beautiful in its purest form.
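The rejections described above can be verified with a short sketch. It also shows a common alternative that the article's code does not use: the \b word-boundary anchor, which checks for the edge of a word without consuming any character, so it also accepts beautiful at the very end of a text. The sample string and variable names are made up for illustration.

```python
import re

text = "beautiful, beautifully beautiful1 beautiful10 beautiful? beautiful"

# Character-class pattern: one or more of . , ? ; or space must follow.
listed = re.findall(r'beautiful[.,?; ]+', text)
print(listed)        # ['beautiful, ', 'beautiful? ']

# Alternative (an assumption, not the article's pattern): \b asserts a
# word boundary on each side and adds nothing to the matched text.
bounded = re.findall(r'\bbeautiful\b', text)
print(bounded)       # ['beautiful', 'beautiful', 'beautiful']
print(len(bounded))  # 3
```

Both patterns skip beautifully, beautiful1, and beautiful10. They differ on the last word: the character-class pattern needs a following character, so it misses a beautiful that ends the text, while \b counts it.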

The same thing applies to a phrase.

Let's say that within a text, we are looking for the phrase "pepperoni pizza sandwich".

We put this in a regular expression just as we would a word.

This is shown below.

>>> import re
>>> phrase = "Today I went to the pizza store and had a sandwich. It was a delicious pepperoni pizza sandwich. I loved it"
>>> patterns = [r'pepperoni pizza sandwich']
>>> for p in patterns:
...     match = re.findall(p, phrase)
...     print(match)
...
['pepperoni pizza sandwich']
>>> length = len(match)
>>> print(length)
1

So you see phrases work just like words.
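One caveat applies when a phrase contains characters that have a special meaning in regular expressions, such as a period or question mark. A possible way to handle this, sketched below, is to pass the phrase through re.escape() first so every character is matched literally. The function name count_phrase and the sample texts are hypothetical.

```python
import re

def count_phrase(phrase_to_find, text):
    """Count literal occurrences of phrase_to_find in text."""
    # re.escape backslash-escapes special characters such as . and ?
    pattern = re.escape(phrase_to_find)
    return len(re.findall(pattern, text))

print(count_phrase("Is it good?", "Is it good? Yes. Is it good? Always."))  # 2

# Without escaping, the dot in "a.c" would also match "abc".
print(count_phrase("a.c", "abc a.c"))     # 1
print(len(re.findall("a.c", "abc a.c")))  # 2
```

For phrases made only of letters and spaces, such as "pepperoni pizza sandwich", escaping changes nothing and the plain pattern works as shown.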

And this is how we can find the number of times a word or phrase appears within a text in Python using regular expressions.

Learning about Electronics