Greedy and Lazy Matching in Python with Regular Expressions


In this article, we show how to perform greedy or lazy matching when dealing with regular expressions in Python.

First, let's understand the terms, greedy and lazy matching.

Let's say we have the following string in Python, shown below:

If you're familiar with HTML, you know that we're making an unordered list of items.

Now let's say we want to write a regular expression so that we get all content from all of the <li> </li> tags.

Greedy matching will grab all of the li tags and return them as if a single unit. It will be "greedy" and grab the first to the last li tags from the above string.

This is greedy matching, when the program takes the whole code (all the li tags) and grabs them as if a single li tag.

Lazy matching, on the other hand, will take the small occurrence of <li></li> tags and, in so doing, returns each <li></li> individually. Many times, lazy matching is what we want.

So now that you know the terms of lazy and greedy matching now, let's go over Python code that performs greedy and then lazy matching, so that you know from a practical point of view.

Greedy Matching

So we'll first show greedy matching in Python with regular expressions.

Greedy matching is shown in the following code below.

So the first thing we do is we import the re module, since we're dealing with regular expressions.

After this, we have a variable, string1, that is set equal to, "<ul><li>Item 1 <li>Item 2 <li>Item 3 <ul>"

We then have a variable, regex, which we set equal to, re.compile("<li>.* </li>")

The asterisk (*) is greedy and will grab the largest possible match. So it will grab the first li tag and the final li tag. So it takes the whole thing.

We then have a variable, matches, which we set equal to re.findall(regex,string1).

This finds all the matches of <li></li>

However, since it's greedy matching, it returns the first li tag on the page and the last li tag on the page and everything in between. This normally isn't the desired result.

Instead, what is desired in an instance like this is use lazy matching.

Lazy Matching

Now we will demonstrate lazy matching.

Lazy matching grabs a <li> tag and the next possible closest </li> tag.

It will not, like greedy matching, grab the first <li> tag and the furthest (or last) </li> tag.

So, the code is very similar to greedy matching, except we add a question mark (?) after the asterisk.

This gives us the following code shown below.

So you can see now that we now have the more desired result.

Our code grabs each individual <li></li> tag and not the whole thing.

So this is greedy vs. lazy matching in Python and how to perform each with regular expressions.

Related Resources

How to Randomly Select From or Shuffle a List in Python

HTML Comment Box is loading comments...