How to Modify a String using Regular Expressions in Python





In this article, we show how to modify a string using regular expressions in Python, with the re.compile() and re.sub() functions.

Modifying strings is one of the most common tasks in programming.

Many times, you will need to reformat text that a user enters.

Let's say you're asking a user for a phone number.

The user enters a 10-digit number, 5162221111.

This is a valid number. However, it's not easy to look at visually.

Let's say we don't reject it outright, but want to reformat it as (516)222-1111.

With regular expressions, we can do this rather easily.

Without regular expressions, it's a lot more difficult.

We can do this through a concept called back references.

With back references, we find each match of a pattern in a string, split that pattern into subexpressions (groups), and then refer back to those groups in the replacement text, inserting whatever we need between them.

This is a very powerful concept in the world of regular expressions.

It allows us to match any pattern we want to find, create subexpressions, and insert or delete whatever we want within each match.

Let's go over an example to make this clearer.

So let's say a user enters in the number, 5161112222. We want to convert this number to (516)111-2222.

We do this with back references in Python's regular expressions.
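A minimal sketch of the code discussed in the walkthrough is shown here. It assumes the variable names string1 and regex used in the explanation; the print() call is an addition made only to display the result.

```python
import re

# The phone number as the user typed it: 10 digits, no punctuation.
string1 = "5161112222"

# Three subexpressions (capture groups):
# the area code, the next 3 digits, and the last 4 digits.
regex = re.compile(r"([\d]{3})([\d]{3})([\d]{4})")

# \1, \2 and \3 refer back to the three groups, in order.
string1 = re.sub(regex, r"(\1)\2-\3", string1)

print(string1)  # (516)111-2222
```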



So let's now go over this code.

re is the module in Python that allows us to use regular expressions. So we first have to import re in our code, in order to use regular expressions.

After this, we have a variable, string1, which contains the phone number, 5161112222.

We then have a variable, regex, which is set equal to, re.compile(r"([\d]{3})([\d]{3})([\d]{4})")

This may look a little complicated. Let's break it down now.

We first create a subexpression composed of the first 3 digits in the string (anything enclosed in parentheses in the re.compile() function is a subexpression). This is the area code.

We then create a second subexpression composed of the next 3 digits of the phone number.

We then create a third subexpression composed of the next 4 digits of the phone number.

So we now have 3 subexpressions from the 10-digit phone number. The first is the area code. The second is the next 3 digits after the area code. And the third is the last 4 digits of the phone number.

So again, 3 subexpressions.
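To see what each subexpression captures, we can inspect the match object directly. This is only a sketch for illustration; the names match and groups are assumptions and are not part of the main example.

```python
import re

regex = re.compile(r"([\d]{3})([\d]{3})([\d]{4})")

# search() returns a match object for the first place the pattern matches.
match = regex.search("5161112222")

# groups() returns the text captured by each subexpression, in order.
groups = match.groups()
print(groups)          # ('516', '111', '2222')

# group(1) is the same text that \1 refers to in a replacement string.
print(match.group(1))  # 516
```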

We then modify the string using the statement, string1= re.sub(regex, r"(\1)\2-\3", string1)

This is where back references come in.

Remember that the first subexpression created in the re.compile() function was the area code. This is referenced as \1. We write (\1) so that the area code is wrapped in parentheses. Then comes \2, the next 3 digits, followed by a dash (-), followed by \3, the last 4 digits. This changes 5161112222 to (516)111-2222.

If you want to put a space after the area code, you would do this by adding a space after the closing parenthesis that follows \1. So the full line would be, string1= re.sub(regex, r"(\1) \2-\3", string1)
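As a quick check of the spaced variant, here is a short sketch; the variable name spaced is an assumption used only for this illustration.

```python
import re

regex = re.compile(r"([\d]{3})([\d]{3})([\d]{4})")

# Same pattern as before; only the replacement template changes.
# The space sits right after the closing parenthesis around \1.
spaced = re.sub(regex, r"(\1) \2-\3", "5161112222")

print(spaced)  # (516) 111-2222
```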

This line of code will modify every run of 10 consecutive digits in a string into the format we specified.

Thus, if we have multiple phone numbers in a string, they will all be reformatted.

This is shown in the code below.
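A minimal sketch with several phone numbers in one string follows. The particular numbers and the comma separators are assumptions chosen for illustration.

```python
import re

# Several 10-digit numbers in a single string (hypothetical sample data).
string1 = "5161112222, 5162223333, 5163334444"

regex = re.compile(r"([\d]{3})([\d]{3})([\d]{4})")

# re.sub() replaces every non-overlapping match, not just the first one.
string1 = re.sub(regex, r"(\1)\2-\3", string1)

print(string1)  # (516)111-2222, (516)222-3333, (516)333-4444
```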



So you can see all the phone numbers get formatted to the way we want.

Let's do a few more examples.


Adding Anchor Tags to Hypertext Links

Let's work out a few more examples to see how we can modify strings using back references with regular expressions in Python.

Let's say we have a list of hypertext links on a page, such as http://www.google.com, http://www.learningaboutelectronics.com, http://www.dropbox

And let's say we want to convert all of these hypertext links to HTML anchor tags, such as <a href="http://www.google.com">google</a>

This is shown in the code below.
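A minimal sketch of this conversion follows. The variable names, the exact replacement template, and the https://www.example.com link (included to exercise the optional s) are assumptions added for illustration; the pattern is the one discussed in the walkthrough.

```python
import re

# A string of hypertext links (sample data; the https link is hypothetical).
string1 = ("http://www.google.com, "
           "http://www.learningaboutelectronics.com, "
           "https://www.example.com")

# Group 1 is the whole link; group 2 is only the domain name.
# Note: the dots here are unescaped, so each matches any character;
# writing \. instead would match only a literal dot.
regex = re.compile(r"(https?://www.(.*?).com)")

# \1 fills the href attribute; \2 becomes the visible link text.
string1 = re.sub(regex, r'<a href="\1">\2</a>', string1)

print(string1)
```

Each link in the output is wrapped in an anchor tag whose visible text is the bare domain name, for example google for the first link.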



We have a string composed of hypertext links.

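We then have our regex variable, which is set equal to, re.compile(r"(https?://www.(.*?).com)")

We have a question mark after the s because we don't know whether the hypertext link will begin with http:// or https://. Note that the dots in this pattern are unescaped, so each one matches any character, not just a literal dot; writing \. would make the pattern stricter.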

We have 2 subexpressions within the hypertext link. One covers the entire hypertext link, i.e., http://www.google.com

The other covers only the domain name, i.e., google

We then have our re.sub() function, which modifies the string, putting each hypertext link within HTML anchor tags.

\1 references the entire hypertext link, which is needed for the href attribute of the HTML anchor tag.

\2 references the domain name, which is what the user will see displayed.

And this is another example of modifying a string using back references with regular expressions in Python.


Mathematical Expressions

We will do one last example of modifying strings using back references with regular expressions.

In mathematical notation, writing a number directly next to x means that number multiplied by x.

Many Python modules need the expression as n*x (where n is a number) in order to work.

If expressed as nx, the program will throw an error.
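As a quick illustration of that error, the sketch below uses Python's built-in eval() as a stand-in for such a module. This is an assumption made for demonstration only; the names implicit_ok and explicit_value are hypothetical.

```python
# Python's own parser rejects "3x" but accepts "3*x".
try:
    eval("3x+5", {"x": 2})
    implicit_ok = True
except SyntaxError:
    # "3x" is not valid Python syntax, so we end up here.
    implicit_ok = False

# With an explicit *, the expression evaluates normally for x = 2.
explicit_value = eval("3*x+5", {"x": 2})

print(implicit_ok)     # False
print(explicit_value)  # 11
```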

This is why we will create a program that finds nx and replaces it with n*x.

This is shown in the code below.
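A minimal sketch of this substitution follows. The sample expression in string1 is an assumption chosen for illustration; the pattern and replacement are the ones discussed in the walkthrough.

```python
import re

# A mathematical expression with terms in the form nx (hypothetical sample).
string1 = "5x + 10x + 3"

# Group 1 is one or more digits; group 2 is the x that follows them.
regex = re.compile(r"([\d]+)(x)")

# Insert a * between the number (\1) and the x (\2).
string1 = re.sub(regex, r"\1*\2", string1)

print(string1)  # 5*x + 10*x + 3
```

The constant term 3 is left alone because it is not followed by an x.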



So we have a mathematical expression in the variable, string1. This mathematical expression takes the form nx.

We have a variable, regex, which is set equal to, re.compile(r"([\d]+)(x)")

This regular expression looks for one or more digits followed by an x. With this regex variable, we have 2 subexpressions. The first is the number (all of its digits) and the second is the x.

We then use the re.sub() function to find this pattern and substitute it with n*x instead of nx.

\1 references the number before the x.

\2 references the x.

The call, re.sub(regex, r"\1*\2", string1), adds a * between the number and the x.

This is another example of how strings can be modified using back references with regular expressions in Python.




