How to Parse any HTML Element in Python with BeautifulSoup





In this article, we show how to parse any HTML element in Python with BeautifulSoup.

With BeautifulSoup, we can get the value of any HTML element on a page.

Doing this is simple.

We can use the find() function in BeautifulSoup to find any element by its tag name. Thus, if we call the find() function and pass in the tag name 'title', we get the title element of the HTML document.

We go over several examples in this article.


Parsing an HTML Document for the Title

Let's go over how to find the title of an HTML document using BeautifulSoup.

The code to get the title of an HTML document is shown below.
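A minimal sketch of this step follows. It uses the variable names from the walkthrough (getpage, getpage_soup, title). The helper names fetch_html and get_title and the inline SAMPLE_HTML string are illustrative assumptions, added so the example can run without a network connection.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical stand-in markup so the sketch runs offline.
SAMPLE_HTML = """<html><head><title>Example Page</title></head>
<body><p>Some text</p></body></html>"""


def fetch_html(url):
    """Retrieve a page with requests.get() and return its HTML as text."""
    getpage = requests.get(url, timeout=10)
    getpage.raise_for_status()  # stop early on an HTTP error status
    return getpage.text


def get_title(html):
    """Return the first <title> tag in the HTML, or None if there is none."""
    getpage_soup = BeautifulSoup(html, 'html.parser')
    return getpage_soup.find('title')


title = get_title(SAMPLE_HTML)
print(title)         # the whole tag, including <title> and </title>
print(title.string)  # only the text inside the tag
```

To run this against a live page instead of the sample string, pass the result of fetch_html() to get_title(), using the URL given in the walkthrough.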



To fetch a page and parse it, we import requests and BeautifulSoup.

We then create a variable, getpage, to retrieve the page, http://www.learningaboutelectronics.com/Articles/Hypothesis-testing-calculator.php

Page retrieval here is done with the requests.get() function.

We then parse this page with BeautifulSoup. We create the variable, getpage_soup, which holds the parsed content of the page.

We then create a variable named title, which will hold the value of the HTML title element.

We then print out the title of the HTML document.


Parsing an HTML Document for Meta Data

Next we'll show how to obtain all of the meta data from an HTML file.

The meta data contains information about the page, such as what it's about, the descriptive keywords, and what type of file it is (e.g., text/html).

The code to get all of the meta data from an HTML file is shown below.
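A minimal sketch of this step follows, assuming an inline sample document in place of a fetched page. SAMPLE_HTML and its meta values are hypothetical. The code uses find_all(), the current spelling of findAll().

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup with several meta tags.
SAMPLE_HTML = """<html><head>
<meta charset="utf-8">
<meta name="description" content="A sample page">
<meta name="keywords" content="python, html">
<title>Example Page</title>
</head><body></body></html>"""

getpage_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find_all() returns every matching tag, in document order.
all_meta = getpage_soup.find_all('meta')

for meta in all_meta:
    # .get() returns None when an attribute is absent, instead of raising.
    print(meta.get('name'), meta.get('content'))
```

Reading attributes with .get() avoids a KeyError on meta tags that have no name or content attribute, such as the charset tag in the sample.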



This is very similar to the first example, where we found the title tag, but now we use the findAll() function instead of find(). This is because there are usually multiple meta tags. While find() only locates the first occurrence of a tag, findAll() finds all occurrences of the tag. In current versions of BeautifulSoup the same method is spelled find_all(); findAll() is the older name.
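The difference between the two calls can be sketched as follows, again on a hypothetical inline snippet. It also shows what each call returns when nothing matches, which is worth checking before using the result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two meta tags and no table.
SAMPLE_HTML = '<head><meta name="a" content="1"><meta name="b" content="2"></head>'
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

first_meta = soup.find('meta')       # a single tag: the first match
all_meta = soup.find_all('meta')     # a list-like collection of every match

missing_one = soup.find('table')     # None when there is no match
missing_all = soup.find_all('table') # empty collection when there is no match

print(first_meta.get('name'))
print(len(all_meta))
print(missing_one, len(missing_all))
```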


Parsing an HTML Document for the p Tag

Now we will parse the HTML for the p tag.

It's just like what we did with the title tag.

Now let's say that we have paragraphs that have a class attribute equal to "topsection".

How can we get all paragraph tags that have a class equal to "topsection"?

We do this by passing a dictionary with a key of 'class' and a value of "topsection" as the second argument.

This is shown in the code below.
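A minimal sketch of this step follows, assuming a small inline document in place of a fetched page. SAMPLE_HTML and its paragraph text are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup: two "topsection" paragraphs and one plain one.
SAMPLE_HTML = """<html><body>
<p class="topsection">First top paragraph</p>
<p>Plain paragraph</p>
<p class="topsection">Second top paragraph</p>
</body></html>"""

getpage_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Without a filter, find() returns the first p tag of any class.
first_p = getpage_soup.find('p')

# The dictionary restricts the search to p tags whose class is "topsection".
top_paragraphs = getpage_soup.find_all('p', {'class': 'topsection'})

for p in top_paragraphs:
    print(p.get_text())
```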



So this is all that is needed to parse an HTML document for any HTML element.

If you want to parse an HTML document for an element with a class or id attribute, see the following article: How to Find HTML Elements of a Class or id on a Web Page in Python.

BeautifulSoup really is a great module for parsing HTML elements in Python. It's pretty adaptable.




