Overview of BeautifulSoup

Let us get a brief overview of BeautifulSoup. It is a Python library that provides APIs to parse and process HTML data.

  • The APIs are built on top of HTML’s DOM (tree) structure.

  • If you are not familiar with HTML, make sure to familiarize yourself with the following terms.

    • Tags

    • Tag Attributes such as id, class

    • Content or Text (between tags)

  • Here are some of the important tags from the perspective of web scraping.

    • head

    • body

    • Anchor Tag - a

    • Container Tag - div

    • Script Tag - script

    • Paragraph Tag - p

    • Preformatted Tag - pre

    • Table-related tags such as table, table header - th, table row - tr, table data - td and more.

  • Here are the steps to process a string that contains HTML.

    • Create a string which contains HTML.

    • Create a BeautifulSoup object by passing the HTML string, along with 'html.parser' as the features keyword argument.

  • Here is a simple HTML snippet that we will use to explore some of the core capabilities of BeautifulSoup.

    <table>
        <tr><th>Details</th><th>URL</th></tr>
        <tr><td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
        <tr><td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
    </table>

When rendered, it displays as the following table.

    Details             | URL
    Video Content       | YouTube Channel
    Reference Material  | GitHub Repository
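
To tie the terms listed earlier (tag, attribute, text) to this sample, here is a minimal sketch that parses a single row and inspects its link. The names row_html, row_soup and link are illustrative and not part of the original material. It assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# One row of the sample table (row_html is an illustrative name).
row_html = """<table>
  <tr>
    <td>Video Content</td>
    <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
  </tr>
</table>"""

# Parse the string with Python's built-in HTML parser.
row_soup = BeautifulSoup(row_html, features="html.parser")

# find() returns the first matching tag in the document.
link = row_soup.find("a")

tag_name = link.name        # the tag itself: 'a'
tag_attrs = link.attrs      # its attributes as a dict
tag_text = link.get_text()  # the text between <a> and </a>

print(tag_name, tag_attrs, tag_text)
```

The same three pieces (name, attributes, text) are available on every tag that BeautifulSoup returns, which is why they are worth knowing before scraping.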


Let us define a string variable for the above HTML.

html_str = """<table>
    <tr><th>Details</th><th>URL</th></tr>
    <tr><td>Video Content</td>
        <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
    <tr><td>Reference Material</td>
        <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
</table>"""

Let us build a BeautifulSoup object so that we can use its methods or APIs to scrape the HTML content.

  • Create a BeautifulSoup object named soup.

  • We can access the first occurrence of a tag by referring to it as an attribute of the object, for example soup.a.

from bs4 import BeautifulSoup

# In a notebook, evaluating the object without assigning it
# displays the parsed HTML back.
BeautifulSoup(html_str, features='html.parser')
# features can also be passed as the second positional argument.
soup = BeautifulSoup(html_str, 'html.parser')

# prettify() returns the parsed HTML as an indented string,
# with each tag and each piece of text on its own line.
print(soup.prettify())
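
The earlier point about accessing the first occurrence of a tag through its reference can be sketched as follows. This is a self-contained illustration rather than the original notebook's code. The names demo_html, demo_soup and the variables derived from them are assumptions introduced here.

```python
from bs4 import BeautifulSoup

# Two data rows from the sample table (demo_html is an illustrative name).
demo_html = """<table>
  <tr><td>Video Content</td>
      <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
  <tr><td>Reference Material</td>
      <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
</table>"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# Attribute-style access returns only the FIRST matching tag.
first_cell = demo_soup.td
first_link = demo_soup.a

first_cell_text = first_cell.get_text()  # text of the first <td>
first_link_href = first_link["href"]     # attribute lookup, like a dict
first_link_text = first_link.get_text()  # text of the first <a>

# A tag that does not occur in the document yields None.
missing_tag = demo_soup.img

print(first_cell_text, first_link_href, first_link_text, missing_tag)
```

Because this style of access stops at the first match, the second row's link is not reachable this way. Collecting every occurrence needs a method such as find_all.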