Overview of BeautifulSoup

Let us get a brief overview of BeautifulSoup. It is a Python library which provides APIs to parse and process HTML data.

  • The APIs follow the tree (DOM) structure of an HTML document.

  • If you are not familiar with HTML, make sure to familiarize yourself with the following terms.

    • Tags

    • Tag Attributes such as id, class

    • Content or Text (between tags)

  • Here are some of the important tags from the perspective of web scraping.

    • head

    • body

    • Anchor Tag - a

    • Container Tag - div

    • Script Tag - script

    • Paragraph Tag - p

    • Preformatted Tag - pre

    • Table-related tags such as table, table header - th, table row - tr, table data - td, and more.

  • Here are the steps we can perform to process strings that contain HTML.

    • Create a string which contains HTML.

    • Create a BeautifulSoup object by passing the HTML string along with html.parser as the features keyword argument.

  • Here is a simple HTML table which we will use to explore some of the core capabilities of BeautifulSoup.

%%html
<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td>
        </tr>
    </tbody>
</table>
Details             URL
Video Content       YouTube Channel
Reference Material  GitHub Repository

Note

Let us define a string variable for the above HTML.

html_str = """<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
            </td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
            </td>
        </tr>
    </tbody>
</table>"""

Let us build a BeautifulSoup object so that we can use its methods or APIs to scrape the HTML content.

  • Create a BeautifulSoup object named soup.

  • We can access the first occurrence of a tag by using the tag name as an attribute of the object (for example, soup.a).

from bs4 import BeautifulSoup
#help(BeautifulSoup)
BeautifulSoup(html_str, features='html.parser')
<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>
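
The HTML terms listed earlier (tag, tag attributes, and text) map onto properties of a parsed element. The following is a minimal sketch on a single anchor tag. The id and class values, and the variable names snippet, doc, and link, are made up for illustration and are not part of the table used in this section.

```python
from bs4 import BeautifulSoup

# A one-element document, small enough to inspect by eye.
# The id and class values are hypothetical, added only for this example.
snippet = (
    '<a id="yt" class="link external" '
    'href="https://www.youtube.com/itversityin">YouTube Channel</a>'
)
doc = BeautifulSoup(snippet, 'html.parser')

link = doc.a             # the parsed anchor element
print(link.name)         # tag name
print(link.attrs)        # all attributes as a dictionary
print(link['id'])        # a single attribute, read like a dict key
print(link['class'])     # class can hold several values, so it is a list
print(link['href'])
print(link.get_text())   # the text between the opening and closing tags
```

Reading an attribute with square brackets raises KeyError when the attribute is missing. link.get('href') returns None instead, which is often more convenient when scraping pages whose markup is not uniform.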
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_str, 'html.parser')
print(soup.prettify())
<table>
 <tbody>
  <tr>
   <th>
    Details
   </th>
   <th>
    URL
   </th>
  </tr>
  <tr>
   <td>
    Video Content
   </td>
   <td>
    <a href="https://www.youtube.com/itversityin">
     YouTube Channel
    </a>
   </td>
  </tr>
  <tr>
   <td>
    Reference Material
   </td>
   <td>
    <a href="https://www.github.com/dgadiraju/itversity-books">
     GitHub Repository
    </a>
   </td>
  </tr>
 </tbody>
</table>
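
Accessing the first occurrence of a tag by its name can be sketched as follows. The block repeats a shortened copy of the HTML string so that it runs on its own. The names first_header and first_link are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table used in this section (header row and one data row).
html_str = """<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
            </td>
        </tr>
    </tbody>
</table>"""

soup = BeautifulSoup(html_str, 'html.parser')

# soup.<tag name> returns the first matching tag in the document.
first_header = soup.th
first_link = soup.a

print(first_header)             # <th>Details</th>
print(first_header.get_text())  # Details
print(first_link['href'])       # https://www.youtube.com/itversityin
print(first_link.get_text())    # YouTube Channel

# The same access style can be chained to walk down the tree.
print(soup.tr.th.get_text())    # Details

# A tag that does not exist in the document yields None.
print(soup.p)                   # None
```

Because only the first match is returned, soup.th never reaches the second header (URL). Retrieving every occurrence needs a search method such as find_all.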