Overview of BeautifulSoup

Let us get a brief overview of BeautifulSoup. It is a Python library that provides APIs to parse and process HTML data.

The APIs are built around the tree structure of an HTML document (its DOM).
If you are not familiar with HTML, make sure you are familiar with the following terms.
- Tags
- Tag Attributes such as id, class
- Content or Text (between tags)
Here are some of the important tags from the perspective of web scraping.
- head
- body
- Anchor Tag - a
- Container Tag - div
- Script Tag - script
- Paragraph Tag - p
- Preformatted Tag - pre
- Table related tags such as table, table header - th, table row - tr, table data - td, and more.
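To make the terms above concrete, here is a minimal sketch showing a tag, its attributes, and its text as BeautifulSoup exposes them. The snippet string, the variable names, and the example.com URL are made up for illustration and are not part of the material in this section.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML snippet, used only to illustrate the terms.
snippet = '<p id="intro" class="lead">Hello, <a href="https://example.com">world</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

p = soup.p                 # the first <p> tag in the document
print(p.name)              # tag name: p
print(p["id"])             # attribute value: intro
print(p["class"])          # class can hold several values, so it is a list: ['lead']
print(p.get_text())        # text between the tags: Hello, world
print(p.a["href"])         # attribute of the nested anchor tag: https://example.com
```

Note that class is returned as a list because an HTML element may carry more than one class name.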
Here are the steps we can perform to process strings that contain HTML.
- Create a string which contains HTML.
- Create a BeautifulSoup object by passing the HTML string along with html.parser as the features keyword argument.
Here is a simple HTML snippet which we will use to explore some of the core capabilities of BeautifulSoup.

%%html
<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td>
        </tr>
    </tbody>
</table>

Details	URL
Video Content	YouTube Channel
Reference Material	GitHub Repository

Note

Let us define a string variable that holds the above HTML.

html_str = """<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
            </td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
            </td>
        </tr>
    </tbody>
</table>"""

Let us build a BeautifulSoup object so that we can use its methods or APIs to scrape the HTML content.

Create a BeautifulSoup object named soup.
We can access the first occurrence of a tag by using the tag name as an attribute of the object (for example, soup.a).

from bs4 import BeautifulSoup

#help(BeautifulSoup)

BeautifulSoup(html_str, features='html.parser')

<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>
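The parser name can be passed either through the features keyword argument or as the second positional argument. The following sketch, which uses a short made-up string rather than html_str, suggests that both forms produce the same parsed document.

```python
from bs4 import BeautifulSoup

# Hypothetical short snippet, used only for this comparison.
sample = "<p>Hello <b>soup</b></p>"

by_keyword = BeautifulSoup(sample, features="html.parser")
by_position = BeautifulSoup(sample, "html.parser")

# Both calls select the same parser, so the parsed trees serialize identically.
print(str(by_keyword) == str(by_position))   # True
print(type(by_keyword).__name__)             # BeautifulSoup
```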

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_str, 'html.parser')
print(soup.prettify())

<table>
 <tbody>
  <tr>
   <th>
    Details
   </th>
   <th>
    URL
   </th>
  </tr>
  <tr>
   <td>
    Video Content
   </td>
   <td>
    <a href="https://www.youtube.com/itversityin">
     YouTube Channel
    </a>
   </td>
  </tr>
  <tr>
   <td>
    Reference Material
   </td>
   <td>
    <a href="https://www.github.com/dgadiraju/itversity-books">
     GitHub Repository
    </a>
   </td>
  </tr>
 </tbody>
</table>
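As a sketch of accessing the first occurrence of a tag through its name, the example below rebuilds the soup object from the same table and reads the first th, td, and a tags. The variable names first_th, first_td, first_link, and links are chosen here for illustration. The find_all call at the end is not covered in this section and is included only to contrast "first occurrence" with "all occurrences".

```python
from bs4 import BeautifulSoup

html_str = """<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td>
        </tr>
    </tbody>
</table>"""

soup = BeautifulSoup(html_str, "html.parser")

# Attribute-style access returns only the first matching tag.
first_th = soup.th
first_td = soup.td
first_link = soup.a

print(first_th.get_text())     # Details
print(first_td.get_text())     # Video Content
print(first_link.get_text())   # YouTube Channel
print(first_link["href"])      # https://www.youtube.com/itversityin

# A tag that does not exist in the document yields None.
print(soup.title)              # None

# For contrast: find_all returns every matching tag, not just the first.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
print(links)
```

Because attribute-style access stops at the first match, the second link (the GitHub repository) is only reachable here through find_all or by navigating to the later table row.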
