Overview of BeautifulSoup
Let us get a brief overview of BeautifulSoup. It is a Python library which provides APIs to process HTML data.
The APIs are built on top of HTML’s DOM structure.
If you are not familiar with HTML, make sure to familiarize yourself with the following terms.
Tags
Tag Attributes such as id, class
Content or Text (between tags)
Here are some of the important tags from the perspective of web scraping.
head
body
Anchor Tag - a
Container Tag - div
Script Tag - script
Paragraph Tag - p
Preformatted Tag - pre
Table related tags such as table, table header - th, table row - tr, table data - td, and more.
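To make the terms above concrete, here is a minimal sketch that parses a single anchor tag and reads its name, attributes, and text. The one-line snippet and its id and class values are hypothetical, made up only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet, used only for illustration.
snippet = '<a id="yt" class="link" href="https://www.youtube.com/itversityin">YouTube Channel</a>'

# Parse the snippet and grab the anchor tag.
tag = BeautifulSoup(snippet, 'html.parser').a

tag_name = tag.name          # the tag itself: 'a'
tag_id = tag['id']           # single-valued attribute, returned as a string
tag_classes = tag['class']   # class can hold several values, so a list is returned
tag_text = tag.text          # content between the opening and closing tags

print(tag_name, tag_id, tag_classes, tag_text)
```

Note that class comes back as a list because an HTML element may carry more than one class.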
Here are the steps we can perform to process strings which contain HTML.
Create a string which contains HTML.
Create a BeautifulSoup object by passing the HTML string along with html.parser as the features keyword argument.
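The two steps above can be sketched as follows. The sample string and the variable names are assumptions made for illustration. The sketch also shows that the parser name can be passed either through the features keyword or as the second positional argument, and both forms should produce the same parse.

```python
from bs4 import BeautifulSoup

# Step 1: a string that contains HTML (hypothetical sample).
sample_html = '<p>Hello <a href="https://www.youtube.com/itversityin">YouTube Channel</a></p>'

# Step 2: build the BeautifulSoup object, naming the parser via the features keyword.
soup_kw = BeautifulSoup(sample_html, features='html.parser')

# features is also the second positional parameter, so this call is equivalent.
soup_pos = BeautifulSoup(sample_html, 'html.parser')

# Compare the two parsed documents after converting them back to strings.
same_parse = str(soup_kw) == str(soup_pos)
print(same_parse)
```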
Here is a simple HTML table which we will use to explore some of the core capabilities of BeautifulSoup.
%%html
<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td>
</tr>
</tbody>
</table>
Details | URL |
---|---|
Video Content | YouTube Channel |
Reference Material | GitHub Repository |
Note
Let us define a string variable which holds the above HTML.
html_str = """<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>"""
Let us build a BeautifulSoup object so that we can use its methods or APIs to scrape the HTML content.
Create a BeautifulSoup object by name soup.
We can access the first occurrence of a tag by using the tag name as an attribute of the object, for example soup.a.
from bs4 import BeautifulSoup
#help(BeautifulSoup)
BeautifulSoup(html_str, features='html.parser')
<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_str, 'html.parser')
print(soup.prettify())
<table>
<tbody>
<tr>
<th>
Details
</th>
<th>
URL
</th>
</tr>
<tr>
<td>
Video Content
</td>
<td>
<a href="https://www.youtube.com/itversityin">
YouTube Channel
</a>
</td>
</tr>
<tr>
<td>
Reference Material
</td>
<td>
<a href="https://www.github.com/dgadiraju/itversity-books">
GitHub Repository
</a>
</td>
</tr>
</tbody>
</table>
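As mentioned earlier, the first occurrence of a tag can be reached by using the tag name as an attribute of the soup object. Here is a minimal sketch of that idea, using a condensed copy of the same table so that the block runs on its own. The variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Condensed copy of the table used in this section.
html_str = """<table>
<tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
<tr><td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
</tbody>
</table>"""

soup = BeautifulSoup(html_str, 'html.parser')

first_header = soup.th   # first <th> in the document
first_link = soup.a      # first <a> in the document

header_text = first_header.text   # text between <th> and </th>
link_text = first_link.text       # text between <a> and </a>
link_url = first_link['href']     # value of the href attribute

# A tag that does not occur in the document yields None rather than an error.
missing = soup.img

print(header_text, link_text, link_url, missing)
```

Only the first match is returned this way, so the second link in the table is not reachable through soup.a alone.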