Parsing HTML using BeautifulSoup
Now that we have created the BeautifulSoup object, let us explore the APIs and methods used to scrape content from HTML.
Fundamentally, a BeautifulSoup object behaves like a nested, dict-like tree: tags are reached by attribute access (for example soup.table) and tag attributes are read by key (for example a['href']).
Let us look at some basic examples to understand how to read tags, attributes, and content within an HTML string.
- Accessing the first occurrence of tr.
- Accessing the first th value, using either the string attribute or the get_text() method.
- Accessing the first occurrence of an anchor tag.
- Getting the URL from the href attribute of an anchor tag.
- Accessing the value of an anchor tag.
- Getting all anchor tags.
- Getting all td tags.
- Getting values from all td tags.
- Getting values and URLs from anchor tags as a list of dicts.
%run 03_overview_of_beautifulsoup.ipynb
Details | URL |
---|---|
Video Content | YouTube Channel |
Reference Material | GitHub Repository |
<table>
 <tbody>
  <tr>
   <th>
    Details
   </th>
   <th>
    URL
   </th>
  </tr>
  <tr>
   <td>
    Video Content
   </td>
   <td>
    <a href="https://www.youtube.com/itversityin">
     YouTube Channel
    </a>
   </td>
  </tr>
  <tr>
   <td>
    Reference Material
   </td>
   <td>
    <a href="https://www.github.com/dgadiraju/itversity-books">
     GitHub Repository
    </a>
   </td>
  </tr>
 </tbody>
</table>
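The soup object used in the rest of this section comes from the previous notebook via %run. For readers following along without that notebook, here is a minimal sketch of how an equivalent object could be built. It assumes the HTML string matches the table printed above and uses Python's built-in html.parser, which may differ from the parser used in the original notebook. The name html_doc is chosen here purely for illustration.

```python
from bs4 import BeautifulSoup

# Assumed HTML: mirrors the table shown in the output above.
html_doc = """
<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>
"""

# html.parser ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html_doc, "html.parser")

# Tags are reached by attribute access, walking down the tree.
first_row = soup.table.tbody.tr
print(first_row.th.string)
```

With this in place, the examples that follow can be run as written against soup.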
Accessing the first occurrence of tr
type(soup)
bs4.BeautifulSoup
soup.table
<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>
soup.table.tbody.tr
<tr>
<th>Details</th>
<th>URL</th>
</tr>
list(soup.table.tbody.children)
['\n',
<tr>
<th>Details</th>
<th>URL</th>
</tr>,
'\n',
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>,
'\n',
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>,
'\n']
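The list above contains '\n' strings because the whitespace between tags is part of the tree. As a rough sketch of two ways to keep only the tr elements, the snippet below filters children by type and, alternatively, calls find_all with recursive=False. The shortened HTML string is an assumption made to keep the example self-contained.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, assumed HTML with newlines between the rows.
html_doc = (
    "<table><tbody>\n"
    "<tr><th>Details</th><th>URL</th></tr>\n"
    "<tr><td>Video Content</td><td>YouTube Channel</td></tr>\n"
    "</tbody></table>"
)
soup = BeautifulSoup(html_doc, "html.parser")
tbody = soup.table.tbody

# .children yields every direct child, including whitespace strings.
all_children = list(tbody.children)

# Option 1: keep only Tag objects, dropping the text nodes.
tag_children = [child for child in tbody.children if isinstance(child, Tag)]

# Option 2: ask for direct tr children only.
direct_rows = tbody.find_all("tr", recursive=False)

print(len(all_children), len(tag_children), len(direct_rows))
```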
ele = soup.table.tbody.tr
# Walk through the siblings (rows and the whitespace between them)
# until next_sibling returns None
while ele is not None:
    print(ele)
    ele = ele.next_sibling
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
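Walking next_sibling by hand also visits the whitespace between rows. As an alternative sketch, find_next_siblings can return only the sibling tags with a given name. The compact HTML string below is an assumption that mirrors the table used in this section.

```python
from bs4 import BeautifulSoup

# Assumed HTML: same rows as the table in this section.
html_doc = (
    "<table><tbody>\n"
    "<tr><th>Details</th><th>URL</th></tr>\n"
    "<tr><td>Video Content</td>"
    '<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>\n'
    "<tr><td>Reference Material</td>"
    '<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>\n'
    "</tbody></table>"
)
soup = BeautifulSoup(html_doc, "html.parser")

header_row = soup.table.tbody.tr

# Only tr siblings that come after the header row; text nodes are skipped.
data_rows = header_row.find_next_siblings("tr")

# Join the text of each row with a single space, stripping whitespace.
row_texts = [row.get_text(" ", strip=True) for row in data_rows]
print(row_texts)
```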
To access the value of the first th, we can use the string attribute or the get_text() method
soup.table.tbody.tr.th
<th>Details</th>
soup.table.tbody.tr.th.string
'Details'
soup.table.tbody.tr.th.get_text()
'Details'
Accessing first occurrence of anchor tag
soup.table.tbody.a
<a href="https://www.youtube.com/itversityin">YouTube Channel</a>
Getting the URL from the href attribute of the anchor tag
soup.table.tbody.a['href']
'https://www.youtube.com/itversityin'
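Indexing with a['href'] raises KeyError when the attribute is absent. A small sketch of the more forgiving get method follows; the second anchor, which has no href, is a hypothetical element added only to show the difference.

```python
from bs4 import BeautifulSoup

# The second anchor (no href) is hypothetical, for illustration only.
html_doc = (
    '<p><a href="https://www.youtube.com/itversityin">YouTube Channel</a> '
    '<a name="top">No link here</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
with_href, without_href = soup.find_all("a")

url = with_href.get("href")            # attribute present
missing = without_href.get("href")     # attribute absent: None
fallback = without_href.get("href", "")  # attribute absent: custom default

# Key-style access raises KeyError for a missing attribute.
try:
    without_href["href"]
    raised = False
except KeyError:
    raised = True

print(url, missing, repr(fallback), raised)
```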
Accessing the text of the anchor tag
soup.table.tbody.a.string
'YouTube Channel'
soup.table.tbody.a.get_text()
'YouTube Channel'
Get all anchor tags
soup.table.tbody.find_all('a')
[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
<a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]
soup.find_all('a')
[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
<a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]
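Besides find_all with a tag name, the same anchors can be located with a CSS selector or by filtering on the presence of an attribute. The sketch below assumes a compact version of the table from this section and relies on the soupsieve package that normally installs alongside recent versions of bs4.

```python
from bs4 import BeautifulSoup

# Assumed HTML: compact version of the table in this section.
html_doc = (
    "<table><tbody>"
    "<tr><th>Details</th><th>URL</th></tr>"
    "<tr><td>Video Content</td>"
    '<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>'
    "<tr><td>Reference Material</td>"
    '<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>'
    "</tbody></table>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# CSS selector: anchors nested anywhere inside a td.
anchors_in_cells = soup.select("td a")

# Keyword filter: only anchors that actually carry an href attribute.
anchors_with_href = soup.find_all("a", href=True)

# select_one returns the first match, or None when nothing matches.
first_anchor = soup.select_one("td a")

print([a["href"] for a in anchors_in_cells])
```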
Get all td tags
soup.find_all('td')
[<td>Video Content</td>,
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>,
<td>Reference Material</td>,
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>]
for td in soup.find_all('td'):
    print(td)
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
Get values from all td tags
# string returns None when a tag has more than one child,
# for example an anchor tag followed by a newline character
for td in soup.find_all('td'):
    print(td.string)
Video Content
None
Reference Material
None
# get_text concatenates all the text inside the tag,
# so it still works when string returns None
for td in soup.find_all('td'):
    print(td.get_text())
Video Content
YouTube Channel
Reference Material
GitHub Repository
# Strip the trailing newline characters
for td in soup.find_all('td'):
    print(td.get_text().rstrip('\n'))
Video Content
YouTube Channel
Reference Material
GitHub Repository
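As an alternative to calling rstrip on the result, get_text accepts strip=True, and stripped_strings yields each piece of text with whitespace removed. The following is a minimal sketch on an assumed one-row table, not the notebook's own code.

```python
from bs4 import BeautifulSoup

# Assumed HTML: one row, with a newline after the anchor as in this section.
html_doc = (
    "<table><tr><td>Video Content</td>"
    '<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>\n</td>'
    "</tr></table>"
)
soup = BeautifulSoup(html_doc, "html.parser")
cells = soup.find_all("td")

# string is None for the second cell: it has two children (anchor + newline).
raw = [td.string for td in cells]

# strip=True removes leading and trailing whitespace from every text piece.
clean = [td.get_text(strip=True) for td in cells]

# stripped_strings yields the non-empty text pieces one by one.
pieces = [list(td.stripped_strings) for td in cells]

print(raw, clean, pieces)
```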
Get values and URLs from anchor tags as a list of dicts
itversity_details = []
for a in soup.find_all('a'):
    # One dict per anchor tag: its text and its URL
    rec = {'description': a.get_text(), 'url': a['href']}
    itversity_details.append(rec)
itversity_details
[{'description': 'YouTube Channel',
'url': 'https://www.youtube.com/itversityin'},
{'description': 'GitHub Repository',
'url': 'https://www.github.com/dgadiraju/itversity-books'}]
itversity_details[0]['description']
'YouTube Channel'
itversity_details[0]['url']
'https://www.youtube.com/itversityin'
for i in itversity_details:
    print(i['url'])
https://www.youtube.com/itversityin
https://www.github.com/dgadiraju/itversity-books
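The same idea extends from anchor tags to whole rows. The sketch below pairs the th headers with the td values of each row to build one dict per row. The compact HTML string and the extra 'href' key are assumptions introduced for illustration.

```python
from bs4 import BeautifulSoup

# Assumed HTML: compact version of the table in this section.
html_doc = (
    "<table><tbody>"
    "<tr><th>Details</th><th>URL</th></tr>"
    "<tr><td>Video Content</td>"
    '<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>'
    "<tr><td>Reference Material</td>"
    '<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>'
    "</tbody></table>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Column names come from the th cells of the header row.
headers = [th.get_text(strip=True) for th in soup.find_all("th")]

records = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        continue  # the header row has th cells only
    record = dict(zip(headers, (td.get_text(strip=True) for td in cells)))
    # 'href' is a hypothetical extra key holding the link target, if any.
    link = row.find("a")
    record["href"] = link.get("href") if link else None
    records.append(record)

print(records)
```

Keying each record by the header text keeps the code independent of the number of columns, at the cost of assuming every data row has the same cell order as the header.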