Parsing HTML using BeautifulSoup
Now that we have created a BeautifulSoup object, let us explore the APIs (attributes and methods) used to scrape content from HTML.
Fundamentally, a `BeautifulSoup` object is similar to a nested dict with a tree structure.
Let us see some basic examples to understand how we can read tags, attributes, and content within an HTML string.
- Accessing the first occurrence of `tr`.
- Accessing the first `th` value, using the attribute `string` or the method `get_text()`.
- Accessing the first occurrence of an anchor tag.
- Getting the URL from the `href` attribute of an anchor tag.
- Accessing the value of an anchor tag.
- Getting all anchor tags.
- Getting all `td` tags.
- Getting the value from all `td` tags.
- Getting values and URLs from anchor tags as a list of dicts.
%run 03_overview_of_beautifulsoup.ipynb
| Details | URL |
|---|---|
| Video Content | YouTube Channel |
| Reference Material | GitHub Repository |
<table>
<tbody>
<tr>
<th>
Details
</th>
<th>
URL
</th>
</tr>
<tr>
<td>
Video Content
</td>
<td>
<a href="https://www.youtube.com/itversityin">
YouTube Channel
</a>
</td>
</tr>
<tr>
<td>
Reference Material
</td>
<td>
<a href="https://www.github.com/dgadiraju/itversity-books">
GitHub Repository
</a>
</td>
</tr>
</tbody>
</table>
Accessing the first occurrence of `tr`
type(soup)
bs4.BeautifulSoup
soup.table
<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>
soup.table.tbody.tr
<tr>
<th>Details</th>
<th>URL</th>
</tr>
list(soup.table.tbody.children)
['\n',
<tr>
<th>Details</th>
<th>URL</th>
</tr>,
'\n',
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>,
'\n',
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>,
'\n']
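ele = soup.table.tbody.tr
# Walk through the first row and each of its following siblings.
# The siblings include the '\n' strings that sit between the rows.
while ele is not None:
    print(ele)
    ele = ele.next_sibling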
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
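The sibling walk above also visits the whitespace strings between rows. As a minimal sketch (the shortened `html` string and the variable names `rows` and `rows_alt` are assumptions made for illustration), the rows alone can be collected either by keeping only `Tag` objects or by asking for the following `tr` siblings directly:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the table used in this section (illustrative only)
html = """<table><tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td><td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
</tbody></table>"""
soup = BeautifulSoup(html, "html.parser")

first_row = soup.table.tbody.tr

# Option 1: walk next_sibling, keeping Tag objects and skipping text nodes
rows = [first_row]
ele = first_row.next_sibling
while ele is not None:
    if isinstance(ele, Tag):
        rows.append(ele)
    ele = ele.next_sibling

# Option 2: let BeautifulSoup filter the siblings by tag name
rows_alt = [first_row] + first_row.find_next_siblings("tr")

print([row.name for row in rows])      # ['tr', 'tr']
print(len(rows_alt))                   # 2
```

The second option is shorter, while the first shows what the filtering actually does.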
Accessing the first `th` value: we can use the attribute `string` or the method `get_text()`
soup.table.tbody.tr.th
<th>Details</th>
soup.table.tbody.tr.th.string
'Details'
soup.table.tbody.tr.th.get_text()
'Details'
Accessing the first occurrence of an anchor tag
soup.table.tbody.a
<a href="https://www.youtube.com/itversityin">YouTube Channel</a>
Getting the URL from the `href` attribute of the anchor tag
soup.table.tbody.a['href']
'https://www.youtube.com/itversityin'
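Indexing with `['href']` raises `KeyError` when the attribute is missing. A small sketch of the more forgiving `get` method follows; the second anchor without an `href` is a hypothetical addition, not part of the table used in this section:

```python
from bs4 import BeautifulSoup

# The second <a> has no href; it is made up to show the failure case
html = '<p><a href="https://www.youtube.com/itversityin">YouTube Channel</a> <a>No link</a></p>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

print(first["href"])             # https://www.youtube.com/itversityin
print(first.get("href"))         # same value, dict-style get
print(second.get("href"))        # None instead of a KeyError
print(second.get("href", ""))    # a default can be supplied
print(first.attrs)               # all attributes as a dict

try:
    second["href"]
except KeyError:
    print("second anchor has no href")
```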
Accessing the value of the anchor tag
soup.table.tbody.a.string
'YouTube Channel'
soup.table.tbody.a.get_text()
'YouTube Channel'
Get all anchor tags
soup.table.tbody.find_all('a')
[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
<a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]
soup.find_all('a')
[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
<a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]
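`find_all` also accepts filters, and `select` takes a CSS selector. The sketch below is only an illustration: the sample `html` and the placeholder anchor without an `href` are assumptions, and `select` relies on the `soupsieve` package that is normally installed together with `bs4`.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one real link and one anchor without href
html = """<table><tr>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
<td><a>Placeholder</a></td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

all_anchors = soup.find_all("a")               # every <a> tag
with_href = soup.find_all("a", href=True)      # only <a> tags that have href
first_only = soup.find_all("a", limit=1)       # stop after the first match
via_css = soup.select("td > a[href]")          # CSS selector equivalent

print(len(all_anchors), len(with_href), len(first_only), len(via_css))
```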
Get all `td` tags
soup.find_all('td')
[<td>Video Content</td>,
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>,
<td>Reference Material</td>,
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>]
for td in soup.find_all('td'):
    print(td)
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
Get the value from all `td` tags
# .string returns None when a tag has more than one child
# (here, the <a> tag plus the newline that follows it)
for td in soup.find_all('td'):
    print(td.string)
Video Content
None
Reference Material
None
# get_text() concatenates all the text inside the tag, including nested tags
for td in soup.find_all('td'):
    print(td.get_text())
Video Content
YouTube Channel
Reference Material
GitHub Repository
# Stripping trailing newline characters
for td in soup.find_all('td'):
    print(td.get_text().rstrip('\n'))
Video Content
YouTube Channel
Reference Material
GitHub Repository
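To make the difference between `string` and `get_text()` concrete, here is a minimal sketch on a two-cell excerpt of the table (the excerpt and the names `plain_td` and `link_td` are assumptions for illustration). It also shows `get_text(strip=True)` as an alternative to `rstrip('\n')`:

```python
from bs4 import BeautifulSoup

# Two-cell excerpt of the table used in this section
html = """<table><tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

plain_td, link_td = soup.find_all("td")

# One child (a single string): .string returns that string
print(len(plain_td.contents), repr(plain_td.string))   # 1 'Video Content'

# Two children (<a> tag plus trailing newline): .string is None
print(len(link_td.contents), repr(link_td.string))     # 2 None

# get_text() joins all nested text; strip=True trims each piece
print(repr(link_td.get_text()))                        # 'YouTube Channel\n'
print(repr(link_td.get_text(strip=True)))              # 'YouTube Channel'

# stripped_strings yields the same cleaned pieces one at a time
print(list(link_td.stripped_strings))                  # ['YouTube Channel']
```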
Get values and URLs from anchor tags as a list of dicts
itversity_details = []
for a in soup.find_all('a'):
    # One dict per anchor tag: its text and its target URL
    rec = {'description': a.get_text(), 'url': a['href']}
    itversity_details.append(rec)
itversity_details
[{'description': 'YouTube Channel',
'url': 'https://www.youtube.com/itversityin'},
{'description': 'GitHub Repository',
'url': 'https://www.github.com/dgadiraju/itversity-books'}]
itversity_details[0]['description']
'YouTube Channel'
itversity_details[0]['url']
'https://www.youtube.com/itversityin'
for rec in itversity_details:
    print(rec['url'])
https://www.youtube.com/itversityin
https://www.github.com/dgadiraju/itversity-books
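The same idea can be extended to whole rows, so that the text from the first column travels with each link. This is a sketch under assumptions: the key names `details`, `description` and `url` and the variable `records` are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

html = """<table><tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
<tr><td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
</tbody></table>"""
soup = BeautifulSoup(html, "html.parser")

records = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        # The header row contains only <th> cells, so skip it
        continue
    link = tr.find("a")  # None when the row has no anchor tag
    records.append({
        "details": cells[0].get_text(strip=True),
        "description": link.get_text(strip=True) if link is not None else None,
        "url": link.get("href") if link is not None else None,
    })

for rec in records:
    print(rec["details"], rec["description"], rec["url"])
```

Checking `link is not None` keeps the loop from failing on rows that have no anchor tag.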