Parsing HTML using BeautifulSoup

Now that we have created a BeautifulSoup object, let us explore the APIs and methods used to scrape content from HTML.

Fundamentally, a BeautifulSoup object behaves like a nested, dict-like tree: child tags are reachable as attributes of their parent, and tag attributes such as href are read by key.

Let us go through some basic examples to understand how to read tags, attributes, and content within an HTML string.

Accessing first occurrence of tr.
Accessing the first th value using either the string attribute or the get_text() method
Accessing first occurrence of anchor tag
Getting the url from href attribute of anchor tag
Accessing the value of anchor tag.
Get all anchor tags
Get all td tags
Get value from all td tags.
Get values and URLs from anchor tags as a list of dicts

%run 03_overview_of_beautifulsoup.ipynb

Details	URL
Video Content	YouTube Channel
Reference Material	GitHub Repository

<table>
 <tbody>
  <tr>
   <th>
    Details
   </th>
   <th>
    URL
   </th>
  </tr>
  <tr>
   <td>
    Video Content
   </td>
   <td>
    <a href="https://www.youtube.com/itversityin">
     YouTube Channel
    </a>
   </td>
  </tr>
  <tr>
   <td>
    Reference Material
   </td>
   <td>
    <a href="https://www.github.com/dgadiraju/itversity-books">
     GitHub Repository
    </a>
   </td>
  </tr>
 </tbody>
</table>
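
The soup object used below is created in the previous notebook, which is not reproduced here. As a minimal sketch for following along without it, the snippet below rebuilds a comparable object. The html variable and the choice of the built-in "html.parser" parser are assumptions, not necessarily what the earlier notebook uses.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML string that mirrors the table printed above.
html = """
<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>
"""

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)            # BeautifulSoup
print(soup.table.tbody.tr.th.string)  # Details
```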

Accessing first occurrence of tr

type(soup)

bs4.BeautifulSoup

soup.table

<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>

soup.table.tbody.tr

<tr>
<th>Details</th>
<th>URL</th>
</tr>

list(soup.table.tbody.children)

['\n',
 <tr>
 <th>Details</th>
 <th>URL</th>
 </tr>,
 '\n',
 <tr>
 <td>Video Content</td>
 <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
 </td>
 </tr>,
 '\n',
 <tr>
 <td>Reference Material</td>
 <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
 </td>
 </tr>,
 '\n']

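# Walk through the first tr and every sibling that follows it.
# The siblings include the '\n' text nodes between the rows.
ele = soup.table.tbody.tr
while ele is not None:
    print(ele)
    ele = ele.next_sibling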

<tr>
<th>Details</th>
<th>URL</th>
</tr>


<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>


<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
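
The blank lines in the output above come from the '\n' text nodes that sit between the rows. If only the tr tags are wanted, one possible approach is to skip anything that is not a Tag, or to let find_next_siblings do the filtering. The short html string here is a stand-in for the notebook's table, used only to keep the sketch self-contained.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in markup with newlines between rows, as in the table above.
html = (
    "<table><tbody>\n"
    "<tr><th>Details</th></tr>\n"
    "<tr><td>Video Content</td></tr>\n"
    "</tbody></table>"
)
soup = BeautifulSoup(html, "html.parser")
first_row = soup.table.tbody.tr

# Option 1: walk the siblings manually and keep only real tags.
rows = [first_row]
ele = first_row.next_sibling
while ele is not None:
    if isinstance(ele, Tag):
        rows.append(ele)
    ele = ele.next_sibling

# Option 2: ask BeautifulSoup for the following tr siblings directly.
rows_via_search = [first_row, *first_row.find_next_siblings("tr")]

print([row.get_text() for row in rows])  # ['Details', 'Video Content']
print(len(rows_via_search))              # 2
```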

Accessing the first th value using either the string attribute or the get_text() method

soup.table.tbody.tr.th

<th>Details</th>

soup.table.tbody.tr.th.string

'Details'

soup.table.tbody.tr.th.get_text()

'Details'

Accessing first occurrence of anchor tag

soup.table.tbody.a

<a href="https://www.youtube.com/itversityin">YouTube Channel</a>

Getting the url from href attribute of anchor tag

soup.table.tbody.a['href']

'https://www.youtube.com/itversityin'
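
Indexing with ['href'] raises a KeyError when the attribute is absent, which can happen on real pages. A small sketch of the difference between indexing and the get method follows. The second anchor without an href is made up for illustration.

```python
from bs4 import BeautifulSoup

# Two anchors: one with an href and a hypothetical one without.
html = '<a href="https://www.youtube.com/itversityin">YouTube Channel</a><a>No link</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

print(first["href"])       # https://www.youtube.com/itversityin
print(second.get("href"))  # None, no exception

# Indexing a missing attribute raises KeyError.
try:
    second["href"]
    missing = False
except KeyError:
    missing = True
print(missing)             # True

# All attributes of a tag are available as a dict.
print(first.attrs)         # {'href': 'https://www.youtube.com/itversityin'}
```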

Accessing the value of anchor tag.

soup.table.tbody.a.string

'YouTube Channel'

soup.table.tbody.a.get_text()

'YouTube Channel'

Get all anchor tags

soup.table.tbody.find_all('a')

[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
 <a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]

soup.find_all('a')

[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
 <a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]

Get all td tags

soup.find_all('td')

[<td>Video Content</td>,
 <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
 </td>,
 <td>Reference Material</td>,
 <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
 </td>]

for td in soup.find_all('td'):
    print(td)

<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>

Get value from all td tags.

# string returns None when a tag has more than one child (here an anchor tag followed by a new line)
for td in soup.find_all('td'):
    print(td.string)

Video Content
None
Reference Material
None

# get_text concatenates all the text inside the tag, so it works even when the tag has several children
for td in soup.find_all('td'):
    print(td.get_text())

Video Content
YouTube Channel

Reference Material
GitHub Repository

# Stripping new line characters
for td in soup.find_all('td'):
    print(td.get_text().rstrip('\n'))

Video Content
YouTube Channel
Reference Material
GitHub Repository
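
As an alternative to rstrip, get_text accepts a strip argument, and the stripped_strings generator yields each piece of text with surrounding whitespace removed. The sketch below uses a single td modeled on the table above to show why string is None there and how the other options behave.

```python
from bs4 import BeautifulSoup

# One td holding an anchor followed by a new line, as in the table above.
html = '<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>\n</td>'
soup = BeautifulSoup(html, "html.parser")
td = soup.td

# The td has two children (the anchor and '\n'), so string is None.
print(len(td.contents))             # 2
print(td.string)                    # None

# get_text joins all text; strip=True trims each piece and drops empty ones.
print(repr(td.get_text()))          # 'YouTube Channel\n'
print(repr(td.get_text(strip=True)))  # 'YouTube Channel'

# stripped_strings yields the non-empty, trimmed pieces one by one.
print(list(td.stripped_strings))    # ['YouTube Channel']
```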

Get values and URLs from anchor tags as a list of dicts

itversity_details = []
for a in soup.find_all('a'):
    rec = {'description': a.get_text(), 'url': a['href']}
    itversity_details.append(rec)

itversity_details

[{'description': 'YouTube Channel',
  'url': 'https://www.youtube.com/itversityin'},
 {'description': 'GitHub Repository',
  'url': 'https://www.github.com/dgadiraju/itversity-books'}]

itversity_details[0]['description']

'YouTube Channel'

itversity_details[0]['url']

'https://www.youtube.com/itversityin'

for i in itversity_details:
    print(i['url'])

https://www.youtube.com/itversityin
https://www.github.com/dgadiraju/itversity-books
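
The loop that builds itversity_details can also be written as a list comprehension. Passing href=True to find_all keeps only anchors that actually carry an href, which avoids a KeyError on anchors without one. The html string below, including the anchor without an href, is an assumption made to keep the sketch self-contained.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the last anchor has no href and is made up for illustration.
html = (
    '<a href="https://www.youtube.com/itversityin">YouTube Channel</a>\n'
    '<a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>\n'
    '<a>Placeholder</a>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True matches only anchors that have an href attribute.
itversity_details = [
    {"description": a.get_text(strip=True), "url": a["href"]}
    for a in soup.find_all("a", href=True)
]

for rec in itversity_details:
    print(rec["description"], rec["url"])
# YouTube Channel https://www.youtube.com/itversityin
# GitHub Repository https://www.github.com/dgadiraju/itversity-books
```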
