Parsing HTML using BeautifulSoup

Now that we have created a BeautifulSoup object, let us explore the APIs and methods used to scrape content from HTML.

  • Fundamentally, a BeautifulSoup object behaves like a nested dict arranged as a tree: tags contain other tags, and each tag's attributes can be read by key.
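
The dict-and-tree idea can be sketched as follows. This is a minimal illustration, assuming bs4 is installed and using the built-in 'html.parser'; the html string and the example.com URL are made up for the example and are not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration
html = '<table><tr><td><a href="https://example.com" id="home">Home</a></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

a = soup.table.tr.td.a   # tree-style navigation: walk down by tag name
print(a.name)            # name of the tag we landed on
print(a['href'])         # dict-style access to a single attribute
print(a.attrs)           # all attributes as a regular dict
print(a.parent.name)     # walk back up the tree to the enclosing tag
```

Attribute lookups use square brackets like a dict, while moving between tags uses dotted names like object attributes.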

Let us see some basic examples to understand how to read tags, attributes, and content within an HTML string.

  • Accessing first occurrence of tr.

  • Accessing the first th value, using either the string attribute or the get_text() method

  • Accessing first occurrence of anchor tag

  • Getting the url from href attribute of anchor tag

  • Accessing the value of anchor tag.

  • Get all anchor tags

  • Get all td tags

  • Get value from all td tags.

  • Get values and URLs from anchor tags as a list of dicts

%run 03_overview_of_beautifulsoup.ipynb
Details             URL
Video Content       YouTube Channel
Reference Material  GitHub Repository
<table>
 <tbody>
  <tr>
   <th>
    Details
   </th>
   <th>
    URL
   </th>
  </tr>
  <tr>
   <td>
    Video Content
   </td>
   <td>
    <a href="https://www.youtube.com/itversityin">
     YouTube Channel
    </a>
   </td>
  </tr>
  <tr>
   <td>
    Reference Material
   </td>
   <td>
    <a href="https://www.github.com/dgadiraju/itversity-books">
     GitHub Repository
    </a>
   </td>
  </tr>
 </tbody>
</table>
  • Accessing first occurrence of tr

type(soup)
bs4.BeautifulSoup
soup.table
<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>
soup.table.tbody.tr
<tr>
<th>Details</th>
<th>URL</th>
</tr>
list(soup.table.tbody.children)
['\n',
 <tr>
 <th>Details</th>
 <th>URL</th>
 </tr>,
 '\n',
 <tr>
 <td>Video Content</td>
 <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
 </td>
 </tr>,
 '\n',
 <tr>
 <td>Reference Material</td>
 <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
 </td>
 </tr>,
 '\n']
ele = soup.table.tbody.tr
# Walk the sibling chain; next_sibling returns None after the last node
while ele is not None:
    print(ele)
    ele = ele.next_sibling
<tr>
<th>Details</th>
<th>URL</th>
</tr>


<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>


<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
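
The outputs above show that the '\n' text nodes between rows count as children and siblings too. One possible way to skip them is sketched here, assuming bs4 is installed; the html string is a trimmed stand-in for the notebook's table, not the original input.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the table used in this section
html = """<table>
<tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td><td>YouTube Channel</td></tr>
</tbody>
</table>"""
soup = BeautifulSoup(html, 'html.parser')
tbody = soup.table.tbody

# recursive=False limits the search to direct children,
# and asking for 'tr' leaves the whitespace text nodes out
rows = tbody.find_all('tr', recursive=False)
print(len(rows))

# find_next_sibling('tr') jumps over the '\n' node between the two rows
second = tbody.tr.find_next_sibling('tr')
print(second.td.get_text())
```

Plain next_sibling is still useful when the text nodes matter; the filtered calls are simply more convenient when only tags are of interest.
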
  • Accessing the first th value, using either the string attribute or the get_text() method

soup.table.tbody.tr.th
<th>Details</th>
soup.table.tbody.tr.th.string
'Details'
soup.table.tbody.tr.th.get_text()
'Details'
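
Both calls return the same value here because th holds a single piece of text. A small sketch of where they differ, assuming bs4 is installed; the html string is invented for illustration.

```python
from bs4 import BeautifulSoup

# Three hypothetical paragraphs: plain text, one nested tag, text plus a nested tag
html = '<div><p>plain</p><p><b>nested</b></p><p>two <b>parts</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')

for p in soup.find_all('p'):
    # .string needs exactly one child (it follows a single nested tag down);
    # get_text() joins every piece of text found inside the tag
    print(repr(p.string), repr(p.get_text()))

strings = [p.string for p in soup.find_all('p')]
texts = [p.get_text() for p in soup.find_all('p')]
```

The third paragraph has two children, so string is None while get_text() still returns 'two parts'.
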
  • Accessing first occurrence of anchor tag

soup.table.tbody.a
<a href="https://www.youtube.com/itversityin">YouTube Channel</a>
  • Getting the url from href attribute of anchor tag

soup.table.tbody.a['href']
'https://www.youtube.com/itversityin'
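
Square-bracket access raises KeyError when the attribute is missing. A hedged sketch of the more forgiving get() method, assuming bs4 is installed; the html string and the example.com URL are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second anchor has no href attribute
html = '<p><a href="https://example.com">With link</a> <a>No link</a></p>'
soup = BeautifulSoup(html, 'html.parser')

for a in soup.find_all('a'):
    # a['href'] would raise KeyError on the second anchor; get() returns None instead
    print(a.get_text(), a.get('href'))

urls = [a.get('href') for a in soup.find_all('a')]
```

This mirrors dict.get(), which fits the dict-like behaviour of tag attributes.
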
  • Accessing the value of anchor tag.

soup.table.tbody.a.string
'YouTube Channel'
soup.table.tbody.a.get_text()
'YouTube Channel'
  • Get all anchor tags

soup.table.tbody.find_all('a')
[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
 <a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]
soup.find_all('a')
[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
 <a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]
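
find_all can also filter on attributes, and select() accepts CSS selectors. A minimal sketch, assuming bs4 is installed together with its soupsieve dependency (which select() relies on); the second table row is invented to show the filtering.

```python
from bs4 import BeautifulSoup

# The second row is hypothetical: its anchor has no href
html = """<table><tbody>
<tr><td>Video Content</td><td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
<tr><td>No link</td><td><a>Placeholder</a></td></tr>
</tbody></table>"""
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only anchors that actually carry an href attribute
linked = soup.find_all('a', href=True)
print(len(linked))

# 'td a' is a CSS selector meaning "anchors nested anywhere inside a td"
nested = soup.select('td a')
print(len(nested))
```
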
  • Get all td tags

soup.find_all('td')
[<td>Video Content</td>,
 <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
 </td>,
 <td>Reference Material</td>,
 <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
 </td>]
for td in soup.find_all('td'):
    print(td)
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
  • Get value from all td tags.

# string returns None when a tag has more than one child (here, an anchor tag plus a trailing newline)
for td in soup.find_all('td'):
    print(td.string)
Video Content
None
Reference Material
None
# get_text() joins all the text inside the tag, including text of nested tags and newlines
for td in soup.find_all('td'):
    print(td.get_text())
Video Content
YouTube Channel

Reference Material
GitHub Repository
# Strip the trailing newline characters from each value
for td in soup.find_all('td'):
    print(td.get_text().rstrip('\n'))
Video Content
YouTube Channel
Reference Material
GitHub Repository
  • Get values and URLs from anchor tags as a list of dicts

itversity_details = []
for a in soup.find_all('a'):
    rec = {'description': a.get_text(), 'url': a['href']}
    itversity_details.append(rec)

itversity_details
[{'description': 'YouTube Channel',
  'url': 'https://www.youtube.com/itversityin'},
 {'description': 'GitHub Repository',
  'url': 'https://www.github.com/dgadiraju/itversity-books'}]
itversity_details[0]['description']
'YouTube Channel'
itversity_details[0]['url']
'https://www.youtube.com/itversityin'
for rec in itversity_details:
    print(rec['url'])
https://www.youtube.com/itversityin
https://www.github.com/dgadiraju/itversity-books
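
The same idea can be extended to whole rows, so that each record also keeps the Details column. This is one possible sketch rather than the notebook's own approach, assuming bs4 is installed; the records name and the 'details' key are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Compact copy of the table used in this section
html = """<table><tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td><td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
<tr><td>Reference Material</td><td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
</tbody></table>"""
soup = BeautifulSoup(html, 'html.parser')

records = []
for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    if not cells:
        # The header row contains th cells only, so there is nothing to extract
        continue
    a = cells[1].find('a')
    records.append({
        'details': cells[0].get_text(strip=True),
        'description': a.get_text(strip=True),
        'url': a['href'],
    })

for rec in records:
    print(rec)
```

Working row by row keeps each description tied to the label in the first column, which a flat find_all('a') loses.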