## Parsing HTML using BeautifulSoup

Now that we have created a `BeautifulSoup` object, let us explore the APIs we can use to scrape content from HTML.

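* Fundamentally, a `BeautifulSoup` object represents the parsed HTML as a tree. Tags can be reached as nested attributes (for example `soup.table.tbody.tr`), and the attributes of a tag can be read with dict-style lookups (for example `a['href']`).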
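
The tree idea can be tried out on a small document. This is a minimal sketch, assuming a cut-down version of the table used in this section and Python's built-in `html.parser`; the variable names (`html`, `first_th`, `child_names`) are illustrative and not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Assumption: a compact, single-row version of the table used in this section
html = "<table><tbody><tr><th>Details</th><th>URL</th></tr></tbody></table>"
soup = BeautifulSoup(html, "html.parser")

# Walk down the tree one tag at a time
first_th = soup.table.tbody.tr.th
print(first_th.name)         # name of the tag we landed on
print(first_th.parent.name)  # every tag knows its parent

# Children of the row are the two header cells
child_names = [child.name for child in soup.table.tbody.tr.children]
print(child_names)
```

Each dotted step returns the first matching tag below the current one, which is why the same style of access works at any depth.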

Let us go through some basic examples to understand how to read the tags, attributes, and content within an HTML string.
* Accessing the first occurrence of `tr`
* Accessing the first `th` value using the `string` attribute or the `get_text()` method
* Accessing the first occurrence of an anchor tag
* Getting the URL from the `href` attribute of an anchor tag
* Accessing the value of an anchor tag
* Getting all anchor tags
* Getting all `td` tags
* Getting the values from all `td` tags
* Getting values and URLs from anchor tags as a list of dicts

In [1]:
%run 03_overview_of_beautifulsoup.ipynb

Details,URL
Video Content,YouTube Channel
Reference Material,GitHub Repository


<table>
 <tbody>
  <tr>
   <th>
    Details
   </th>
   <th>
    URL
   </th>
  </tr>
  <tr>
   <td>
    Video Content
   </td>
   <td>
    <a href="https://www.youtube.com/itversityin">
     YouTube Channel
    </a>
   </td>
  </tr>
  <tr>
   <td>
    Reference Material
   </td>
   <td>
    <a href="https://www.github.com/dgadiraju/itversity-books">
     GitHub Repository
    </a>
   </td>
  </tr>
 </tbody>
</table>


* Accessing first occurrence of `tr`

In [2]:
type(soup)

bs4.BeautifulSoup

In [3]:
soup.table

<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>

In [4]:
soup.table.tbody.tr

<tr>
<th>Details</th>
<th>URL</th>
</tr>

In [5]:
list(soup.table.tbody.children)

['\n',
 <tr>
 <th>Details</th>
 <th>URL</th>
 </tr>,
 '\n',
 <tr>
 <td>Video Content</td>
 <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
 </td>
 </tr>,
 '\n',
 <tr>
 <td>Reference Material</td>
 <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
 </td>
 </tr>,
 '\n']

In [6]:
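ele = soup.table.tbody.tr
# Walk through the first row and its siblings; next_sibling is None after the last node.
# The whitespace between rows shows up as '\n' strings, which are printed as blank lines.
while ele is not None:
    print(ele)
    ele = ele.next_sibling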

<tr>
<th>Details</th>
<th>URL</th>
</tr>


<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>


<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
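
The sibling walk above also visits the `'\n'` strings that sit between rows. Here is a minimal sketch of two ways to keep only the tags, assuming a simplified copy of the table (without the links) and the built-in `html.parser`. `find_next_siblings` and the `Tag` class come from Beautiful Soup; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a simplified copy of the table from this section, without the links
html = """<table>
<tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td><td>YouTube Channel</td></tr>
<tr><td>Reference Material</td><td>GitHub Repository</td></tr>
</tbody>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: filter the children and keep only Tag objects (drops the '\n' strings)
tag_children = [child for child in soup.table.tbody.children if isinstance(child, Tag)]

# Option 2: start from the first row and ask only for sibling <tr> tags
first_row = soup.table.tbody.tr
rows = [first_row, *first_row.find_next_siblings("tr")]

# Collect the text of the header and data cells of each row
row_texts = [[cell.get_text() for cell in row.find_all(["th", "td"])] for row in rows]
print(row_texts)
```

Either option avoids checking for whitespace by hand; which one reads better depends on whether the starting point is the parent or one of the rows.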




* Accessing the first `th` value using the `string` attribute or the `get_text()` method

In [7]:
soup.table.tbody.tr.th

<th>Details</th>

In [8]:
soup.table.tbody.tr.th.string

'Details'

In [9]:
soup.table.tbody.tr.th.get_text()

'Details'

* Accessing first occurrence of anchor tag

In [15]:
soup.table.tbody.a

<a href="https://www.youtube.com/itversityin">YouTube Channel</a>

* Getting the url from `href` attribute of anchor tag

In [16]:
soup.table.tbody.a['href']

'https://www.youtube.com/itversityin'
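
Indexing with `['href']` raises a `KeyError` when the attribute is missing. The following is a minimal sketch of safer access, assuming a small snippet in which the second anchor is a hypothetical one with no `href`, added only for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: the second anchor (without href) is hypothetical, added for illustration
html = (
    '<p><a href="https://www.youtube.com/itversityin">YouTube Channel</a> '
    "<a>No link</a></p>"
)
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

print(first["href"])            # dict-style lookup, fails if the attribute is absent
print(first.attrs)              # all attributes of the tag as a dict
print(second.get("href"))       # get() returns None instead of raising KeyError
print(second.get("href", ""))   # an optional default can be supplied
```

Using `get()` is convenient when scraping pages where some anchors are placeholders without a target.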

* Accessing the value of anchor tag.

In [17]:
soup.table.tbody.a.string

'YouTube Channel'

In [18]:
soup.table.tbody.a.get_text()

'YouTube Channel'

* Get all anchor tags

In [19]:
soup.table.tbody.find_all('a')

[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
 <a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]

In [20]:
soup.find_all('a')

[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
 <a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]
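
`find_all` also accepts attribute filters, and Beautiful Soup offers CSS selectors through `select` and `select_one`. Below is a minimal sketch of both, assuming a copy of the table from this section and the built-in `html.parser`; it is an alternative way to reach the same anchors, not the approach used in the notebook.

```python
from bs4 import BeautifulSoup

# Assumption: this string mirrors the table used in this section
html = """
<table>
<tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
<tr><td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the anchors that actually carry an href attribute
anchors_with_href = soup.find_all("a", href=True)

# CSS selector: anchors nested anywhere inside a td
td_anchors = soup.select("td a")

# First match only, similar to soup.a
first_anchor = soup.select_one("td a")

urls = [a["href"] for a in td_anchors]
print(urls)
```

Selectors are handy when the position of a tag matters (for example, only anchors inside table cells), while `find_all` is enough for simple lookups by tag name.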

* Get all `td` tags

In [21]:
soup.find_all('td')

[<td>Video Content</td>,
 <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
 </td>,
 <td>Reference Material</td>,
 <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
 </td>]

In [22]:
for td in soup.find_all('td'):
    print(td)

<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>


* Get value from all `td` tags.

In [23]:
# string returns None when a tag has more than one child (here an anchor tag followed by a new line)
for td in soup.find_all('td'):
    print(td.string)

Video Content
None
Reference Material
None


In [24]:
# get_text returns all the text inside the tag, including nested tags and new line characters
for td in soup.find_all('td'):
    print(td.get_text())

Video Content
YouTube Channel

Reference Material
GitHub Repository



In [25]:
# Stripping the trailing new line characters
for td in soup.find_all('td'):
    print(td.get_text().rstrip('\n'))

Video Content
YouTube Channel
Reference Material
GitHub Repository
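
The `None` values above come from the structure of the cell rather than from the new line itself: `string` only returns a value when the tag has exactly one child. This is a minimal sketch that makes the difference visible, assuming a two-cell excerpt of the table and the built-in `html.parser`; `get_text(strip=True)` is shown as an alternative to `rstrip('\n')`.

```python
from bs4 import BeautifulSoup

# Assumption: a two-cell excerpt of the table used in this section
html = (
    "<table><tr><td>Video Content</td>"
    '<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>\n</td>'
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

plain_td, link_td = soup.find_all("td")

print(len(plain_td.contents))         # a single text node
print(len(link_td.contents))          # an anchor tag plus a '\n' text node
print(plain_td.string)                # works: exactly one child
print(link_td.string)                 # None: more than one child
print(repr(link_td.get_text()))       # text of all descendants, new line included
print(link_td.get_text(strip=True))   # whitespace stripped from each piece of text
```

`get_text(strip=True)` removes leading and trailing whitespace from every piece of text, so it also covers cases where the new line comes before the value.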


* Get values and URLs from anchor tags as a list of dicts

In [26]:
itversity_details = []
for a in soup.find_all('a'):
    rec = {'description': a.get_text(), 'url': a['href']}
    itversity_details.append(rec)

itversity_details

[{'description': 'YouTube Channel',
  'url': 'https://www.youtube.com/itversityin'},
 {'description': 'GitHub Repository',
  'url': 'https://www.github.com/dgadiraju/itversity-books'}]

In [27]:
itversity_details[0]['description']

'YouTube Channel'

In [28]:
itversity_details[0]['url']

'https://www.youtube.com/itversityin'

In [29]:
for rec in itversity_details:
    print(rec['url'])

https://www.youtube.com/itversityin
https://www.github.com/dgadiraju/itversity-books
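
The loop that builds `itversity_details` can also be written as a list comprehension. Here is a minimal sketch, assuming a copy of the table from this section and the built-in `html.parser`; the `href=True` filter and `strip=True` are additions that guard against anchors without a link and against stray whitespace.

```python
from bs4 import BeautifulSoup

# Assumption: this string mirrors the table used in this section
html = """
<table>
<tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td></tr>
<tr><td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# One dict per anchor: the visible text and the link target
itversity_details = [
    {"description": a.get_text(strip=True), "url": a["href"]}
    for a in soup.find_all("a", href=True)
]

# Pull a single field out of every record
urls = [rec["url"] for rec in itversity_details]
print(urls)
```

A list of dicts like this is easy to pass on to other tools, for example to write as JSON or to load into a DataFrame.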
