{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing HTML using BeautifulSoup\n",
    "\n",
    "As we have created beautiful soup object, let us explore APIs or methods to scrape the content in HTML. \n",
    "\n",
    "* Fundamentally `BeautifulSoup` object is similar to a complex dict with tree structure.\n",
    "\n",
    "Let us see some basic examples to understand how we can read the tags or attributes or content with in HTML string.\n",
    "* Accessing first occurrence of `tr`.\n",
    "* Accessing first `th` value, we can use attribute `string` or method `get_text()`\n",
    "* Accessing first occurrence of anchor tag\n",
    "* Getting the url from `href` attribute of anchor tag\n",
    "* Accessing the value of anchor tag.\n",
    "* Get all anchor tags\n",
    "* Get all `td` tags\n",
    "* Get value from all `td` tags.\n",
    "* Get values and URLs from anchor tags as a list of dicts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/bR9MzFdRZew?rel=0&amp;controls=1&amp;showinfo=0\" frameborder=\"0\" allowfullscreen></iframe>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%HTML\n",
    "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/bR9MzFdRZew?rel=0&amp;controls=1&amp;showinfo=0\" frameborder=\"0\" allowfullscreen></iframe>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tbody>\n",
       "        <tr>\n",
       "            <th>Details</th>\n",
       "            <th>URL</th>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>Video Content</td>\n",
       "            <td><a href=\"https://www.youtube.com/itversityin\">YouTube Channel</a></td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>Reference Material</td>\n",
       "            <td><a href=\"https://www.github.com/dgadiraju/itversity-books\">GitHub Repository</a></td>\n",
       "        </tr>\n",
       "    </tbody>\n",
       "</table>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<table>\n",
      " <tbody>\n",
      "  <tr>\n",
      "   <th>\n",
      "    Details\n",
      "   </th>\n",
      "   <th>\n",
      "    URL\n",
      "   </th>\n",
      "  </tr>\n",
      "  <tr>\n",
      "   <td>\n",
      "    Video Content\n",
      "   </td>\n",
      "   <td>\n",
      "    <a href=\"https://www.youtube.com/itversityin\">\n",
      "     YouTube Channel\n",
      "    </a>\n",
      "   </td>\n",
      "  </tr>\n",
      "  <tr>\n",
      "   <td>\n",
      "    Reference Material\n",
      "   </td>\n",
      "   <td>\n",
      "    <a href=\"https://www.github.com/dgadiraju/itversity-books\">\n",
      "     GitHub Repository\n",
      "    </a>\n",
      "   </td>\n",
      "  </tr>\n",
      " </tbody>\n",
      "</table>\n"
     ]
    }
   ],
   "source": [
    "%run 03_overview_of_beautifulsoup.ipynb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Accessing first occurrence of `tr`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "bs4.BeautifulSoup"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(soup)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<table>\n",
       "<tbody>\n",
       "<tr>\n",
       "<th>Details</th>\n",
       "<th>URL</th>\n",
       "</tr>\n",
       "<tr>\n",
       "<td>Video Content</td>\n",
       "<td><a href=\"https://www.youtube.com/itversityin\">YouTube Channel</a>\n",
       "</td>\n",
       "</tr>\n",
       "<tr>\n",
       "<td>Reference Material</td>\n",
       "<td><a href=\"https://www.github.com/dgadiraju/itversity-books\">GitHub Repository</a>\n",
       "</td>\n",
       "</tr>\n",
       "</tbody>\n",
       "</table>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tr>\n",
       "<th>Details</th>\n",
       "<th>URL</th>\n",
       "</tr>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table.tbody.tr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['\\n',\n",
       " <tr>\n",
       " <th>Details</th>\n",
       " <th>URL</th>\n",
       " </tr>,\n",
       " '\\n',\n",
       " <tr>\n",
       " <td>Video Content</td>\n",
       " <td><a href=\"https://www.youtube.com/itversityin\">YouTube Channel</a>\n",
       " </td>\n",
       " </tr>,\n",
       " '\\n',\n",
       " <tr>\n",
       " <td>Reference Material</td>\n",
       " <td><a href=\"https://www.github.com/dgadiraju/itversity-books\">GitHub Repository</a>\n",
       " </td>\n",
       " </tr>,\n",
       " '\\n']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(soup.table.tbody.children)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<tr>\n",
      "<th>Details</th>\n",
      "<th>URL</th>\n",
      "</tr>\n",
      "\n",
      "\n",
      "<tr>\n",
      "<td>Video Content</td>\n",
      "<td><a href=\"https://www.youtube.com/itversityin\">YouTube Channel</a>\n",
      "</td>\n",
      "</tr>\n",
      "\n",
      "\n",
      "<tr>\n",
      "<td>Reference Material</td>\n",
      "<td><a href=\"https://www.github.com/dgadiraju/itversity-books\">GitHub Repository</a>\n",
      "</td>\n",
      "</tr>\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "ele = soup.table.tbody.tr\n",
    "while True:\n",
    "    if not ele:\n",
    "        break\n",
    "    print(ele)\n",
    "    ele = ele.next_sibling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Accessing first `th` value, we can use attribute `string` or method `get_text()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<th>Details</th>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table.tbody.tr.th"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Details'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table.tbody.tr.th.string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Details'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table.tbody.tr.th.get_text()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Accessing first occurrence of anchor tag"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<a href=\"https://www.youtube.com/itversityin\">YouTube Channel</a>"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table.tbody.a"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Getting the url from `href` attribute of anchor tag"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://www.youtube.com/itversityin'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table.tbody.a['href']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Accessing the value of anchor tag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'YouTube Channel'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table.tbody.a.string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'YouTube Channel'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table.tbody.a.get_text()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Get all anchor tags"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<a href=\"https://www.youtube.com/itversityin\">YouTube Channel</a>,\n",
       " <a href=\"https://www.github.com/dgadiraju/itversity-books\">GitHub Repository</a>]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.table.tbody.find_all('a')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<a href=\"https://www.youtube.com/itversityin\">YouTube Channel</a>,\n",
       " <a href=\"https://www.github.com/dgadiraju/itversity-books\">GitHub Repository</a>]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.find_all('a')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Get all `td` tags"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<td>Video Content</td>,\n",
       " <td><a href=\"https://www.youtube.com/itversityin\">YouTube Channel</a>\n",
       " </td>,\n",
       " <td>Reference Material</td>,\n",
       " <td><a href=\"https://www.github.com/dgadiraju/itversity-books\">GitHub Repository</a>\n",
       " </td>]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.find_all('td')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<td>Video Content</td>\n",
      "<td><a href=\"https://www.youtube.com/itversityin\">YouTube Channel</a>\n",
      "</td>\n",
      "<td>Reference Material</td>\n",
      "<td><a href=\"https://www.github.com/dgadiraju/itversity-books\">GitHub Repository</a>\n",
      "</td>\n"
     ]
    }
   ],
   "source": [
    "for a in soup.find_all('td'):\n",
    "    print(a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Get value from all `td` tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Video Content\n",
      "None\n",
      "Reference Material\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "# If the text in the tag have characters like new line, string might return None\n",
    "for td in soup.find_all('td'):\n",
    "    print(td.string)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Video Content\n",
      "YouTube Channel\n",
      "\n",
      "Reference Material\n",
      "GitHub Repository\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# If the text in the tag have characters like new line, we can use get_text\n",
    "for td in soup.find_all('td'):\n",
    "    print(td.get_text())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Video Content\n",
      "YouTube Channel\n",
      "Reference Material\n",
      "GitHub Repository\n"
     ]
    }
   ],
   "source": [
    "# Stripping new line characters\n",
    "for td in soup.find_all('td'):\n",
    "    print(td.get_text().rstrip('\\n'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Get values and URLs from anchor tags as a list of dicts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'description': 'YouTube Channel',\n",
       "  'url': 'https://www.youtube.com/itversityin'},\n",
       " {'description': 'GitHub Repository',\n",
       "  'url': 'https://www.github.com/dgadiraju/itversity-books'}]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "itversity_details = []\n",
    "for a in soup.find_all('a'):\n",
    "    rec = {'description': a.get_text(), 'url': a['href']}\n",
    "    itversity_details.append(rec)\n",
    "\n",
    "itversity_details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'YouTube Channel'"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "itversity_details[0]['description']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://www.youtube.com/itversityin'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "itversity_details[0]['url']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://www.youtube.com/itversityin\n",
      "https://www.github.com/dgadiraju/itversity-books\n"
     ]
    }
   ],
   "source": [
    "for i in itversity_details:\n",
    "    print(i['url'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}