{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Parsing HTML using BeautifulSoup\n", "\n", "Now that we have created a `BeautifulSoup` object, let us explore the APIs used to scrape content from the HTML.\n", "\n", "* Fundamentally, a `BeautifulSoup` object is a tree of nested tags that can be navigated somewhat like a complex nested dict.\n", "\n", "Let us go through some basic examples to understand how to read the tags, attributes, and content within an HTML string.\n", "* Accessing first occurrence of `tr`\n", "* Accessing first `th` value using the attribute `string` or the method `get_text()`\n", "* Accessing first occurrence of anchor tag\n", "* Getting the url from `href` attribute of anchor tag\n", "* Accessing the value of anchor tag.\n", "* Get all anchor tags\n", "* Get all `td` tags\n", "* Get value from all `td` tags.\n", "* Get values and URLs from anchor tags as a list of dicts" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DetailsURL
Video ContentYouTube Channel
Reference MaterialGitHub Repository
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " Details\n", " \n", " URL\n", "
\n", " Video Content\n", " \n", " \n", " YouTube Channel\n", " \n", "
\n", " Reference Material\n", " \n", " \n", " GitHub Repository\n", " \n", "
\n" ] } ], "source": [ "%run 03_overview_of_beautifulsoup.ipynb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Accessing first occurrence of `tr`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.BeautifulSoup" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(soup)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
DetailsURL
Video ContentYouTube Channel\n", "
Reference MaterialGitHub Repository\n", "
" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "Details\n", "URL\n", "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table.tbody.tr" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['\\n',\n", " \n", " Details\n", " URL\n", " ,\n", " '\\n',\n", " \n", " Video Content\n", " YouTube Channel\n", " \n", " ,\n", " '\\n',\n", " \n", " Reference Material\n", " GitHub Repository\n", " \n", " ,\n", " '\\n']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(soup.table.tbody.children)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Details\n", "URL\n", "\n", "\n", "\n", "\n", "Video Content\n", "YouTube Channel\n", "\n", "\n", "\n", "\n", "\n", "Reference Material\n", "GitHub Repository\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "# Start at the first row and walk through each of its following siblings\n", "ele = soup.table.tbody.tr\n", "while ele is not None:\n", "    print(ele)\n", "    ele = ele.next_sibling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Accessing first `th` value using the attribute `string` or the method `get_text()`" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Details" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table.tbody.tr.th" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Details'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table.tbody.tr.th.string" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "'Details'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table.tbody.tr.th.get_text()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Accessing first occurrence of anchor tag" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "YouTube Channel" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table.tbody.a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Getting the url from `href` attribute of anchor tag" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.youtube.com/itversityin'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table.tbody.a['href']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Accessing the value of anchor tag." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'YouTube Channel'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table.tbody.a.string" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'YouTube Channel'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table.tbody.a.get_text()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Get all anchor tags" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[YouTube Channel,\n", " GitHub Repository]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.table.tbody.find_all('a')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[YouTube Channel,\n", " GitHub 
Repository]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('a')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Get all `td` tags" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Video Content,\n", " YouTube Channel\n", " ,\n", " Reference Material,\n", " GitHub Repository\n", " ]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('td')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Video Content\n", "YouTube Channel\n", "\n", "Reference Material\n", "GitHub Repository\n", "\n" ] } ], "source": [ "for td in soup.find_all('td'):\n", "    print(td)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Get value from all `td` tags." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Video Content\n", "None\n", "Reference Material\n", "None\n" ] } ], "source": [ "# string returns None when a tag has more than one child\n", "# (here, the td tags that hold an anchor tag followed by a new line)\n", "for td in soup.find_all('td'):\n", "    print(td.string)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Video Content\n", "YouTube Channel\n", "\n", "Reference Material\n", "GitHub Repository\n", "\n" ] } ], "source": [ "# get_text joins the text of all the children, so it also works for tags with more than one child\n", "for td in soup.find_all('td'):\n", "    print(td.get_text())" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Video Content\n", "YouTube Channel\n", "Reference Material\n", "GitHub Repository\n" ] } ], "source": [ "# Stripping trailing new line characters from the text\n", "for td in soup.find_all('td'):\n", " 
print(td.get_text().rstrip('\\n'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Get values and URLs from anchor tags as a list of dicts" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'description': 'YouTube Channel',\n", " 'url': 'https://www.youtube.com/itversityin'},\n", " {'description': 'GitHub Repository',\n", " 'url': 'https://www.github.com/dgadiraju/itversity-books'}]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itversity_details = []\n", "for a in soup.find_all('a'):\n", " rec = {'description': a.get_text(), 'url': a['href']}\n", " itversity_details.append(rec)\n", "\n", "itversity_details" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'YouTube Channel'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itversity_details[0]['description']" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.youtube.com/itversityin'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itversity_details[0]['url']" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.youtube.com/itversityin\n", "https://www.github.com/dgadiraju/itversity-books\n" ] } ], "source": [ "for i in itversity_details:\n", " print(i['url'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 
"3.6.12" } }, "nbformat": 4, "nbformat_minor": 4 }