Processing Website Content

We can process the website content and extract HTML tags as well as data using BeautifulSoup.

  • We build the BeautifulSoup object by passing the page content along with the parser name html.parser, which is Python's built-in HTML parser.

  • Let us prettify and print the content.

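import requests
from bs4 import BeautifulSoup

python_base_url = 'https://python.itversity.com'
python_url = f'{python_base_url}/mastering-python.html'

# Download the page; python_page.content holds the raw HTML as bytes
python_page = requests.get(python_url)

# Parse the HTML with the built-in parser and print it with indentation
soup = BeautifulSoup(python_page.content, 'html.parser')
print(soup.prettify())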
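The same parsing step can be tried without a network call. The following is a minimal sketch that assumes a small, made-up HTML string in place of python_page.content; the names sample_html and sample_soup are illustrative and not part of the course page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for python_page.content
sample_html = (
    '<html><head><title>Mastering Python</title></head>'
    '<body><a href="index.html">Home</a></body></html>'
)

# Build the BeautifulSoup object with Python's built-in parser
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tags can be reached by name; find returns the first matching tag
page_title = sample_soup.title.string
first_link = sample_soup.find('a')

print(page_title)
print(first_link['href'])
```

Working against a fixed string like this makes it easier to check each call before pointing it at the live page.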
  • Let us extract all the a tags using soup.find_all. This gives us the links provided as part of this web page.

  • Here is the code snippet to get the a tags from the landing page.

for a in soup.find_all('a'):
    print(a)
<a class="navbar-brand text-wrap" href="index.html">
<h1 class="site-logo" id="site-title">Mastering Python</h1>
</a>
<a class="reference internal" href="#">
   Mastering Python
  </a>
<a class="reference internal" href="01_overview_of_windows_os/01_overview_of_windows_os.html">
   Overview of Windows Operating System
  </a>
<a class="reference internal" href="04_postgres_database_operations/01_postgres_database_operations.html">
   Perform Database Operations
  </a>
<a class="reference internal" href="05_getting_started_with_python/01_getting_started_with_python.html">
   Getting Started with Python
  </a>
<a class="reference internal" href="06_basic_programming_constructs/01_basic_programming_constructs.html">
   Basic Programming Constructs
  </a>
<a class="reference internal" href="07_pre_defined_functions/01_pre_defined_functions.html">
   Pre-defined Functions
  </a>
<a class="reference internal" href="08_user_defined_functions/01_user_defined_functions.html">
   User Defined Functions
  </a>
<a class="reference internal" href="09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html">
   Overview of Collections - list and set
  </a>
<a class="reference internal" href="10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html">
   Overview of Collections - dict and tuple
  </a>
<a class="reference internal" href="11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html">
   Manipulating Collections using Loops
  </a>
<a class="reference internal" href="12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html">
   Development of Map Reduce APIs
  </a>
<a class="reference internal" href="13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html">
   Understanding Python Map Reduce Libraries
  </a>
<a class="reference internal" href="14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html">
   Overview of Object Oriented Programming
  </a>
<a class="reference internal" href="15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html">
   Overview of Pandas Libraries
  </a>
<a class="reference internal" href="16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html">
   Web Scraping using Beautiful Soup
  </a>
<a class="reference internal" href="17_database_programming_crud_operations/01_database_programming_crud_operations.html">
   Database Programming – CRUD Operations
  </a>
<a class="reference internal" href="18_database_programming_batch_operations/01_database_programming_batch_operations.html">
   Database Programming – Batch Operations
  </a>
<a class="reference internal" href="19_project_web_scraping_into_database/01_project_web_scraping_into_database.html">
   Project – Web Scraping and loading into Database
  </a>
<a href="http://notifyme.itversity.com">Newsletter</a>
<a class="dropdown-buttons" href="_sources/mastering-python.ipynb"><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Download source file" type="button">.ipynb</button></a>
<a class="repository-button" href="https://github.com/itversity/mastering-python"><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Source repository" type="button"><i class="fab fa-github"></i>repository</button></a>
<a class="issues-button" href="https://github.com/itversity/mastering-python/issues/new?title=Issue%20on%20page%20%2Fmastering-python.html&amp;body=Your%20issue%20content%20here."><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Open an issue" type="button"><i class="fas fa-lightbulb"></i>open issue</button></a>
<a class="edit-button" href="https://github.com/itversity/mastering-python/edit/master/mastering-python.ipynb"><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Edit this page" type="button"><i class="fas fa-pencil-alt"></i>suggest edit</button></a>
<a class="full-screen-button"><button aria-label="Fullscreen mode" class="btn btn-secondary topbarbtn" data-placement="bottom" data-toggle="tooltip" onclick="toggleFullScreen()" title="Fullscreen mode" type="button"><i class="fas fa-expand"></i></button></a>
<a class="binder-button" href="https://mybinder.org/v2/gh/itversity/mastering-python/master?urlpath=tree/mastering-python.ipynb"><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Launch Binder" type="button"><img alt="Interact on binder" class="binder-button-logo" src="_static/images/logo_binder.svg"/>Binder</button></a>
<a class="reference internal nav-link" href="#about-python">
   About Python
  </a>
<a class="reference internal nav-link" href="#course-details">
   Course Details
  </a>
<a class="reference internal nav-link" href="#desired-audience">
   Desired Audience
  </a>
<a class="reference internal nav-link" href="#prerequisites">
   Prerequisites
  </a>
<a class="reference internal nav-link" href="#key-objectives">
   Key Objectives
  </a>
<a class="reference internal nav-link" href="#training-approach">
   Training Approach
  </a>
<a class="reference internal nav-link" href="#self-evaluation">
   Self Evaluation
  </a>
<a class="headerlink" href="#mastering-python" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#about-python" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#course-details" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#desired-audience" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#prerequisites" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#key-objectives" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#training-approach" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#self-evaluation" title="Permalink to this headline">¶</a>
<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>
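  • We can use tag.string to get only the text of a tag. It returns None when the tag contains more than one child, whereas tag.get_text() always returns the combined text of the tag and its descendants.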

for a in soup.find_all('a'):
    print(a.string)
for a in soup.find_all('a'):
    print(a.get_text())
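The difference between the two loops above shows up on tags that contain nested markup. Here is a small sketch, assuming a made-up HTML string (sample_html is hypothetical) with one a tag that holds only text and another that holds a nested tag plus text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: first a tag has only text, second has a nested tag plus text
sample_html = '<a href="x.html"> Home </a><a href="y.html"><i>Open</i> issue</a>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

plain_a, nested_a = sample_soup.find_all('a')

# .string works when the tag has exactly one child
print(repr(plain_a.string))

# .string is None here because the tag has two children (<i> and a text node)
print(repr(nested_a.string))

# get_text() concatenates all text under the tag
print(nested_a.get_text())

# strip=True removes leading and trailing whitespace from each piece of text
print(plain_a.get_text(strip=True))
```

When the markup of a page is not known in advance, get_text() is usually the safer choice because it does not return None for tags with several children.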
  • We can also get the URLs used as part of these a tags by reading the href attribute.

for a in soup.find_all('a'):
    print(a['href'])
index.html
#
01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
17_database_programming_crud_operations/01_database_programming_crud_operations.html
18_database_programming_batch_operations/01_database_programming_batch_operations.html
19_project_web_scraping_into_database/01_project_web_scraping_into_database.html
http://notifyme.itversity.com
_sources/mastering-python.ipynb
https://github.com/itversity/mastering-python
https://github.com/itversity/mastering-python/issues/new?title=Issue%20on%20page%20%2Fmastering-python.html&body=Your%20issue%20content%20here.
https://github.com/itversity/mastering-python/edit/master/mastering-python.ipynb
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-decf6af72393> in <module>
      1 for a in soup.find_all('a'):
----> 2     print(a['href'])

/opt/anaconda3/envs/beakerx/lib/python3.6/site-packages/bs4/element.py in __getitem__(self, key)
   1404         """tag[key] returns the value of the 'key' attribute for the Tag,
   1405         and throws an exception if it's not there."""
-> 1406         return self.attrs[key]
   1407 
   1408     def __iter__(self):

KeyError: 'href'
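  • The loop above fails with KeyError because one of the a tags (the one with class full-screen-button) has no href attribute. We can use a.get('href'), which returns None for a missing attribute, to skip such tags.

for a in soup.find_all('a'):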
    if a.get('href'):
        print(a['href'])
index.html
#
01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
17_database_programming_crud_operations/01_database_programming_crud_operations.html
18_database_programming_batch_operations/01_database_programming_batch_operations.html
19_project_web_scraping_into_database/01_project_web_scraping_into_database.html
http://notifyme.itversity.com
_sources/mastering-python.ipynb
https://github.com/itversity/mastering-python
https://github.com/itversity/mastering-python/issues/new?title=Issue%20on%20page%20%2Fmastering-python.html&body=Your%20issue%20content%20here.
https://github.com/itversity/mastering-python/edit/master/mastering-python.ipynb
https://mybinder.org/v2/gh/itversity/mastering-python/master?urlpath=tree/mastering-python.ipynb
#about-python
#course-details
#desired-audience
#prerequisites
#key-objectives
#training-approach
#self-evaluation
#mastering-python
#about-python
#course-details
#desired-audience
#prerequisites
#key-objectives
#training-approach
#self-evaluation
01_overview_of_windows_os/01_overview_of_windows_os.html
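Most of the href values printed above are relative paths or fragments. As a hedged sketch, they could be turned into absolute URLs with urljoin from the standard library, and the tags without href could be skipped by passing href=True to find_all instead of checking inside the loop. The sample_html string below is a made-up stand-in for the downloaded page.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Same page URL as in the earlier snippet
python_url = 'https://python.itversity.com/mastering-python.html'

# Hypothetical stand-in for the downloaded page
sample_html = '''
<a href="01_overview_of_windows_os/01_overview_of_windows_os.html">Overview</a>
<a href="#about-python">About Python</a>
<a href="http://notifyme.itversity.com">Newsletter</a>
<a class="full-screen-button">Fullscreen</a>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only the a tags that have an href attribute
absolute_urls = [
    urljoin(python_url, a['href'])
    for a in sample_soup.find_all('a', href=True)
]

for url in absolute_urls:
    print(url)
```

urljoin resolves relative paths and fragments against the page URL and leaves values that are already absolute unchanged.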
  • Tags also carry attributes such as class and id, and we can use them to narrow down the filter to a specific class or id. Let us first list the classes used by the a tags.

for a in soup.find_all('a'):
    if a.get('class'):
        print(a['class'])
['navbar-brand', 'text-wrap']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['dropdown-buttons']
['repository-button']
['issues-button']
['edit-button']
['full-screen-button']
['binder-button']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['right-next']
classes = set()
for a in soup.find_all('a'):
    if a.get('class'):
        classes.add(tuple(a.get('class')))
classes
{('binder-button',),
 ('dropdown-buttons',),
 ('edit-button',),
 ('full-screen-button',),
 ('headerlink',),
 ('issues-button',),
 ('navbar-brand', 'text-wrap'),
 ('reference', 'internal'),
 ('reference', 'internal', 'nav-link'),
 ('repository-button',),
 ('right-next',)}
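If the number of tags per class combination is also of interest, the set above could be swapped for a Counter. This is only a sketch on a made-up HTML string (sample_html is hypothetical), not something the original notebook does.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
sample_html = '''
<a class="reference internal" href="a.html">A</a>
<a class="reference internal" href="b.html">B</a>
<a class="headerlink" href="#a">Permalink</a>
<a href="http://notifyme.itversity.com">Newsletter</a>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# a.get('class') returns a list of class names, or None when the attribute is missing.
# Lists are not hashable, so each one is converted to a tuple before counting.
class_counts = Counter(
    tuple(a['class'])
    for a in sample_soup.find_all('a')
    if a.get('class')
)

for class_names, count in class_counts.most_common():
    print(class_names, count)
```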
soup.find('a', {'class': 'reference internal'})
<a class="reference internal" href="#">
   Mastering Python
  </a>
soup.find('a', class_='reference internal')
<a class="reference internal" href="#">
   Mastering Python
  </a>
  • We can also combine the class filter with attribute access to get the href values of only the matching a tags.

for a in soup.find_all('a', {'class': 'reference internal'}):
    if a.get('href'):
        print(a['href'])
#
01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
17_database_programming_crud_operations/01_database_programming_crud_operations.html
18_database_programming_batch_operations/01_database_programming_batch_operations.html
19_project_web_scraping_into_database/01_project_web_scraping_into_database.html
for a in soup.find_all('a', class_='reference internal'):
    if a.get('href'):
        print(a['href'])
#
01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
17_database_programming_crud_operations/01_database_programming_crud_operations.html
18_database_programming_batch_operations/01_database_programming_batch_operations.html
19_project_web_scraping_into_database/01_project_web_scraping_into_database.html
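Note that the output above does not include the links whose class is reference internal nav-link. When the class filter contains a space, BeautifulSoup compares it against the full class string of the tag. The sketch below, on a made-up HTML string (sample_html is hypothetical), contrasts that exact match with a single class name and with a CSS selector passed to select.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
sample_html = '''
<a class="reference internal" href="a.html">A</a>
<a class="reference internal nav-link" href="#b">B</a>
<a class="headerlink" href="#c">C</a>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the full class string
exact = [a['href'] for a in sample_soup.find_all('a', class_='reference internal')]

# A single class name matches every tag that carries that class
any_reference = [a['href'] for a in sample_soup.find_all('a', class_='reference')]

# CSS selector: tags that have both classes, regardless of any extra classes
both_classes = [a['href'] for a in sample_soup.select('a.reference.internal')]

print(exact)
print(any_reference)
print(both_classes)
```

Which form to use depends on whether tags with additional classes should be included in the result.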
for a in soup.find_all('a'):
    if a.get('id'):
        print(a['id'])
next-link
  • Here is an example of narrowing down the filter to the a tag with a specific id.

soup.find('a', {'id': 'next-link'})
<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>
soup.find('a', id='next-link')
<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>
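Once the tag is found by id, its text and attributes can be read in the same way as before. The following sketch reuses the a tag from the output above as a standalone string; the id previous-link is a made-up value used only to show what find returns when nothing matches.

```python
from bs4 import BeautifulSoup

# The a tag from the output above, used here as a standalone sample
sample_html = (
    '<a class="right-next" '
    'href="01_overview_of_windows_os/01_overview_of_windows_os.html" '
    'id="next-link" title="next page">Overview of Windows Operating System</a>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find returns the first matching tag
next_link = sample_soup.find('a', id='next-link')
next_page = {'title': next_link.get_text(), 'href': next_link['href']}

# The same tag located with a CSS selector
same_link = sample_soup.select_one('a#next-link')

# find returns None when no tag matches (previous-link is a hypothetical id)
missing = sample_soup.find('a', id='previous-link')

print(next_page)
print(same_link['href'])
print(missing)
```

Because find returns None when there is no match, it is worth checking the result before reading attributes from it.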