Processing Website Content
We can process the website content and extract HTML tags as well as data using BeautifulSoup. We have to pass the content along with html.parser to build the BeautifulSoup object. Let us prettify and print the content.
import requests
python_base_url = 'https://python.itversity.com'
python_url = f'{python_base_url}/mastering-python.html'
python_page = requests.get(python_url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(python_page.content, 'html.parser')
print(soup.prettify())
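The same steps can be tried without network access by parsing an inline HTML string. This is only a minimal sketch: sample_html and sample_soup are made-up names, and the markup is invented for illustration rather than copied from the course site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not taken from the course site.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="site-title">Sample Page</h1>
    <a href="index.html">Home</a>
    <a href="about.html">About</a>
  </body>
</html>
"""

# html.parser is the parser bundled with the Python standard library.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns the parsed document as an indented string.
pretty = sample_soup.prettify()
print(pretty)

# Tags can be reached by name; .string gives the text of a tag
# that has a single text child.
page_title = sample_soup.title.string
anchor_count = len(sample_soup.find_all('a'))
print(page_title, anchor_count)
```

Working on a small string like this makes it easier to see what each call returns before running it against a full page.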
Let us extract all the <a> tags. We can extract the links provided as part of this webpage. Here is the code snippet to get the <a> tags from the landing page.
for a in soup.find_all('a'):
    print(a)
<a class="navbar-brand text-wrap" href="index.html">
<h1 class="site-logo" id="site-title">Mastering Python</h1>
</a>
<a class="reference internal" href="#">
Mastering Python
</a>
<a class="reference internal" href="01_overview_of_windows_os/01_overview_of_windows_os.html">
Overview of Windows Operating System
</a>
<a class="reference internal" href="04_postgres_database_operations/01_postgres_database_operations.html">
Perform Database Operations
</a>
<a class="reference internal" href="05_getting_started_with_python/01_getting_started_with_python.html">
Getting Started with Python
</a>
<a class="reference internal" href="06_basic_programming_constructs/01_basic_programming_constructs.html">
Basic Programming Constructs
</a>
<a class="reference internal" href="07_pre_defined_functions/01_pre_defined_functions.html">
Pre-defined Functions
</a>
<a class="reference internal" href="08_user_defined_functions/01_user_defined_functions.html">
User Defined Functions
</a>
<a class="reference internal" href="09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html">
Overview of Collections - list and set
</a>
<a class="reference internal" href="10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html">
Overview of Collections - dict and tuple
</a>
<a class="reference internal" href="11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html">
Manipulating Collections using Loops
</a>
<a class="reference internal" href="12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html">
Development of Map Reduce APIs
</a>
<a class="reference internal" href="13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html">
Understanding Python Map Reduce Libraries
</a>
<a class="reference internal" href="14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html">
Overview of Object Oriented Programming
</a>
<a class="reference internal" href="15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html">
Overview of Pandas Libraries
</a>
<a class="reference internal" href="16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html">
Web Scraping using Beautiful Soup
</a>
<a class="reference internal" href="17_database_programming_crud_operations/01_database_programming_crud_operations.html">
Database Programming – CRUD Operations
</a>
<a class="reference internal" href="18_database_programming_batch_operations/01_database_programming_batch_operations.html">
Database Programming – Batch Operations
</a>
<a class="reference internal" href="19_project_web_scraping_into_database/01_project_web_scraping_into_database.html">
Project – Web Scraping and loading into Database
</a>
<a href="http://notifyme.itversity.com">Newsletter</a>
<a class="dropdown-buttons" href="_sources/mastering-python.ipynb"><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Download source file" type="button">.ipynb</button></a>
<a class="repository-button" href="https://github.com/itversity/mastering-python"><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Source repository" type="button"><i class="fab fa-github"></i>repository</button></a>
<a class="issues-button" href="https://github.com/itversity/mastering-python/issues/new?title=Issue%20on%20page%20%2Fmastering-python.html&body=Your%20issue%20content%20here."><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Open an issue" type="button"><i class="fas fa-lightbulb"></i>open issue</button></a>
<a class="edit-button" href="https://github.com/itversity/mastering-python/edit/master/mastering-python.ipynb"><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Edit this page" type="button"><i class="fas fa-pencil-alt"></i>suggest edit</button></a>
<a class="full-screen-button"><button aria-label="Fullscreen mode" class="btn btn-secondary topbarbtn" data-placement="bottom" data-toggle="tooltip" onclick="toggleFullScreen()" title="Fullscreen mode" type="button"><i class="fas fa-expand"></i></button></a>
<a class="binder-button" href="https://mybinder.org/v2/gh/itversity/mastering-python/master?urlpath=tree/mastering-python.ipynb"><button class="btn btn-secondary topbarbtn" data-placement="left" data-toggle="tooltip" title="Launch Binder" type="button"><img alt="Interact on binder" class="binder-button-logo" src="_static/images/logo_binder.svg"/>Binder</button></a>
<a class="reference internal nav-link" href="#about-python">
About Python
</a>
<a class="reference internal nav-link" href="#course-details">
Course Details
</a>
<a class="reference internal nav-link" href="#desired-audience">
Desired Audience
</a>
<a class="reference internal nav-link" href="#prerequisites">
Prerequisites
</a>
<a class="reference internal nav-link" href="#key-objectives">
Key Objectives
</a>
<a class="reference internal nav-link" href="#training-approach">
Training Approach
</a>
<a class="reference internal nav-link" href="#self-evaluation">
Self Evaluation
</a>
<a class="headerlink" href="#mastering-python" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#about-python" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#course-details" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#desired-audience" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#prerequisites" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#key-objectives" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#training-approach" title="Permalink to this headline">¶</a>
<a class="headerlink" href="#self-evaluation" title="Permalink to this headline">¶</a>
<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>
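We can use field_name.string (for example, a.string) to get only the text value of a tag. Note that .string returns None when a tag has more than one child, while get_text() returns all the text inside the tag.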
for a in soup.find_all('a'):
    print(a.string)
for a in soup.find_all('a'):
    print(a.get_text())
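The difference between the two approaches shows up when a link contains nested tags. The sketch below uses made-up markup loosely modelled on the links printed earlier; the names sample_html, nested_link and plain_link are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with nested content, one with plain text.
sample_html = (
    '<a class="navbar-brand" href="index.html"><h1>Mastering Python</h1> home</a>'
    '<a href="http://notifyme.itversity.com">Newsletter</a>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')
nested_link, plain_link = sample_soup.find_all('a')

# The first link has two children (an h1 tag and a text node),
# so .string is None.
nested_string = nested_link.string

# get_text() joins every text node inside the tag; strip=True trims each
# piece and separator is placed between the pieces.
nested_text = nested_link.get_text(separator=' ', strip=True)

# The second link has exactly one text child, so both approaches agree.
plain_string = plain_link.string
plain_text = plain_link.get_text()

print(nested_string, nested_text, plain_string, plain_text)
```

As a rule of thumb, get_text() is usually the safer choice when the structure inside a tag is not known in advance.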
We can also get the URLs used as part of these <a> tags.
for a in soup.find_all('a'):
    print(a['href'])
index.html
#
01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
17_database_programming_crud_operations/01_database_programming_crud_operations.html
18_database_programming_batch_operations/01_database_programming_batch_operations.html
19_project_web_scraping_into_database/01_project_web_scraping_into_database.html
http://notifyme.itversity.com
_sources/mastering-python.ipynb
https://github.com/itversity/mastering-python
https://github.com/itversity/mastering-python/issues/new?title=Issue%20on%20page%20%2Fmastering-python.html&body=Your%20issue%20content%20here.
https://github.com/itversity/mastering-python/edit/master/mastering-python.ipynb
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-decf6af72393> in <module>
1 for a in soup.find_all('a'):
----> 2 print(a['href'])
/opt/anaconda3/envs/beakerx/lib/python3.6/site-packages/bs4/element.py in __getitem__(self, key)
1404 """tag[key] returns the value of the 'key' attribute for the Tag,
1405 and throws an exception if it's not there."""
-> 1406 return self.attrs[key]
1407
1408 def __iter__(self):
KeyError: 'href'
The loop fails with a KeyError because one of the <a> tags (the full-screen button) has no href attribute. a.get('href') returns None for such tags instead of raising an exception, so we can use it to skip them.
for a in soup.find_all('a'):
    if a.get('href'):
        print(a['href'])
index.html
#
01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
17_database_programming_crud_operations/01_database_programming_crud_operations.html
18_database_programming_batch_operations/01_database_programming_batch_operations.html
19_project_web_scraping_into_database/01_project_web_scraping_into_database.html
http://notifyme.itversity.com
_sources/mastering-python.ipynb
https://github.com/itversity/mastering-python
https://github.com/itversity/mastering-python/issues/new?title=Issue%20on%20page%20%2Fmastering-python.html&body=Your%20issue%20content%20here.
https://github.com/itversity/mastering-python/edit/master/mastering-python.ipynb
https://mybinder.org/v2/gh/itversity/mastering-python/master?urlpath=tree/mastering-python.ipynb
#about-python
#course-details
#desired-audience
#prerequisites
#key-objectives
#training-approach
#self-evaluation
#mastering-python
#about-python
#course-details
#desired-audience
#prerequisites
#key-objectives
#training-approach
#self-evaluation
01_overview_of_windows_os/01_overview_of_windows_os.html
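Many of the links printed above are relative. As a possible next step, the sketch below filters on the presence of href directly in find_all and resolves each link against the base URL with urljoin from the standard library. The markup is a made-up sample, and joining against python_base_url plus a trailing slash is an assumption about how the site lays out its pages.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Base URL from the earlier cell; the markup below is a made-up sample.
python_base_url = 'https://python.itversity.com'
sample_html = (
    '<a href="index.html">Home</a>'
    '<a class="full-screen-button"><button>Fullscreen</button></a>'
    '<a href="#about-python">About Python</a>'
    '<a href="https://github.com/itversity/mastering-python">repository</a>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only the tags that define an href attribute,
# so a['href'] cannot raise KeyError here.
hrefs = [a['href'] for a in sample_soup.find_all('a', href=True)]

# urljoin resolves relative links against the base and leaves
# absolute URLs unchanged.
absolute_urls = [urljoin(python_base_url + '/', href) for href in hrefs]
for url in absolute_urls:
    print(url)
```

Passing href=True moves the check into the search itself, which keeps the loop body free of conditionals.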
We can also pass attributes such as class, id, etc. to narrow down the filter to a specific class or id. Let us first list the classes used by the <a> tags.
for a in soup.find_all('a'):
    if a.get('class'):
        print(a['class'])
['navbar-brand', 'text-wrap']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['dropdown-buttons']
['repository-button']
['issues-button']
['edit-button']
['full-screen-button']
['binder-button']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['right-next']
classes = set()
for a in soup.find_all('a'):
    if a.get('class'):
        classes.add(tuple(a.get('class')))
classes
{('binder-button',),
('dropdown-buttons',),
('edit-button',),
('full-screen-button',),
('headerlink',),
('issues-button',),
('navbar-brand', 'text-wrap'),
('reference', 'internal'),
('reference', 'internal', 'nav-link'),
('repository-button',),
('right-next',)}
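If the number of links per class combination is also of interest, collections.Counter can replace the set. This is a small sketch on made-up markup that reuses a few of the class names above; sample_html and class_counts are hypothetical names.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Made-up markup reusing some class names from the page.
sample_html = (
    '<a class="reference internal" href="a.html">A</a>'
    '<a class="reference internal" href="b.html">B</a>'
    '<a class="headerlink" href="#c">C</a>'
    '<a href="http://notifyme.itversity.com">Newsletter</a>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_=True selects only tags that have a class attribute.
# a['class'] is a list, so it is converted to a tuple to be hashable.
class_counts = Counter(
    tuple(a['class']) for a in sample_soup.find_all('a', class_=True)
)
print(class_counts)
```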
soup.find('a', {'class': 'reference internal'})
<a class="reference internal" href="#">
Mastering Python
</a>
soup.find('a', class_='reference internal')
<a class="reference internal" href="#">
Mastering Python
</a>
We can also access attribute values such as href of an <a> tag after filtering by class.
for a in soup.find_all('a', {'class': 'reference internal'}):
    if a.get('href'):
        print(a['href'])
#
01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
17_database_programming_crud_operations/01_database_programming_crud_operations.html
18_database_programming_batch_operations/01_database_programming_batch_operations.html
19_project_web_scraping_into_database/01_project_web_scraping_into_database.html
for a in soup.find_all('a', class_='reference internal'):
    if a.get('href'):
        print(a['href'])
#
01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
17_database_programming_crud_operations/01_database_programming_crud_operations.html
18_database_programming_batch_operations/01_database_programming_batch_operations.html
19_project_web_scraping_into_database/01_project_web_scraping_into_database.html
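Note that the output above does not include the links whose class is reference internal nav-link. A multi-word class string is compared with the whole class attribute value, whereas a single class name matches any tag carrying that class. The sketch below illustrates the difference on made-up markup, and also shows a CSS selector via select as one alternative.

```python
from bs4 import BeautifulSoup

# Made-up markup that reuses the class names seen in the output above.
sample_html = (
    '<a class="reference internal" href="a.html">A</a>'
    '<a class="reference internal nav-link" href="#b">B</a>'
    '<a class="headerlink" href="#c">C</a>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A multi-word string is compared with the full class attribute value.
exact = [a['href'] for a in sample_soup.find_all('a', class_='reference internal')]

# A single class name matches any tag that carries it among its classes.
single = [a['href'] for a in sample_soup.find_all('a', class_='reference')]

# A CSS selector matches tags having both classes, whatever else they carry.
css = [a['href'] for a in sample_soup.select('a.reference.internal')]

print(exact, single, css)
```

Which form to use depends on whether the extra nav-link entries are wanted in the result.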
for a in soup.find_all('a'):
    if a.get('id'):
        print(a['id'])
next-link
Here is an example to narrow down the filter based on id on top of the <a> tag.
soup.find('a', {'id': 'next-link'})
<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>
soup.find('a', id='next-link')
<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>
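Once the tag is found by id, its attributes and text can be read in the usual way. The following sketch reuses the next-link markup shown above together with one made-up sibling tag; the variable names and the prev-link id are hypothetical.

```python
from bs4 import BeautifulSoup

# The second tag is copied from the output above; the first is a made-up sibling.
sample_html = (
    '<a class="headerlink" href="#self-evaluation">x</a>'
    '<a class="right-next" '
    'href="01_overview_of_windows_os/01_overview_of_windows_os.html" '
    'id="next-link" title="next page">Overview of Windows Operating System</a>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Keyword filter on id, and the equivalent CSS selector.
next_link = sample_soup.find('a', id='next-link')
same_link = sample_soup.select_one('a#next-link')

# find returns None when nothing matches, so check before indexing.
missing = sample_soup.find('a', id='prev-link')

next_href = next_link['href']
next_title = next_link.get('title')
next_text = next_link.get_text()
print(next_href, next_title, next_text, missing)
```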