Clean Up HTML using BeautifulSoup

Let us understand how to clean up HTML using BeautifulSoup while copying content from one site to another.

  • When we copy content from one site to another, we might run into issues due to conflicting JavaScript and CSS.

  • It is better to clean up references to CSS and JavaScript, and even some of the irrelevant tags.

  • We ran into a similar issue while copying content from https://python.itversity.com/04_postgres_database_operations/03_create_database_and_users_table.html to a blog post.

  • Here are some of the clean up tasks we will perform to understand how BeautifulSoup can be used to clean up HTML content.

    • Remove the script tags along with their content.

    • Remove the anchor tag, as it contains a permalink that refers to the page itself.

    • Removing a tag along with its content is called decompose.

    • Remove the div containers while retaining the inner tags within them. Removing a tag without touching its content is called unwrap.
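To make the difference between decompose and unwrap concrete, here is a minimal sketch on a small HTML fragment. The fragment is made up for illustration and is not taken from the page mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for this illustration
html = '<div id="main-content"><div class="section"><p>Hello</p><script>alert(1)</script></div></div>'

soup = BeautifulSoup(html, 'html.parser')
mc = soup.find('div', id='main-content')

# decompose: the script tag and everything inside it are removed
mc.find('script').decompose()
after_decompose = str(mc)

# unwrap: the inner div is removed, but its child <p> stays in place
mc.find('div', class_='section').unwrap()
after_unwrap = str(mc)

print(after_decompose)
print(after_unwrap)
```

After decompose the script is gone while the inner div is still present. After unwrap the inner div is gone as well, but the paragraph it contained is retained.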

Decomposing Tags

Let us see how we can remove a tag along with its content. This is called decompose.

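import requests
from bs4 import BeautifulSoup

url = 'https://python.itversity.com/04_postgres_database_operations/03_create_database_and_users_table.html'
page = requests.get(url)

# Parse the page and narrow the search to the main content container
soup = BeautifulSoup(page.content, 'html.parser')
mc = soup.find('div', id='main-content')
print(mc.prettify())

# Remove the first script tag together with its content
script = mc.find('script')
script.decompose()

# The removed tag is gone from the tree: this prints the next script tag if one remains, otherwise None
print(soup.find('div', id='main-content').find('script'))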
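# Start over with a fresh copy of the page and remove all the script tags at once
import requests
from bs4 import BeautifulSoup

url = 'https://python.itversity.com/04_postgres_database_operations/03_create_database_and_users_table.html'
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
mc = soup.find('div', id='main-content')

# List all the script tags in the main content
for tag in mc.find_all('script'):
    print(tag)

# Remove every script tag along with its content
for tag in mc.find_all('script'):
    tag.decompose()

# Nothing is printed by this loop, as no script tags are left
for tag in mc.find_all('script'):
    print(tag)
print(mc.prettify())

# Remove the first permalink anchor (class headerlink) along with its content
print(mc.find('a', class_='headerlink'))
headerlink = mc.find('a', class_='headerlink')
headerlink.decompose()

# Prints the next headerlink anchor if one remains, otherwise None
print(mc.find('a', class_='headerlink'))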
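Note that find returns only the first match, and it returns None when nothing matches, in which case calling decompose on the result fails with an AttributeError. The sketch below is one way to remove every match safely. The helper name decompose_all and the sample HTML are hypothetical, introduced only for this illustration.

```python
from bs4 import BeautifulSoup

def decompose_all(container, name, **attrs):
    """Remove every matching tag inside container and return how many were removed."""
    tags = container.find_all(name, **attrs)
    for tag in tags:
        tag.decompose()
    return len(tags)

# Hypothetical HTML with two permalink anchors and one regular link
html = (
    '<div id="main-content">'
    '<h1>Title<a class="headerlink" href="#title">#</a></h1>'
    '<h2>Part<a class="headerlink" href="#part">#</a></h2>'
    '<p>Text with a <a href="https://example.com">regular link</a>.</p>'
    '</div>'
)
mc = BeautifulSoup(html, 'html.parser').find('div', id='main-content')

removed = decompose_all(mc, 'a', class_='headerlink')  # both permalink anchors
missing = decompose_all(mc, 'script')                  # no script tags, no error

print(removed, missing)
print(mc.prettify())
```

Because find_all returns an empty list when there is no match, the loop simply does nothing in that case, and the regular link is left untouched.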

Unwrapping Tags

Let us see how we can remove the tags without deleting their content. This is called unwrap.

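# First div nested inside the main content
print(mc.find('div'))

# Remove every nested div while keeping its children in place
for tag in mc.find_all('div'):
    tag.unwrap()

# Prints None, as no nested div is left
print(mc.find('div'))
print(mc.prettify())

# Unwrap the span tags in the same way
for tag in mc.find_all('span'):
    tag.unwrap()
print(mc.prettify())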
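As find_all also accepts a list of tag names, the two loops can be combined into one pass. Here is a small sketch on a made-up fragment with nested div tags. The HTML is hypothetical and only illustrates that the text and the inner tags survive the unwrap.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two nested div tags and a span
html = (
    '<div id="main-content">'
    '<div class="section"><div class="cell">'
    '<p>Some <span class="pre">code</span> here</p>'
    '</div></div>'
    '</div>'
)
mc = BeautifulSoup(html, 'html.parser').find('div', id='main-content')

# find_all searches only the descendants, so the main-content div itself is kept
for tag in mc.find_all(['div', 'span']):
    tag.unwrap()

result = str(mc)
text = mc.get_text()
print(result)
```

The list of matches is collected before the loop starts, so unwrapping an outer div does not prevent the inner div from being unwrapped later.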
  • Here is another example. As most of our pages have a similar structure, we can develop a program that cleans up the HTML for us, so that we can publish the content on a target site or save it to a database.

import requests
from bs4 import BeautifulSoup

url = 'https://postgresql.itversity.com/03_writing_basic_sql_queries/08_joining_tables_inner.html'
page = requests.get(url)

# Parse the page and pick the main content container
soup = BeautifulSoup(page.content, 'html.parser')
mc = soup.find('div', id='main-content')
print(mc.prettify())

# Decompose the script tags and the permalink anchor
for tag in mc.find_all('script'):
    tag.decompose()
headerlink = mc.find('a', class_='headerlink')
headerlink.decompose()

# Unwrap the nested div and span tags, keeping their content
for tag in mc.find_all('div'):
    tag.unwrap()
for tag in mc.find_all('span'):
    tag.unwrap()
print(mc.prettify())
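The same steps can be wrapped into a reusable function. This is only a sketch: the function name clean_html, its container_id parameter and the sample HTML are assumptions made for illustration. Unlike the code above, it removes every headerlink anchor instead of only the first one, and it returns None when the container is not found.

```python
from bs4 import BeautifulSoup

def clean_html(html, container_id='main-content'):
    """Return the cleaned HTML of the container, or None if it is not found."""
    soup = BeautifulSoup(html, 'html.parser')
    mc = soup.find('div', id=container_id)
    if mc is None:
        return None
    # Decompose: drop script tags and permalink anchors along with their content
    for tag in mc.find_all('script'):
        tag.decompose()
    for tag in mc.find_all('a', class_='headerlink'):
        tag.decompose()
    # Unwrap: drop nested div and span tags but keep what is inside them
    for tag in mc.find_all(['div', 'span']):
        tag.unwrap()
    return str(mc)

# Hypothetical page used to try the function without any network access
sample = (
    '<html><body><div id="main-content">'
    '<div class="section"><h1>Joins<a class="headerlink" href="#joins">#</a></h1>'
    '<p><span>Inner</span> join</p><script>var x = 1;</script></div>'
    '</div></body></html>'
)
cleaned = clean_html(sample)
print(cleaned)
```

For a real page, we can pass page.content from requests.get(url) to the function, as BeautifulSoup accepts both bytes and strings.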