## Getting URLs from a Website

Let us understand how we can get URLs from a web page's nav bar or side bar using `BeautifulSoup`.

* Here are some key observations about [https://python.itversity.com/mastering-python.html](https://python.itversity.com/mastering-python.html).
  * All the content on the website can be reached through the nav bar on the left side.
  * Clicking a topic expands its sub topics.
  * First level links are `a` tags with the class `reference internal`.
  * Second level links are also `a` tags with the class `reference internal`, nested under the `li` with class `toctree-l1 current active`. They are visible only after clicking a main topic in the nav bar on the left.
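
The observations above can be tried out on a small inline snippet before touching the live site. This is a minimal sketch: the markup below is a hypothetical, heavily simplified stand-in for the real nav bar (the actual page has more nesting and attributes), and `sample_html`, `sample_soup`, and `sample_nav` are illustrative names. Note that passing a multi-word string to `class_` matches the exact value of the `class` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the observations above
sample_html = """
<nav id="bd-docs-nav">
  <ul>
    <li class="toctree-l1 current active">
      <a class="reference internal" href="topic_a/01_topic_a.html">Topic A</a>
      <ul>
        <li class="toctree-l2">
          <a class="reference internal" href="02_sub_topic.html">Sub topic</a>
        </li>
      </ul>
    </li>
    <li class="toctree-l1">
      <a class="reference internal" href="topic_b/01_topic_b.html">Topic B</a>
    </li>
  </ul>
</nav>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_nav = sample_soup.find('nav', id='bd-docs-nav')

# Every link in the nav bar with class "reference internal"
all_links = [a['href'] for a in sample_nav.find_all('a', class_='reference internal')]

# Only the links under the expanded (current) topic
active_item = sample_nav.find('li', class_='toctree-l1 current active')
active_links = [a['href'] for a in active_item.find_all('a', class_='reference internal')]

print(all_links)
print(active_links)
```

Working on an inline snippet like this makes it easy to check a selector without sending a request for every experiment.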

In [3]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/QZfEiFsblPg?rel=0&amp;controls=1&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>

In [1]:
import requests

python_base_url = 'https://python.itversity.com'
python_url = f'{python_base_url}/mastering-python.html'
python_page = requests.get(python_url)

from bs4 import BeautifulSoup

soup = BeautifulSoup(python_page.content, 'html.parser')

Let us get the first level URLs using `reference internal`.
* Get all the first level URLs.
* Here are the observations about the first level URLs on [https://python.itversity.com/mastering-python.html](https://python.itversity.com/mastering-python.html).
  * All the URLs are in the left nav bar, under a `nav` tag.
  * We need to get the `href` values from that `nav` tag.
* Here are the steps we are going to follow:
  * Get all the `nav` tags and print their ids. The one we need is the docs nav, `bd-docs-nav`.
  * Get all the `href` values from that `nav`, which we look up by its id.

In [2]:
for nav in soup.find_all('nav'):
    print(nav['id'])

bd-docs-nav
bd-toc-nav


In [3]:
nav = soup.find('nav', {'id': 'bd-docs-nav'})

In [4]:
nav = soup.find('nav', id='bd-docs-nav')
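
The two `find` calls above are equivalent ways of filtering on an attribute. As a small sketch on hypothetical inline markup (the name `sample` and the id `missing` are illustrative), the following shows that both forms return the same tag, that a CSS selector via `select_one` is another option, and that `find` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing only the two nav ids printed earlier
sample = BeautifulSoup(
    '<nav id="bd-docs-nav"></nav><nav id="bd-toc-nav"></nav>',
    'html.parser',
)

by_attrs = sample.find('nav', {'id': 'bd-docs-nav'})   # attrs dictionary
by_keyword = sample.find('nav', id='bd-docs-nav')      # keyword argument
by_css = sample.select_one('nav#bd-docs-nav')          # CSS selector

# Both find() forms return the very same Tag object from the tree
print(by_attrs is by_keyword)
print(by_css['id'])

# find() returns None when no tag matches, so check before using the result
missing = sample.find('nav', id='missing')
print(missing is None)
```

Because a failed lookup yields `None`, calling `find_all` on the result would raise an `AttributeError`; checking for `None` first gives a clearer failure when a page layout changes.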

In [5]:
for a in nav.find_all('a', {'class': 'reference internal'}):
    print(f"{python_base_url}/{a['href']}")

https://python.itversity.com/#
https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html
https://python.itversity.com/04_postgres_database_operations/01_postgres_database_operations.html
https://python.itversity.com/05_getting_started_with_python/01_getting_started_with_python.html
https://python.itversity.com/06_basic_programming_constructs/01_basic_programming_constructs.html
https://python.itversity.com/07_pre_defined_functions/01_pre_defined_functions.html
https://python.itversity.com/08_user_defined_functions/01_user_defined_functions.html
https://python.itversity.com/09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
https://python.itversity.com/10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
https://python.itversity.com/11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
https://python.itversity.com/12_development_of_map_reduce_apis/01_developme

In [8]:
for a in nav.find_all('a', class_='reference internal'):
    print(f"{python_base_url}/{a['href']}")

https://python.itversity.com/#
https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html
https://python.itversity.com/04_postgres_database_operations/01_postgres_database_operations.html
https://python.itversity.com/05_getting_started_with_python/01_getting_started_with_python.html
https://python.itversity.com/06_basic_programming_constructs/01_basic_programming_constructs.html
https://python.itversity.com/07_pre_defined_functions/01_pre_defined_functions.html
https://python.itversity.com/08_user_defined_functions/01_user_defined_functions.html
https://python.itversity.com/09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
https://python.itversity.com/10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
https://python.itversity.com/11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
https://python.itversity.com/12_development_of_map_reduce_apis/01_developme

In [10]:
for a in nav.find_all('a', {'class': 'reference internal'}):
    if a['href'] != '#':
        print(f"{python_base_url}/{a['href']}")

https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html
https://python.itversity.com/04_postgres_database_operations/01_postgres_database_operations.html
https://python.itversity.com/05_getting_started_with_python/01_getting_started_with_python.html
https://python.itversity.com/06_basic_programming_constructs/01_basic_programming_constructs.html
https://python.itversity.com/07_pre_defined_functions/01_pre_defined_functions.html
https://python.itversity.com/08_user_defined_functions/01_user_defined_functions.html
https://python.itversity.com/09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
https://python.itversity.com/10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
https://python.itversity.com/11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
https://python.itversity.com/12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
http

In [12]:
first_level_urls = []
for a in nav.find_all('a', class_='reference internal'):
    if a['href'] != '#':
        first_level_urls.append(a['href'])
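
The loop above can also be wrapped in a small reusable function. This is a sketch rather than part of the original notebook: `extract_hrefs`, its `skip` parameter, and the inline markup are hypothetical names and data chosen for illustration.

```python
from bs4 import BeautifulSoup


def extract_hrefs(container, skip=('#',)):
    """Return href values of `reference internal` links inside `container`.

    Links without an href, or whose href is listed in `skip`, are left out.
    """
    return [
        a['href']
        for a in container.find_all('a', class_='reference internal')
        if a.get('href') and a['href'] not in skip
    ]


# Hypothetical markup standing in for the real nav bar
sample_nav = BeautifulSoup(
    '<nav id="bd-docs-nav">'
    '<a class="reference internal" href="#">Home</a>'
    '<a class="reference internal" href="topic_a/01_topic_a.html">Topic A</a>'
    '<a class="reference external" href="https://example.com">Elsewhere</a>'
    '</nav>',
    'html.parser',
).find('nav', id='bd-docs-nav')

sample_hrefs = extract_hrefs(sample_nav)
print(sample_hrefs)
```

With the notebook's own variables, the equivalent call would be `extract_hrefs(nav)`.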

In [13]:
for url in first_level_urls:
    print(url)

01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping_u

Let us get the second level URLs using the `reference internal` links within the `toctree-l1 current active` item.
* Get all the first level URLs.
* Create a soup object for each first level URL, then get the `reference internal` links under the `li` with class `toctree-l1 current active`.
* Make sure the URLs are prefixed properly by replacing the last part of the page URL with the extracted `href`.
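
Replacing the last part of a URL with a relative `href` is what the standard library's `urllib.parse.urljoin` does. As an alternative sketch to manual string splitting, the example below uses a page URL and an `href` of the kind seen in this section; the `../index.html` value is a hypothetical `href` added to show how parent-directory references are resolved.

```python
from urllib.parse import urljoin

page_url = 'https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html'

# A relative href replaces the last path segment of the page URL
same_dir = urljoin(page_url, '02_getting_system_details.html')

# The manual approach: drop the last segment, then append the href
manual = '/'.join(page_url.split('/')[:-1]) + '/' + '02_getting_system_details.html'

# Hypothetical href pointing one directory up; urljoin resolves the `..`
parent_dir = urljoin(page_url, '../index.html')

print(same_dir)
print(same_dir == manual)
print(parent_dir)
```

For plain file names both approaches give the same result; `urljoin` additionally normalizes hrefs that contain `..` or that are already absolute.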

In [14]:
for first_level_url in first_level_urls:
    url = f"{python_base_url}/{first_level_url}"
    print(url)
    # Fetch and parse the page of each first level topic
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    current_nav = soup.find('nav', id='bd-docs-nav')
    # The expanded topic is the `li` whose class is `toctree-l1 current active`
    current_href = current_nav.find('li', class_='toctree-l1 current active')
    for second_level_href in current_href.find_all('a', class_='reference internal'):
        # Replace the last part of the page URL with the relative href
        print(f"{'/'.join(url.split('/')[:-1])}/{second_level_href['href']}")

https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html
https://python.itversity.com/01_overview_of_windows_os/02_getting_system_details.html
https://python.itversity.com/01_overview_of_windows_os/03_managing_windows_system.html
https://python.itversity.com/01_overview_of_windows_os/04_overview_of_microsoft_office.html
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html
https://python.itversity.com/04_postgres_database_operations/01_postgres_database_operations.html
https://python.itversity.com/04_postgres_database_operations/02_overview_of_sql.html
https://python.itversity.com/04_postgres_database_operations/03_create_database_and_users_table.html
https:/