Analyzing Website Data using Pandas

Now that we understand how to get data from a website using BeautifulSoup, let us work through a few scenarios to validate that data using Pandas.

  • Create a data frame using url_and_content_list.

  • Get length of content for each url.

  • Get word count for each url.

  • Get the list of pages with content of 30 words or fewer.

  • Get unique word count for each url.

  • Get the number of times each word is repeated across all urls.

%run 08_processing_data_using_data_frame_apis.ipynb
CPU times: user 5.55 s, sys: 80.1 ms, total: 5.63 s
Wall time: 7.76 s
https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html : 233
https://python.itversity.com/01_overview_of_windows_os/02_getting_system_details.html : 463
https://python.itversity.com/01_overview_of_windows_os/03_managing_windows_system.html : 475
https://python.itversity.com/01_overview_of_windows_os/04_overview_of_microsoft_office.html : 573
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html : 39
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html : 41
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html : 38
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html : 28
https://python.itversity.com/04_postgres_database_operations/01_postgres_database_operations.html : 504
https://python.itversity.com/04_postgres_database_operations/02_overview_of_sql.html : 714
  • Create a data frame using url_and_content_list

import pandas as pd

url_and_content_df = pd.DataFrame(url_and_content_list, columns=['url', 'content'])
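
As a minimal sketch of what this step expects, assume url_and_content_list is a list of (url, content) tuples. The sample_list below and its example.com URLs are hypothetical stand-ins, not the scraped data.

```python
import pandas as pd

# Hypothetical stand-in for url_and_content_list: a list of (url, content) tuples.
sample_list = [
    ("https://example.com/a.html", "\n\nPage A\nFirst page content"),
    ("https://example.com/b.html", "\n\nPage B\n"),
]

# Each tuple becomes one row; the columns argument names the two fields.
sample_df = pd.DataFrame(sample_list, columns=["url", "content"])

print(sample_df.shape)
print(sample_df.columns.tolist())
```

Naming the columns up front lets the later steps refer to 'url' and 'content' by name instead of by position.
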
  • Get length of content for each url

url_and_content_df['content_length'] = url_and_content_df['content'].str.len()
url_and_content_df
     url                                                content                                            content_length
0    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nOverview of Windows Operating System¶\...  233
1    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nGetting System Details¶\nLet us unders...  463
2    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nManaging Windows System¶\nLet us under...  475
3    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nOverview of Microsoft Office¶\nAs IT P...  573
4    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n  39
...  ...                                                ...                                                ...
165  https://python.itversity.com/19_project_web_sc...  \n\n\n\nReading the data¶\n\n\n\n\n\n              27
166  https://python.itversity.com/19_project_web_sc...  \n\n\n\nValidating data¶\n\n\n\n\n\n               26
167  https://python.itversity.com/19_project_web_sc...  \n\n\n\nApply required transformations¶\n\n\n\...  41
168  https://python.itversity.com/19_project_web_sc...  \n\n\n\nWriting to Database¶\n\n\n\n\n\n           30
169  https://python.itversity.com/19_project_web_sc...  \n\n\n\nRun queries against data¶\n\n\n\n\n\n      35

170 rows × 3 columns
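
Note that str.len() counts every character, including the newlines left over from the HTML. The sketch below uses the content shown above for the "Methods" page to illustrate this; the stripped_len variant is an assumption about what a "visible" length could look like, not part of the original notebook.

```python
import pandas as pd

# Content of the "Methods" page as shown in the output: a heading wrapped in newlines.
contents = pd.Series(["\n\n\n\nMethods¶\n\n\n\n\n\n"])

# Raw length counts all characters, including the 10 newline characters.
raw_len = contents.str.len()

# Stripping leading and trailing whitespace first gives the visible text length.
stripped_len = contents.str.strip().str.len()

print(raw_len.iloc[0], stripped_len.iloc[0])
```

A page with a content_length in the twenties or thirties is therefore mostly whitespace around its heading.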

  • Get word count for each url

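# Splitting on a single space keeps newline-separated words together, so the counts are approximate
url_and_content_df['word_count'] = url_and_content_df['content'].str.split(' ').str.len()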
url_and_content_df.sort_values('word_count')
     url                                                content                                            content_length  word_count
117  https://python.itversity.com/14_overview_of_ob...  \n\n\n\nPolymorphism¶\n\n\n\n\n\n                  23              1
113  https://python.itversity.com/14_overview_of_ob...  \n\n\n\nConstructors¶\n\n\n\n\n\n                  23              1
114  https://python.itversity.com/14_overview_of_ob...  \n\n\n\nMethods¶\n\n\n\n\n\n                       18              1
115  https://python.itversity.com/14_overview_of_ob...  \n\n\n\nInheritance¶\n\n\n\n\n\n                   22              1
116  https://python.itversity.com/14_overview_of_ob...  \n\n\n\nEncapsulation¶\n\n\n\n\n\n                 24              1
...  ...                                                ...                                                ...             ...
130  https://python.itversity.com/15_overview_of_pa...  \n\n\n\nJoining Data Frames¶\nLet us understan...  20249           1975
106  https://python.itversity.com/13_understanding_...  \n\n\n\nRow level transformations using map¶\n...  23527           2105
42   https://python.itversity.com/07_pre_defined_fu...  \n\n\n\nString Manipulation Functions¶\nLet us...  15592           3724
123  https://python.itversity.com/15_overview_of_pa...  \n\n\n\nData Frames - Basic Operations¶\nHere ...  18321           3752
50   https://python.itversity.com/08_user_defined_f...  \n\n\n\nDoc Strings¶\nDocumentation is one of ...  16347           3936

170 rows × 4 columns
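
The word counts above depend on splitting by a single space, which is why pages containing only a heading report one word. As a hedged alternative, calling str.split() with no argument splits on any run of whitespace, including newlines. The sample strings below are illustrative only.

```python
import pandas as pd

# Illustrative content: words separated by both spaces and newlines, and a whitespace-only page.
contents = pd.Series([
    "\n\nGetting System Details¶\nLet us understand",
    "\n\n\n",
])

# Splitting on ' ' leaves "Details¶\nLet" as a single token and counts an empty page as 1.
count_by_space = contents.str.split(' ').str.len()

# Splitting on any whitespace separates newline-delimited words and counts an empty page as 0.
count_by_whitespace = contents.str.split().str.len()

print(count_by_space.tolist())
print(count_by_whitespace.tolist())
```

Which variant is appropriate depends on whether newline-separated text should be treated as separate words for the validation at hand.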

  • Get the list of pages with content of 30 words or fewer

for url in url_and_content_df.query('word_count <= 30')['url']:
    print(url)
https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html
https://python.itversity.com/07_pre_defined_functions/01_pre_defined_functions.html
https://python.itversity.com/12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
https://python.itversity.com/13_understanding_map_reduce_libraries/08_using_groupby.html
https://python.itversity.com/13_understanding_map_reduce_libraries/09_limitations_of_map_reduce_libraries.html
https://python.itversity.com/14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
https://python.itversity.com/14_overview_of_object_oriented_programming/02_classes_and_objects.html
https://python.itversity.com/14_overview_of_object_oriented_programming/03_constructors.html
https://python.itversity.com/14_overview_of_object_oriented_programming/04_methods.html
https://python.itversity.com/14_overview_of_object_oriented_programming/05_inheritance.html
https://python.itversity.com/14_overview_of_object_oriented_programming/06_encapsulation.html
https://python.itversity.com/14_overview_of_object_oriented_programming/07_polymorphism.html
https://python.itversity.com/14_overview_of_object_oriented_programming/08_dynamic_classes.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/02_problem_statement.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/03_installing_pre-requisites.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/04_overview_of_beautifulsoup.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/05_getting_html_content.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/06_processing_html_content.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/07_creating_data_frame.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/08_processing_data_using_data_frame_apis.html
https://python.itversity.com/19_project_web_scraping_into_database/02_define_problem_statement.html
https://python.itversity.com/19_project_web_scraping_into_database/03_setup_project.html
https://python.itversity.com/19_project_web_scraping_into_database/04_overview_of_python_virtual_environments.html
https://python.itversity.com/19_project_web_scraping_into_database/05_installing_required_libraries.html
https://python.itversity.com/19_project_web_scraping_into_database/06_setup_logging.html
https://python.itversity.com/19_project_web_scraping_into_database/07_modularizing_the_project.html
https://python.itversity.com/19_project_web_scraping_into_database/08_setup_database.html
https://python.itversity.com/19_project_web_scraping_into_database/10_create_required_table.html
https://python.itversity.com/19_project_web_scraping_into_database/11_reading_the_data.html
https://python.itversity.com/19_project_web_scraping_into_database/12_validating_data.html
https://python.itversity.com/19_project_web_scraping_into_database/13_apply_required_transformations.html
https://python.itversity.com/19_project_web_scraping_into_database/14_writing_to_database.html
https://python.itversity.com/19_project_web_scraping_into_database/15_run_queries_against_data.html
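
The filter above can also be written with a boolean mask, and the choice between <= and < decides whether pages with exactly 30 words are included. The following sketch uses a small hypothetical data frame (the urls u1 to u3 are placeholders) to show both forms.

```python
import pandas as pd

# Hypothetical data frame with word counts just below, at, and above the threshold.
df = pd.DataFrame({
    "url": ["u1", "u2", "u3"],
    "word_count": [29, 30, 31],
})

# query() with <= keeps pages that have exactly 30 words.
at_most_30 = df.query("word_count <= 30")["url"].tolist()

# An equivalent boolean mask, here with a strict comparison for contrast.
fewer_than_30 = df[df["word_count"] < 30]["url"].tolist()

print(at_most_30)
print(fewer_than_30)
```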
  • Get unique word count for each url

def get_unique_count(words):
    return len(set(words))


url_and_content_df['unique_word_count'] = url_and_content_df['content'].str.split(' ').apply(get_unique_count)
url_and_content_df
     url                                                content                                            content_length  word_count  unique_word_count
0    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nOverview of Windows Operating System¶\...  233             25          20
1    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nGetting System Details¶\nLet us unders...  463             73          56
2    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nManaging Windows System¶\nLet us under...  475             65          56
3    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nOverview of Microsoft Office¶\nAs IT P...  573             79          53
4    https://python.itversity.com/01_overview_of_wi...  \n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n  39              5           5
...  ...                                                ...                                                ...             ...         ...
165  https://python.itversity.com/19_project_web_sc...  \n\n\n\nReading the data¶\n\n\n\n\n\n              27              3           3
166  https://python.itversity.com/19_project_web_sc...  \n\n\n\nValidating data¶\n\n\n\n\n\n               26              2           2
167  https://python.itversity.com/19_project_web_sc...  \n\n\n\nApply required transformations¶\n\n\n\...  41              3           3
168  https://python.itversity.com/19_project_web_sc...  \n\n\n\nWriting to Database¶\n\n\n\n\n\n           30              3           3
169  https://python.itversity.com/19_project_web_sc...  \n\n\n\nRun queries against data¶\n\n\n\n\n\n      35              4           4

170 rows × 5 columns
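
The unique count is based on exact string matches, so "The" and "the" count as two different words. The sketch below, using illustrative sample strings, shows the effect of lowercasing before counting. Treating case variants as the same word is an assumption that may or may not suit the validation being done.

```python
import pandas as pd


def get_unique_count(words):
    # A set keeps one copy of each distinct token.
    return len(set(words))


# Illustrative content with a repeated word in two different cases.
contents = pd.Series(["the The the cat", "data data data"])

# Exact matching: "the" and "The" are distinct tokens.
unique_exact = contents.str.split().apply(get_unique_count)

# Case-folded matching: lowercase first, then count distinct tokens.
unique_folded = contents.str.lower().str.split().apply(get_unique_count)

print(unique_exact.tolist())
print(unique_folded.tolist())
```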

  • Get the number of times each word is repeated across all urls

words = url_and_content_df['content'].str.split(' ', expand=True)
words
0 1 2 3 4 5 6 7 8 9 ... 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935
0 \n\n\n\nOverview of Windows Operating System¶\n\nGetting System Details\nManaging Windows System\nOverview of ... None None None None None None None None None None
1 \n\n\n\nGetting System Details¶\nLet us understand how to get System Details ... None None None None None None None None None None
2 \n\n\n\nManaging Windows System¶\nLet us understand how to manage system effectively. ... None None None None None None None None None None
3 \n\n\n\nOverview of Microsoft Office¶\nAs IT Professionals, it is important to ... None None None None None None None None None None
4 \n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n None None None None None ... None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
165 \n\n\n\nReading the data¶\n\n\n\n\n\n None None None None None None None ... None None None None None None None None None None
166 \n\n\n\nValidating data¶\n\n\n\n\n\n None None None None None None None None ... None None None None None None None None None None
167 \n\n\n\nApply required transformations¶\n\n\n\n\n\n None None None None None None None ... None None None None None None None None None None
168 \n\n\n\nWriting to Database¶\n\n\n\n\n\n None None None None None None None ... None None None None None None None None None None
169 \n\n\n\nRun queries against data¶\n\n\n\n\n\n None None None None None None ... None None None None None None None None None None

170 rows × 3936 columns

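# Flatten the wide frame into a single Series holding the words from every page
s = words.stack().reset_index(drop=True)
# Count the occurrences of each distinct word across all pages
word_count = s.groupby(s).agg(['count'])
# Keep only the words that occur between 50 and 100 times
word_count.query('count >= 50 and count <= 100')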
count
# 99
'Friday', 55
'Monday', 52
'Saturday', 55
'Sunday', 51
'Thursday', 66
'Tuesday', 55
'Wednesday', 55
* 95
+ 66
/)\n 56
2 87
2, 71
3, 73
4 76
: 99
Data 100
Dec 53
If 85
Python 77
S 84
\\n 91
all 81
also 54
an 62
at 87
database 54
default 66
each 57
element 53
elements 56
function 96
functions 57
get 91
given 58
how 53
into 69
it 82
key 57
lambda 88
number 60
on 93
one 71
order 68
orders 69
part 64
perform 51
rows 61
set 61
should 68
string 79
such 81
that 82
type 77
typically 50
use 95
value 65
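
The counts above are totals across the whole site. If a breakdown per url is needed, one possible approach is to explode the split words into one row per (url, word) pair and then group. The following is a sketch on a hypothetical two-row data frame; the names sample_df, words_long, and per_url_counts are introduced here for illustration and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical data frame with the same 'url' and 'content' columns as url_and_content_df.
sample_df = pd.DataFrame({
    "url": ["u1", "u2"],
    "content": ["get data get", "data frame"],
})

# Split each page into a list of words, then explode to one row per (url, word).
words_long = sample_df.assign(word=sample_df["content"].str.split()).explode("word")

# Count how often each word occurs within each url.
per_url_counts = (
    words_long.groupby(["url", "word"]).size().reset_index(name="count")
)

# Site-wide totals, comparable to the groupby on the stacked Series.
overall_counts = words_long["word"].value_counts()

print(per_url_counts)
print(overall_counts)
```

The long format avoids building a wide frame with thousands of mostly empty columns, at the cost of repeating the url on every row.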