Analyzing Website Data using Pandas
Now that we understand how to get data from a website using BeautifulSoup, let us work through a few scenarios to validate that data using Pandas.
Create data frame using url_and_content_list.
Get length of content for each url.
Get word count for each url.
Get the list of pages with 30 words or fewer.
Get unique word count for each url.
Get the number of times each word is repeated across all urls.
%run 08_processing_data_using_data_frame_apis.ipynb
CPU times: user 5.55 s, sys: 80.1 ms, total: 5.63 s
Wall time: 7.76 s
https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html : 233
https://python.itversity.com/01_overview_of_windows_os/02_getting_system_details.html : 463
https://python.itversity.com/01_overview_of_windows_os/03_managing_windows_system.html : 475
https://python.itversity.com/01_overview_of_windows_os/04_overview_of_microsoft_office.html : 573
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html : 39
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html : 41
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html : 38
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html : 28
https://python.itversity.com/04_postgres_database_operations/01_postgres_database_operations.html : 504
https://python.itversity.com/04_postgres_database_operations/02_overview_of_sql.html : 714
Create data frame using url_and_content_list
import pandas as pd
url_and_content_df = pd.DataFrame(url_and_content_list, columns=['url', 'content'])
Get length of content for each url
url_and_content_df['content_length'] = url_and_content_df['content'].str.len()
url_and_content_df
 | url | content | content_length |
---|---|---|---|
0 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nOverview of Windows Operating System¶\... | 233 |
1 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nGetting System Details¶\nLet us unders... | 463 |
2 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nManaging Windows System¶\nLet us under... | 475 |
3 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nOverview of Microsoft Office¶\nAs IT P... | 573 |
4 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n | 39 |
... | ... | ... | ... |
165 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nReading the data¶\n\n\n\n\n\n | 27 |
166 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nValidating data¶\n\n\n\n\n\n | 26 |
167 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nApply required transformations¶\n\n\n\... | 41 |
168 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nWriting to Database¶\n\n\n\n\n\n | 30 |
169 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nRun queries against data¶\n\n\n\n\n\n | 35 |
170 rows × 3 columns
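Note that content_length counts every character, including the newlines that surround each page title. The sketch below uses two hypothetical rows (the url values page_a and page_b are placeholders, and the stripped_length column is an illustrative addition) to compare the raw length with the length after trimming whitespace.

```python
import pandas as pd

# Hypothetical sample rows that mimic the scraped content shown above
sample_df = pd.DataFrame(
    [
        ("page_a", "\n\n\n\nPolymorphism¶\n\n\n\n\n\n"),
        ("page_b", "\n\n\n\nMethods¶\n\n\n\n\n\n"),
    ],
    columns=["url", "content"],
)

# Raw length counts every character, including the newlines
sample_df["content_length"] = sample_df["content"].str.len()

# Length after trimming leading and trailing whitespace (illustrative column)
sample_df["stripped_length"] = sample_df["content"].str.strip().str.len()

print(sample_df[["url", "content_length", "stripped_length"]])
```

A large gap between the two columns suggests that a page is mostly whitespace around its title.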
Get word count for each url
url_and_content_df['word_count'] = url_and_content_df['content'].str.split(' ').str.len()
url_and_content_df.sort_values('word_count')
 | url | content | content_length | word_count |
---|---|---|---|---|
117 | https://python.itversity.com/14_overview_of_ob... | \n\n\n\nPolymorphism¶\n\n\n\n\n\n | 23 | 1 |
113 | https://python.itversity.com/14_overview_of_ob... | \n\n\n\nConstructors¶\n\n\n\n\n\n | 23 | 1 |
114 | https://python.itversity.com/14_overview_of_ob... | \n\n\n\nMethods¶\n\n\n\n\n\n | 18 | 1 |
115 | https://python.itversity.com/14_overview_of_ob... | \n\n\n\nInheritance¶\n\n\n\n\n\n | 22 | 1 |
116 | https://python.itversity.com/14_overview_of_ob... | \n\n\n\nEncapsulation¶\n\n\n\n\n\n | 24 | 1 |
... | ... | ... | ... | ... |
130 | https://python.itversity.com/15_overview_of_pa... | \n\n\n\nJoining Data Frames¶\nLet us understan... | 20249 | 1975 |
106 | https://python.itversity.com/13_understanding_... | \n\n\n\nRow level transformations using map¶\n... | 23527 | 2105 |
42 | https://python.itversity.com/07_pre_defined_fu... | \n\n\n\nString Manipulation Functions¶\nLet us... | 15592 | 3724 |
123 | https://python.itversity.com/15_overview_of_pa... | \n\n\n\nData Frames - Basic Operations¶\nHere ... | 18321 | 3752 |
50 | https://python.itversity.com/08_user_defined_f... | \n\n\n\nDoc Strings¶\nDocumentation is one of ... | 16347 | 3936 |
170 rows × 4 columns
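Splitting on a single space treats words separated only by a newline as one token, which is why pages that contain just a title show a word_count of 1. As a possible alternative, calling str.split() without a separator splits on any run of whitespace. The sample string below is hypothetical and only illustrates the difference.

```python
import pandas as pd

# Hypothetical content with a newline between the title and the first sentence
sample = pd.Series(["\n\n\n\nGetting System Details¶\nLet us understand"])

# Splitting on a single space keeps "Details¶\nLet" together as one token
space_split_count = sample.str.split(" ").str.len()

# Splitting on any whitespace separates the title from the sentence
whitespace_split_count = sample.str.split().str.len()

print(space_split_count.tolist(), whitespace_split_count.tolist())
```

Which of the two is appropriate depends on whether newlines should act as word boundaries for the validation being done.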
Get the list of pages with 30 words or fewer
for url in url_and_content_df.query('word_count <= 30')['url']:
    print(url)
https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html
https://python.itversity.com/07_pre_defined_functions/01_pre_defined_functions.html
https://python.itversity.com/12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
https://python.itversity.com/13_understanding_map_reduce_libraries/08_using_groupby.html
https://python.itversity.com/13_understanding_map_reduce_libraries/09_limitations_of_map_reduce_libraries.html
https://python.itversity.com/14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
https://python.itversity.com/14_overview_of_object_oriented_programming/02_classes_and_objects.html
https://python.itversity.com/14_overview_of_object_oriented_programming/03_constructors.html
https://python.itversity.com/14_overview_of_object_oriented_programming/04_methods.html
https://python.itversity.com/14_overview_of_object_oriented_programming/05_inheritance.html
https://python.itversity.com/14_overview_of_object_oriented_programming/06_encapsulation.html
https://python.itversity.com/14_overview_of_object_oriented_programming/07_polymorphism.html
https://python.itversity.com/14_overview_of_object_oriented_programming/08_dynamic_classes.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/02_problem_statement.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/03_installing_pre-requisites.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/04_overview_of_beautifulsoup.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/05_getting_html_content.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/06_processing_html_content.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/07_creating_data_frame.html
https://python.itversity.com/16_web_scraping_using_beautifulsoup/08_processing_data_using_data_frame_apis.html
https://python.itversity.com/19_project_web_scraping_into_database/02_define_problem_statement.html
https://python.itversity.com/19_project_web_scraping_into_database/03_setup_project.html
https://python.itversity.com/19_project_web_scraping_into_database/04_overview_of_python_virtual_environments.html
https://python.itversity.com/19_project_web_scraping_into_database/05_installing_required_libraries.html
https://python.itversity.com/19_project_web_scraping_into_database/06_setup_logging.html
https://python.itversity.com/19_project_web_scraping_into_database/07_modularizing_the_project.html
https://python.itversity.com/19_project_web_scraping_into_database/08_setup_database.html
https://python.itversity.com/19_project_web_scraping_into_database/10_create_required_table.html
https://python.itversity.com/19_project_web_scraping_into_database/11_reading_the_data.html
https://python.itversity.com/19_project_web_scraping_into_database/12_validating_data.html
https://python.itversity.com/19_project_web_scraping_into_database/13_apply_required_transformations.html
https://python.itversity.com/19_project_web_scraping_into_database/14_writing_to_database.html
https://python.itversity.com/19_project_web_scraping_into_database/15_run_queries_against_data.html
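The same kind of filter can also be written with a boolean mask instead of query. The sketch below is one way to do it on a small hypothetical data frame (the url values and word counts are made up), and it also shows how `<=` and `<` differ for a page with exactly 30 words.

```python
import pandas as pd

# Hypothetical word counts for three pages
sample_df = pd.DataFrame(
    {
        "url": ["page_a", "page_b", "page_c"],
        "word_count": [25, 73, 30],
    }
)

# Pages with 30 words or fewer, collected into a plain Python list
short_urls = sample_df.loc[sample_df["word_count"] <= 30, "url"].tolist()

# Pages with strictly fewer than 30 words
strictly_short_urls = sample_df.loc[sample_df["word_count"] < 30, "url"].tolist()

print(short_urls)
print(strictly_short_urls)
```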
Get unique word count for each url
def get_unique_count(words):
    return len(set(words))
url_and_content_df['unique_word_count'] = url_and_content_df.apply(func=lambda cols: get_unique_count(cols['content'].split(' ')), axis=1)
url_and_content_df
 | url | content | content_length | word_count | unique_word_count |
---|---|---|---|---|---|
0 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nOverview of Windows Operating System¶\... | 233 | 25 | 20 |
1 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nGetting System Details¶\nLet us unders... | 463 | 73 | 56 |
2 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nManaging Windows System¶\nLet us under... | 475 | 65 | 56 |
3 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nOverview of Microsoft Office¶\nAs IT P... | 573 | 79 | 53 |
4 | https://python.itversity.com/01_overview_of_wi... | \n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n | 39 | 5 | 5 |
... | ... | ... | ... | ... | ... |
165 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nReading the data¶\n\n\n\n\n\n | 27 | 3 | 3 |
166 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nValidating data¶\n\n\n\n\n\n | 26 | 2 | 2 |
167 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nApply required transformations¶\n\n\n\... | 41 | 3 | 3 |
168 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nWriting to Database¶\n\n\n\n\n\n | 30 | 3 | 3 |
169 | https://python.itversity.com/19_project_web_sc... | \n\n\n\nRun queries against data¶\n\n\n\n\n\n | 35 | 4 | 4 |
170 rows × 5 columns
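The unique count is case sensitive, so "Data" and "data" are counted as two different words. The sketch below uses hypothetical sample content to show a column-level variant that maps over the split tokens, along with a lowercase version. The column name unique_word_count_ci is an illustrative assumption.

```python
import pandas as pd

# Hypothetical content values
sample_df = pd.DataFrame({"content": ["Data data frames", "let us let us go"]})

# Split each content string into tokens and count the distinct tokens
tokens = sample_df["content"].str.split(" ")
sample_df["unique_word_count"] = tokens.map(lambda words: len(set(words)))

# Lowercase first so that "Data" and "data" count as the same word
lower_tokens = sample_df["content"].str.lower().str.split(" ")
sample_df["unique_word_count_ci"] = lower_tokens.map(lambda words: len(set(words)))

print(sample_df)
```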
Get the number of times each word is repeated across all urls
words = url_and_content_df['content'].str.split(' ', expand=True)
words
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3926 | 3927 | 3928 | 3929 | 3930 | 3931 | 3932 | 3933 | 3934 | 3935 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n\n\n\nOverview | of | Windows | Operating | System¶\n\nGetting | System | Details\nManaging | Windows | System\nOverview | of | ... | None | None | None | None | None | None | None | None | None | None |
1 | \n\n\n\nGetting | System | Details¶\nLet | us | understand | how | to | get | System | Details | ... | None | None | None | None | None | None | None | None | None | None |
2 | \n\n\n\nManaging | Windows | System¶\nLet | us | understand | how | to | manage | system | effectively. | ... | None | None | None | None | None | None | None | None | None | None |
3 | \n\n\n\nOverview | of | Microsoft | Office¶\nAs | IT | Professionals, | it | is | important | to | ... | None | None | None | None | None | None | None | None | None | None |
4 | \n\n\n\nOverview | of | Editors | and | IDEs¶\n\n\n\n\n\n | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
165 | \n\n\n\nReading | the | data¶\n\n\n\n\n\n | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
166 | \n\n\n\nValidating | data¶\n\n\n\n\n\n | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
167 | \n\n\n\nApply | required | transformations¶\n\n\n\n\n\n | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
168 | \n\n\n\nWriting | to | Database¶\n\n\n\n\n\n | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
169 | \n\n\n\nRun | queries | against | data¶\n\n\n\n\n\n | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
170 rows × 3936 columns
s = words.stack().reset_index(drop=True)
word_count = s.groupby(s).agg(['count'])
word_count.query('count >= 50 and count <= 100')
 | count |
---|---|
# | 99 |
'Friday', | 55 |
'Monday', | 52 |
'Saturday', | 55 |
'Sunday', | 51 |
'Thursday', | 66 |
'Tuesday', | 55 |
'Wednesday', | 55 |
* | 95 |
+ | 66 |
/)\n | 56 |
2 | 87 |
2, | 71 |
3, | 73 |
4 | 76 |
: | 99 |
Data | 100 |
Dec | 53 |
If | 85 |
Python | 77 |
S | 84 |
\\n | 91 |
all | 81 |
also | 54 |
an | 62 |
at | 87 |
database | 54 |
default | 66 |
each | 57 |
element | 53 |
elements | 56 |
function | 96 |
functions | 57 |
get | 91 |
given | 58 |
how | 53 |
into | 69 |
it | 82 |
key | 57 |
lambda | 88 |
number | 60 |
on | 93 |
one | 71 |
order | 68 |
orders | 69 |
part | 64 |
perform | 51 |
rows | 61 |
set | 61 |
should | 68 |
string | 79 |
such | 81 |
that | 82 |
type | 77 |
typically | 50 |
use | 95 |
value | 65 |
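The counts above pool the words from every page, because the index is dropped before grouping. If a breakdown per url is needed, one possible approach is to reshape the data into one row per (url, word) pair with explode and then group on both columns. This is only a sketch on hypothetical sample data; the names long_df, per_url_counts, and overall_counts are illustrative.

```python
import pandas as pd

# Hypothetical sample standing in for url_and_content_df
sample_df = pd.DataFrame(
    {
        "url": ["page_a", "page_b"],
        "content": ["let us get data", "get data get rows"],
    }
)

# One row per (url, word) pair instead of one wide column per word position
long_df = (
    sample_df.assign(word=sample_df["content"].str.split(" "))
    .explode("word")[["url", "word"]]
)

# Number of times each word appears within each url
per_url_counts = long_df.groupby(["url", "word"]).size().reset_index(name="count")

# Overall counts across all urls, comparable to the pooled result above
overall_counts = long_df["word"].value_counts()

print(per_url_counts)
print(overall_counts)
```

The long format also avoids building a very wide data frame padded with None values, which can matter when a few pages are much longer than the rest.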