## Analyzing Website Data using Pandas

Now that we understand how to get data from a website using BeautifulSoup, let us work through a few scenarios to validate that data using Pandas.
* Create a data frame using `url_and_content_list`.
* Get the length of the content for each url.
* Get the word count for each url.
* Get the list of pages whose content has 30 words or fewer.
* Get the unique word count for each url.
* Get the number of times each word is repeated across all urls.

In [1]:
%run 08_processing_data_using_data_frame_apis.ipynb

CPU times: user 5.55 s, sys: 80.1 ms, total: 5.63 s
Wall time: 7.76 s
https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html : 233
https://python.itversity.com/01_overview_of_windows_os/02_getting_system_details.html : 463
https://python.itversity.com/01_overview_of_windows_os/03_managing_windows_system.html : 475
https://python.itversity.com/01_overview_of_windows_os/04_overview_of_microsoft_office.html : 573
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html : 39
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html : 41
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html : 38
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html : 28
https://python.itversity.com/04_postgres_database_operations/01_postgres_database_operations.html : 504
https://python.itversity.com/04_postgres_database_operations/02_overview_

* Create a data frame using `url_and_content_list`

In [2]:
import pandas as pd

url_and_content_df = pd.DataFrame(url_and_content_list, columns=['url', 'content'])
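
The sketch below shows the same construction on a tiny stand-in list. It assumes `url_and_content_list` is a list of `(url, content)` tuples; the name `sample_url_and_content_list` and the example.com urls are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical stand-in for url_and_content_list (assumed shape: list of 2-tuples)
sample_url_and_content_list = [
    ('https://example.com/page_1.html', '\n\nFirst page\nLet us get started.'),
    ('https://example.com/page_2.html', '\n\nSecond page\n'),
]

# Each tuple becomes one row; `columns` names the two fields in order
sample_df = pd.DataFrame(sample_url_and_content_list, columns=['url', 'content'])

print(sample_df.shape)          # (2, 2)
print(list(sample_df.columns))  # ['url', 'content']
```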

* Get the length of the content for each url

In [3]:
url_and_content_df['content_length'] = url_and_content_df['content'].str.len()

In [4]:
url_and_content_df

| | url | content | content_length |
|---|---|---|---|
| 0 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nOverview of Windows Operating System¶\...` | 233 |
| 1 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nGetting System Details¶\nLet us unders...` | 463 |
| 2 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nManaging Windows System¶\nLet us under...` | 475 |
| 3 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nOverview of Microsoft Office¶\nAs IT P...` | 573 |
| 4 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n` | 39 |
| ... | ... | ... | ... |
| 165 | `https://python.itversity.com/19_project_web_sc...` | `\n\n\n\nReading the data¶\n\n\n\n\n\n` | 27 |
| 166 | `https://python.itversity.com/19_project_web_sc...` | `\n\n\n\nValidating data¶\n\n\n\n\n\n` | 26 |
| 167 | `https://python.itversity.com/19_project_web_sc...` | `\n\n\n\nApply required transformations¶\n\n\n\...` | 41 |
| 168 | `https://python.itversity.com/19_project_web_sc...` | `\n\n\n\nWriting to Database¶\n\n\n\n\n\n` | 30 |
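
Note that `str.len()` counts every character, including the newlines around the heading. As a rough check, the sketch below rebuilds the content shown for row 166, assuming each displayed `\n` is a single newline character, and compares the raw length with the length after stripping surrounding whitespace. The column name `stripped_length` is introduced here only for illustration.

```python
import pandas as pd

# Content as displayed for row 166 (assumption: each \n is one newline character)
sample_df = pd.DataFrame({'content': ['\n\n\n\nValidating data¶\n\n\n\n\n\n']})

# Raw length: 4 leading newlines + 16 visible characters + 6 trailing newlines
sample_df['content_length'] = sample_df['content'].str.len()

# Length after removing leading and trailing whitespace
sample_df['stripped_length'] = sample_df['content'].str.strip().str.len()

print(sample_df[['content_length', 'stripped_length']])
```

Under that assumption the raw length comes out as 26, matching the value in the table, while the stripped length is 16.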


* Get the word count for each url

In [5]:
url_and_content_df['word_count'] = url_and_content_df['content'].str.split(' ').str.len()

In [6]:
url_and_content_df.sort_values('word_count')

| | url | content | content_length | word_count |
|---|---|---|---|---|
| 117 | `https://python.itversity.com/14_overview_of_ob...` | `\n\n\n\nPolymorphism¶\n\n\n\n\n\n` | 23 | 1 |
| 113 | `https://python.itversity.com/14_overview_of_ob...` | `\n\n\n\nConstructors¶\n\n\n\n\n\n` | 23 | 1 |
| 114 | `https://python.itversity.com/14_overview_of_ob...` | `\n\n\n\nMethods¶\n\n\n\n\n\n` | 18 | 1 |
| 115 | `https://python.itversity.com/14_overview_of_ob...` | `\n\n\n\nInheritance¶\n\n\n\n\n\n` | 22 | 1 |
| 116 | `https://python.itversity.com/14_overview_of_ob...` | `\n\n\n\nEncapsulation¶\n\n\n\n\n\n` | 24 | 1 |
| ... | ... | ... | ... | ... |
| 130 | `https://python.itversity.com/15_overview_of_pa...` | `\n\n\n\nJoining Data Frames¶\nLet us understan...` | 20249 | 1975 |
| 106 | `https://python.itversity.com/13_understanding_...` | `\n\n\n\nRow level transformations using map¶\n...` | 23527 | 2105 |
| 42 | `https://python.itversity.com/07_pre_defined_fu...` | `\n\n\n\nString Manipulation Functions¶\nLet us...` | 15592 | 3724 |
| 123 | `https://python.itversity.com/15_overview_of_pa...` | `\n\n\n\nData Frames - Basic Operations¶\nHere ...` | 18321 | 3752 |
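
Splitting on a single space treats words separated only by a newline as one token. The sketch below, on a made-up sample string, contrasts `str.split(' ')` with `str.split()` called without a pattern, which splits on any run of whitespace. It is offered as an alternative to consider, not as the approach used for the numbers above.

```python
import pandas as pd

# Made-up sample: a heading followed by a sentence on the next line
content = pd.Series(['\n\n\n\nGetting System Details¶\nLet us understand'])

# Splitting on ' ' keeps 'Details¶\nLet' together as a single token
space_split_count = content.str.split(' ').str.len()

# Splitting with no pattern breaks on spaces and newlines alike
whitespace_split_count = content.str.split().str.len()

print(space_split_count.iloc[0])       # 5
print(whitespace_split_count.iloc[0])  # 6
```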


* Get the list of pages whose content has 30 words or fewer

In [8]:
for url in url_and_content_df.query('word_count <= 30')['url']:
    print(url)

https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html
https://python.itversity.com/07_pre_defined_functions/01_pre_defined_functions.html
https://python.itversity.com/12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
https://python.itversity.com/13_understanding_map_reduce_libraries/08_using_groupby.html
https://python.itversity.com/13_understanding_map_reduce_libraries/09_limitations_of_map_reduce_libraries.html
https://python.itversity.com/14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
https://python.itversity.com/14_overview_of_object
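
The boundary matters here: `<=` includes pages with exactly 30 words, while `<` does not. The sketch below shows both filters on a hypothetical three-row frame, once with `query` and once with an equivalent boolean mask.

```python
import pandas as pd

# Hypothetical frame with word counts around the boundary
sample_df = pd.DataFrame({
    'url': ['a.html', 'b.html', 'c.html'],
    'word_count': [29, 30, 31],
})

# query with <= keeps the page that has exactly 30 words
at_most_30 = sample_df.query('word_count <= 30')['url'].tolist()

# A boolean mask with < drops it
fewer_than_30 = sample_df[sample_df['word_count'] < 30]['url'].tolist()

print(at_most_30)     # ['a.html', 'b.html']
print(fewer_than_30)  # ['a.html']
```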

* Get the unique word count for each url

In [82]:
def get_unique_count(words):
    # A set keeps one copy of each word, so its size is the unique word count
    return len(set(words))

In [84]:
url_and_content_df['unique_word_count'] = url_and_content_df.apply(func=lambda cols: get_unique_count(cols['content'].split(' ')), axis=1)

In [85]:
url_and_content_df

| | url | content | content_length | word_count | unique_word_count |
|---|---|---|---|---|---|
| 0 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nOverview of Windows Operating System¶\...` | 233 | 25 | 20 |
| 1 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nGetting System Details¶\nLet us unders...` | 463 | 73 | 56 |
| 2 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nManaging Windows System¶\nLet us under...` | 475 | 65 | 56 |
| 3 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nOverview of Microsoft Office¶\nAs IT P...` | 573 | 79 | 53 |
| 4 | `https://python.itversity.com/01_overview_of_wi...` | `\n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n` | 39 | 5 | 5 |
| ... | ... | ... | ... | ... | ... |
| 165 | `https://python.itversity.com/19_project_web_sc...` | `\n\n\n\nReading the data¶\n\n\n\n\n\n` | 27 | 3 | 3 |
| 166 | `https://python.itversity.com/19_project_web_sc...` | `\n\n\n\nValidating data¶\n\n\n\n\n\n` | 26 | 2 | 2 |
| 167 | `https://python.itversity.com/19_project_web_sc...` | `\n\n\n\nApply required transformations¶\n\n\n\...` | 41 | 3 | 3 |
| 168 | `https://python.itversity.com/19_project_web_sc...` | `\n\n\n\nWriting to Database¶\n\n\n\n\n\n` | 30 | 3 | 3 |
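
Because only the `content` column is needed, the same result can be obtained without a row-wise `apply` over the whole data frame. The sketch below applies `get_unique_count` directly to the split lists, using two made-up sentences.

```python
import pandas as pd

def get_unique_count(words):
    # The size of the set is the number of distinct words
    return len(set(words))

# Made-up sample content
sample_df = pd.DataFrame({'content': ['to be or not to be', 'let us get started now']})

# Split each string into a list, then count distinct entries per list
sample_df['unique_word_count'] = (
    sample_df['content'].str.split(' ').apply(get_unique_count)
)

print(sample_df['unique_word_count'].tolist())  # [4, 5]
```

Note that the comparison is case-sensitive and punctuation stays attached, so `System` and `system` count as two different words.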


* Get the number of times each word is repeated across all urls

In [95]:
words = url_and_content_df['content'].str.split(' ', expand=True)
words

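| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3926 | 3927 | 3928 | 3929 | 3930 | 3931 | 3932 | 3933 | 3934 | 3935 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `\n\n\n\nOverview` | `of` | `Windows` | `Operating` | `System¶\n\nGetting` | `System` | `Details\nManaging` | `Windows` | `System\nOverview` | `of` | ... | | | | | | | | | | |
| 1 | `\n\n\n\nGetting` | `System` | `Details¶\nLet` | `us` | `understand` | `how` | `to` | `get` | `System` | `Details` | ... | | | | | | | | | | |
| 2 | `\n\n\n\nManaging` | `Windows` | `System¶\nLet` | `us` | `understand` | `how` | `to` | `manage` | `system` | `effectively.` | ... | | | | | | | | | | |
| 3 | `\n\n\n\nOverview` | `of` | `Microsoft` | `Office¶\nAs` | `IT` | `Professionals,` | `it` | `is` | `important` | `to` | ... | | | | | | | | | | |
| 4 | `\n\n\n\nOverview` | `of` | `Editors` | `and` | `IDEs¶\n\n\n\n\n\n` | | | | | | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 165 | `\n\n\n\nReading` | `the` | `data¶\n\n\n\n\n\n` | | | | | | | | ... | | | | | | | | | | |
| 166 | `\n\n\n\nValidating` | `data¶\n\n\n\n\n\n` | | | | | | | | | ... | | | | | | | | | | |
| 167 | `\n\n\n\nApply` | `required` | `transformations¶\n\n\n\n\n\n` | | | | | | | | ... | | | | | | | | | | |
| 168 | `\n\n\n\nWriting` | `to` | `Database¶\n\n\n\n\n\n` | | | | | | | | ... | | | | | | | | | | |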


In [119]:
s = words.stack().reset_index(drop=True)

In [120]:
word_count = s.groupby(s).agg(['count'])
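
As a possible alternative to the wide `expand=True` frame followed by `stack`, the words can be kept as lists, flattened with `explode`, and counted with `value_counts`. The sketch below uses made-up content and the hypothetical names `all_words` and `overall_counts`; it yields a Series rather than a data frame with a `count` column.

```python
import pandas as pd

# Made-up content for two pages
content = pd.Series(['to be or not to be', 'to do'])

# One list of words per page, then one row per word
all_words = content.str.split(' ').explode()

# Occurrences of each word across all pages
overall_counts = all_words.value_counts()

print(overall_counts['to'])  # 3
print(overall_counts['be'])  # 2
```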

In [130]:
word_count.query('count >= 50 and count <= 100')

| | count |
|---|---|
| `#` | 99 |
| `'Friday',` | 55 |
| `'Monday',` | 52 |
| `'Saturday',` | 55 |
| `'Sunday',` | 51 |
| `'Thursday',` | 66 |
| `'Tuesday',` | 55 |
| `'Wednesday',` | 55 |
| `*` | 95 |
| `+` | 66 |
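
The counts above are totals over all pages, because `reset_index(drop=True)` discards the row index that tied each word to its page. If counts per url are wanted instead, one option is to keep the `url` column while flattening. The sketch below assumes a frame with `url` and `content` columns and uses made-up data; `words_per_url` and `per_url_word_count` are hypothetical names.

```python
import pandas as pd

# Made-up pages
sample_df = pd.DataFrame({
    'url': ['a.html', 'b.html'],
    'content': ['to be or not to be', 'to do'],
})

# Attach the list of words to each row, then emit one row per (url, word)
words_per_url = (
    sample_df
    .assign(word=sample_df['content'].str.split(' '))
    .explode('word')
)

# Count how often each word occurs within each url
per_url_word_count = (
    words_per_url
    .groupby(['url', 'word'])
    .size()
    .rename('count')
    .reset_index()
)

print(per_url_word_count)
```

For this sample, `to` appears twice in `a.html` and once in `b.html`, so the two pages are reported separately rather than summed.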
