Thursday, August 18, 2016

Data Scraping, Analysis and Visualization with Python

Hello there,

From the title of this post, you should already know that we are about to explore some concepts of Data Science. That is to say: we are going to scrape some data, clean it and try to derive meaningful information from the dataset.

In order to carry out the above tasks comfortably, we need some hacking skills, math and statistics knowledge, and substantive expertise in the data we intend to use.

OK, don't worry, I am going to keep things simple so everyone can follow along. The dataset we are going to use is readily available to the public (so not much hacking is needed to get it!).

Dataset and Python libraries

The dataset we are going to use is the "Birthday list" on the home page (i.e. NairaLand Forum Members' Birthday Data). We will attempt to answer some useful questions about the dataset.

We will make use of the following Python libraries/modules/packages (any of these three names is acceptable, but in this article I am going to use the name "libraries"). So let's see the Python libraries and what we are going to use them for:-
1) re, requests, BeautifulSoup: libraries for Scraping and Cleaning the data
2) pandas, datetime: libraries for Analyzing and Visualizing the data

Let's get started...

The first thing to do is import all the libraries we are going to use:-

# libraries for Scraping and Cleaning the data
import re
import requests
from bs4 import BeautifulSoup

# libraries for Analyzing and Visualizing the data
import pandas as pd
from datetime import datetime

Scrape birthday data from Nairaland home page

You could copy out the birthdays list and clean it by hand. But if you had to repeat this process every day for a year, you would surely want to automate the data scraping and cleaning.

Here is how to go about automating the process in Python, an important stage in Data Science often referred to as web scraping and information extraction (data collection).

The birthdays list on the website is currently in this format: rodbel(29), Sirolad(29), mokei(27). So we have to clean the dataset into a tabular format useful in Python.
# Scraping out the raw html code of nairaland home page
url = ""
raw_html = requests.get(url) # returns the response for the page, including its complete html code

# print (raw_html.text)
raw_data = raw_html.text  # save the text in an object

soup_data = BeautifulSoup(raw_data, "lxml") # parse the html with the lxml parser and save the result in an object

# let's display only the part of the data we need. It is contained in the table cell tags (<td>)
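
As a minimal sketch of this step, the snippet below parses a small hand-written HTML fragment and collects the text of each table cell. The fragment is made up for illustration and is not the real Nairaland markup. It uses Python's built-in "html.parser" instead of "lxml" so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the downloaded page
sample_html = """
<table>
  <tr><td>Some unrelated text</td></tr>
  <tr><td>Birthdays: rodbel(29), Sirolad(29), mokei(27)</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all("td") returns every table cell; calling the soup object directly,
# as in soup_data("td"), is shorthand for the same call
cell_texts = [cell.get_text(strip=True) for cell in sample_soup.find_all("td")]

for text in cell_texts:
    print(text)
```

Only the second cell holds the birthday list, which is why a further filtering step is needed.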

Clean the data into a friendly format

Up to this point our raw dataset still contains some HTML tags and some unwanted text. Let's strip out all the irrelevant text and keep only the birthday list in the format: Username, Age, to be saved in a CSV file.

# let's print only the text of each table cell, ignoring the tags
for data in soup_data("td"):
    print (data.text)

# Obviously, we don't need all of the text above. So we use the 're' module to extract only the relevant birthday list

# Note: I will ignore those members whose ages are not displayed, so that we don't have to deal with NaN values in our data

member_found = None

re_match = r"\w+\(\d+\)" # one or more word characters, followed by '(', one or more digits, then ')'

for data in soup_data("td"):
    data_found = re.findall(re_match, data.text)
    if data_found:
        member_found = data_found

print (member_found)

# Let's further clean up the list to separate usernames from ages

# Use a list comprehension to replace the closing bracket ")" with an empty string "" in member_found above

member_found_replaced = [x.replace(")", "") for x in member_found]            # replaces ")" by ""

print (member_found_replaced)

# Now split each item of "member_found_replaced" on the '(' between the username and the age
# we use a for loop to go through each item of the "member_found_replaced" list above

for y in member_found_replaced:
    member_cleaned = y.split("(")
    print (member_cleaned)
# what we have in "member_cleaned" is an individual list with two elements for each member
# let's combine all the lists into a dictionary

# we first declare "member_cleaned" as an empty dictionary, so we can add each individual list above to it

member_cleaned = {}

for y in member_found_replaced:
    temp_data = y.split("(")
    member_cleaned[temp_data[0]] = int(temp_data[1])
print (member_cleaned)
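
The replace-and-split steps above can also be collapsed into a single regular expression with capture groups. This is only an alternative sketch, run on a sample string in the format quoted earlier; sample_text, pattern and members are illustrative names.

```python
import re

# Sample text in the same format as the birthday list
sample_text = "rodbel(29), Sirolad(29), mokei(27)"

# Unescaped parentheses are capture groups: group 1 is the username,
# group 2 is the age. The escaped \( and \) match the literal brackets.
# With two groups, re.findall returns a list of (username, age) tuples.
pattern = r"(\w+)\((\d+)\)"

members = {name: int(age) for name, age in re.findall(pattern, sample_text)}
print(members)
```

As with the loop above, a dictionary keeps one entry per username, so a repeated username would keep only its last age.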

# convert the dictionary "member_cleaned" above into a Pandas DataFrame
# Note: in python 3, we have to convert the dictionary items into a list to work with Pandas DataFrame

# define the column names
columns_name = ["Username", "Age"]

# df = pd.DataFrame(member_cleaned.items(), columns = columns_name )   # this is for python 2
df = pd.DataFrame(list(member_cleaned.items()), columns = columns_name )


# Let's add a column for today's date
# using the datetime module

todays_date = datetime.now().date()  # assuming the current local date is what is wanted here

df["Date"] = todays_date


# Let's save the dataframe into a csv file
# we name the csv file with the current date, i.e: 14/08/2016 will be 20160814 for the file name

csv_name = todays_date.strftime("%Y%m%d")

df.to_csv(csv_name + ".csv")

Analyze and Visualize the data

To analyze and visualize our data, below are some of the questions we are going to answer:-
a) How many members are celebrating their birthdays today?
b) Who are the oldest and youngest members celebrating their birthdays today?
c) What is the average age of the celebrants?
d) How old will each celebrant be in 10 years?
e) How old was each celebrant when NairaLand was established?

# Checking the statistical summary of the age column
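
A sketch of how questions a) to c) could be answered, using a small made-up DataFrame in place of the scraped one. The usernames and ages below are sample values only, and sample_df is an illustrative name.

```python
import pandas as pd

# Sample data standing in for the scraped DataFrame
sample_df = pd.DataFrame(
    {"Username": ["rodbel", "Sirolad", "mokei"], "Age": [29, 29, 27]}
)

# Statistical summary of the Age column (count, mean, std, min, quartiles, max)
print(sample_df["Age"].describe())

# a) number of members celebrating today
celebrants = len(sample_df)

# b) oldest and youngest celebrant (the first match if several share an age)
oldest = sample_df.loc[sample_df["Age"].idxmax(), "Username"]
youngest = sample_df.loc[sample_df["Age"].idxmin(), "Username"]

# c) average age of the celebrants
average_age = sample_df["Age"].mean()

print(celebrants, oldest, youngest, round(average_age, 2))
```

Note that idxmax and idxmin return a single row, so ties for oldest or youngest are better inspected with the sorted views that follow.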

# First 10 oldest members celebrating
df.sort_values(by="Age", ascending=False)[:10]

# First 10 youngest members celebrating
df.sort_values(by="Age", ascending=True)[:10]

# to answer: How old will each celebrant be in 10 years?
df["Age_10_Plus"] = df["Age"] + 10


# age in 2005, when NairaLand was established (2016 - 2005 = 11 years ago)
df["Age_at_2005"] = df["Age"] - 11
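
The offset of 11 is only correct for data collected in 2016. A hedged sketch that derives the offset from the collection date instead is shown below; collected_on and founded_year are illustrative names, and the ages are sample values.

```python
from datetime import date

import pandas as pd

# Sample data standing in for the scraped DataFrame
sample_df = pd.DataFrame({"Username": ["rodbel", "mokei"], "Age": [29, 27]})

collected_on = date(2016, 8, 18)  # date the data was scraped (sample value)
founded_year = 2005               # year NairaLand was established, as stated above

# Whole years between the founding year and the collection year
years_since_founding = collected_on.year - founded_year
sample_df["Age_at_2005"] = sample_df["Age"] - years_since_founding

print(sample_df)
```

Because only the years are compared, the result is accurate to within one year, and members younger than the offset will get a negative value.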


Let's plot the "First 10 youngest members celebrating"

# First 10 youngest members celebrating
youngest_10 = df.sort_values(by="Age", ascending=True)[:10].copy()  # copy, so that columns can be added to it later without a pandas warning

# To display the plot within the Jupyter notebook
%matplotlib inline

youngest_10.plot(x="Username", y="Age", kind="bar", title="10 Youngest Members Celebrating")

youngest_10.plot(x="Username", y="Age", kind="barh", title="10 Youngest Members Celebrating")

# Let's find the sum of the ages
sum_youngest_10 = youngest_10["Age"].sum()

# Let's find each member's share of that sum as a percentage and save it in a new column "Percentage"
youngest_10["Percentage"] = (youngest_10["Age"] * 100) / (sum_youngest_10)

# Now let's check the new dataframe of the first 10 youngest members

# to plot the pie chart of the Percentage column above
youngest_10["Percentage"].plot.pie(autopct='%.2f', fontsize=15, figsize=(6, 6), title="Pie Chart for 10 Youngest Members Celebrating")

# box plot on df for the three columns; if there are outliers you will see them
"""In statistics, an outlier is an observation point that is distant from other observations.
An outlier may be due to variability in the measurement or it may indicate experimental error; 
the latter are sometimes excluded from the data set."""
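
A box plot conventionally flags points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule numerically to sample ages, made up for illustration, so the outliers can be listed without drawing anything. On the real data, the plot itself could be drawn with df.plot.box().

```python
import pandas as pd

# Sample ages; 70 is deliberately far from the rest
ages = pd.Series([25, 27, 29, 29, 30, 31, 70], name="Age")

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Conventional box-plot fences: 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(outliers.tolist())
```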

# Area plot, just to compare the three columns
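
One way to draw the area plot, sketched on sample data rather than the scraped DataFrame. The Agg backend is selected only so the snippet runs outside a notebook; inside Jupyter with %matplotlib inline that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd

# Sample data standing in for the scraped DataFrame
sample_df = pd.DataFrame(
    {"Username": ["rodbel", "Sirolad", "mokei"], "Age": [29, 29, 27]}
)
sample_df["Age_10_Plus"] = sample_df["Age"] + 10
sample_df["Age_at_2005"] = sample_df["Age"] - 11

columns = ["Age", "Age_10_Plus", "Age_at_2005"]

# stacked=False overlays the three columns instead of stacking them,
# which makes it easier to compare them directly
ax = sample_df[columns].plot.area(stacked=False, title="Age columns compared")
```

The overlaid form is chosen here because a stacked area plot adds the columns together, which would hide the constant gaps between them.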

You can use the links below to download or view the above analysis in a Jupyter notebook.

Here are some important links:-
1) You can download the notebook from here.
2) You can view it online here.
3) You can view it on GitHub here.

Thank you for reading.
