Friday, October 6, 2017

Scraping StaticGen.com with Python

This is a short Python Web Scraping Tutorial

We will learn how to scrape data from the StaticGen website and save it into a CSV file for further processing.

Let's get our hands dirty...

Prerequisite

You should already have at least some basic knowledge of the following:-
1- HTML and CSS
2- Python


Inspecting the site's HTML structure

Load the website in your browser and study the HTML structure. Use whatever tool or browser you like for this; I used the Google Chrome browser and it looks like the screenshots below:-




[Screenshot: Ctrl+U to view the page source]

[Screenshot: Firebug to view the HTML code]

[Screenshot: HTML Inspector (Ctrl+Shift+I)]

As you may have noticed, the data we want to scrape is arranged in rows and columns on the web page. This makes things easier, since we have a consistent pattern to follow.


Now, there are many libraries in Python that you can use to scrape and clean data from a website like this. Some of these libraries include Requests, BeautifulSoup, Selenium, Pandas, etc.

Each of these libraries has a slightly different approach to extracting data from a web page. For example, if you want to use the Selenium library, you will have to download and configure a web driver for the browser you are using.

In most cases, you will use a combination of different libraries to fully complete a task. In this tutorial, I am going to use a combination of Selenium and Pandas to complete the scraping task.


Web Scraping with Python: Selenium and Pandas

You can install Selenium and Pandas in your Python environment simply by running: pip install <packageName> (for example, pip install selenium and pip install pandas).

More details can be found on their respective official websites:
Selenium = http://www.seleniumhq.org
Pandas = http://pandas.pydata.org

Now, I will assume you have installed and configured the above modules. So let's start scraping...



Looking closely, you will see that the xPath changes incrementally from one row to the next. Following the pattern, I created a text file listing all the xPaths while skipping the odd (4th) xPath. This text file is then read line by line and each xPath is passed into the find_element_by_xpath() method.

This may not be the most effective approach, but it gets the job done.
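As a side note, you could also generate the xpaths.txt file from Python rather than typing it by hand. The sketch below only illustrates the idea; the xPath template in it is a placeholder, not the actual path on StaticGen, so replace it with the pattern you see in your own inspector:

# Hypothetical sketch: write incrementing xPaths to xpaths.txt
# The template below is a placeholder; use the real pattern from your inspector
xpath_template = '//*[@id="grid"]/li[{}]/a/h4'

with open('xpaths.txt', 'w') as f:
    for n in range(1, 31):
        if n % 4 == 0:  # skip every 4th entry, as described above
            continue
        f.write(xpath_template.format(n) + '\n')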

Here is the code:-

from selenium import webdriver
import pandas as pd

# Create an instance of the Chrome driver (point it at your own chromedriver.exe)
driver = webdriver.Chrome('C:\\Users\\user\\Documents\\Jupyter_Notebooks\\chromedriver_win32\\chromedriver.exe')

url = 'https://www.staticgen.com/'

# Read the xPaths (one per line) into a DataFrame; column 0 holds the xPaths
xp = pd.read_csv('xpaths.txt', header=None)

# Open the URL in the browser
driver.get(url)

# ====================================

# Find each element by its xPath and collect it
ele_list = []
for x in xp[0]:
    ele = driver.find_element_by_xpath(x)
    ele_list.append(ele)

# ====================================
# Extract the text from each element and save it to a CSV file
lang_text_list = []
for lang in ele_list:
    lang_text = lang.text
    lang_text_list.append(lang_text)

lang_text_df = pd.DataFrame(lang_text_list)
lang_text_df.to_csv("siteGen.csv")
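If you are following along with a newer Selenium release (4.x and later), note that the find_element_by_xpath() helper has been removed and the equivalent call uses a By locator instead. A minimal sketch of the same loop in that style, which also closes the browser once the text has been collected:

from selenium.webdriver.common.by import By

lang_text_list = []
for x in xp[0]:
    ele = driver.find_element(By.XPATH, x)  # Selenium 4.x style locator
    lang_text_list.append(ele.text)

driver.quit()  # close the browser when done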


Alternative code using the Requests and BeautifulSoup libraries

# Import the libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.staticgen.com/"
raw_html = requests.get(url)

# Get the response body as text (not used further, kept for reference)
txt = raw_html.text

# Get the response body as bytes
content = raw_html.content

# Parse the content with BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

# print (soup.prettify())

# Find all 'li' tags with the class 'project'
all_li = soup.find_all('li', {"class":"project"})

# test the length
# len(all_li)

# Extract the title from just one/individual object to confirm the pattern
all_li[0].find('h4', {'class':"title"}).text

# =================================

# Let's iterate over the all_li object
items_title = []
items_url_title = []


for item in all_li:
    try:
        title = item.find('h4', {'class':"title"}).text # Titles
        url_title = item.find('h6', {"class":"url"}).text # URLs
#         print (item.find('dl')) # Language
                
        items_title.append(title)
        items_url_title.append(url_title)

    except AttributeError:
        continue
#         print (item.find('h4', {'class':"title"}).text)
# =================================
df_title = pd.DataFrame(items_title)
df_url = pd.DataFrame(items_url_title)
df_title[1] = df_url
# =================================
df_title[2] = lang_text_df  # lang_text_df comes from the Selenium run above
df_title.to_csv("siteGen.csv")
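If you prefer a CSV with readable column headers, here is a small optional variation of my own, using the same items_title and items_url_title lists built above:

# Optional: build one DataFrame with named columns and write it out
df = pd.DataFrame({
    "title": items_title,
    "url": items_url_title,
})
df.to_csv("siteGen_named.csv", index=False)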



Thank you for following.
