Sunday, January 9, 2022

Keeping track of some favorite developers' websites

There are many developer authors who publish useful content on their blogs on a regular basis.

If you enjoy learning from them, it is a great idea to use the very skills they taught you to keep track of what is new on their blogs.

The two most common ways of achieving this are APIs and web scraping. So, first research whether the author's blog offers an API service, and in the case where it doesn't, fall back to web scraping.

The authors I want to look up in this post are Renan Moura, William Vincent, and Flavio Copes.

As of the time of writing, none of the above authors has an API implemented on their website, so we will use web scraping to keep track of the latest posts on their blogs. Basically, we will write a scraper that stores the data in a file, then compare that file against future scraped data to surface the newest entries on the blogs (a sketch of this store-and-compare step follows the first example below).

There are several libraries for scraping websites; here I will use Python's requests/selenium, beautifulsoup, and pandas to get the job done.


Let's get started...


1- Renan Moura


From Renan Moura's blog, I would like to keep track of the following post variables: category, title, title URL, published date, and updated date.

Using the requests library alone, I got a "406 Not Acceptable" client error response, which means the server hosting the website runs a bot manager that rejects requests that don't look like they come from a browser. To overcome this, we can either send the request with a browser User-Agent header or use selenium to access the website.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://renanmf.com'
# Get user-agent from: http://www.useragentstring.com/
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
html = response.text

soup = BeautifulSoup(html, 'html.parser')
article = soup.find_all("div", {'class':'card-content'})

print(len(article))


import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver


url = 'https://renanmf.com'

# chromedriver.exe must match your installed Chrome version
driver = webdriver.Chrome('chromedriver.exe')
driver.get(url)

html = driver.page_source
driver.quit()


soup = BeautifulSoup(html, 'html.parser')
article = soup.find_all("div", {'class':'card-content'})

print(len(article))
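
If you would rather not have a Chrome window pop up on every run, selenium can also drive the browser headlessly. A minimal sketch, assuming the same chromedriver.exe setup as above:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome('chromedriver.exe', options=options)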


Using the soup from either of the methods above, we can now loop through the articles and extract the fields we want, as seen below:

data_list = []
for art in article:
    # each field lives inside the article's 'card-content' div
    category = art.find("li", {'class':'meta-categories'}).text
    title_txt = art.find("h2", {'class':'entry-title'}).text
    title_link = art.find("h2", {'class':'entry-title'}).find('a')['href']
    pub_date = art.find("li", {'class':'meta-date'}).text
    updated_date = art.find("li", {'class':'meta-updated-date'}).text

    # one tuple per post; becomes one row in the DataFrame below
    data = category, title_txt, title_link, pub_date, updated_date

    data_list.append(data)

# ------------------------
data_list_df = pd.DataFrame(data_list, columns=['Category', 'Title', 'Title URL', 'Published Date', 'Updated Date'])
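
Now that the scraped rows sit in a DataFrame, we can implement the store-and-compare step mentioned at the start: persist a snapshot to disk and, on each run, flag the rows whose URLs were not there before. A minimal sketch; the file name posts_renan.csv and the helper name find_new_posts are my own illustrative choices:

import os

def find_new_posts(scraped_df, store_path):
    # a post counts as new if its URL is absent from the stored snapshot
    if os.path.exists(store_path):
        stored = pd.read_csv(store_path)
        new_posts = scraped_df[~scraped_df['Title URL'].isin(stored['Title URL'])]
    else:
        new_posts = scraped_df  # first run: every post is new
    scraped_df.to_csv(store_path, index=False)  # overwrite the snapshot for next run
    return new_posts

print(find_new_posts(data_list_df, 'posts_renan.csv'))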




2- William Vincent


Here, we will get the following post variables: title, title URL, and published date.

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://wsvincent.com/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

response = requests.get(url, headers=headers)
html = response.text

soup = BeautifulSoup(html, 'html.parser')
article = soup.find_all("li")  # grabs every <li> on the page, so we filter out non-post items below

# ------------------------


data_list = []

for art in article:
    h2 = art.find('h2')
    if h2 is None:
        continue  # skip <li> elements that aren't post entries (e.g. navigation links)
    title = h2.text
    title_link = h2.find('a')['href']
    pub_date = art.find('span', {'class':'post-meta'}).text

    data = title, title_link, pub_date

    data_list.append(data)
    
# ------------------------

    
data_list_df = pd.DataFrame(data_list, columns=['Title', 'Title URL', 'Published Date'])
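
The same find_new_posts helper from the first section works here too, just with its own snapshot file (again an illustrative name):

print(find_new_posts(data_list_df, 'posts_wsvincent.csv'))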




3- Flavio Copes


Flavio's blog is similar to William Vincent's above; we will get the following post variables: title, title URL, and published date.


import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://flaviocopes.com'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

response = requests.get(url, headers=headers)
html = response.text

soup = BeautifulSoup(html, 'html.parser')
article = soup.find_all("li", {'class':'post-stub'})
# ---------------

data_list = []

for art in article:
    title = art.find('h4').text
    title_link = art.find('a')['href']
    pub_date = art.find("time", {'class':'post-stub-date'}).text
    
    data = title, title_link, pub_date
    
    data_list.append(data)
    

    
data_list_df = pd.DataFrame(data_list, columns=['Title', 'Title URL', 'Published Date'])

data_list_df    
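
To watch all three blogs at once, we could wrap each of the snippets above in a function that returns its DataFrame and feed each one to the find_new_posts helper. The scrape_* names below are hypothetical stand-ins for those wrappers:

# scrape_renan, scrape_wsvincent and scrape_flavio are assumed to wrap
# the three snippets above, each returning its data_list_df
for name, scrape in [('renan', scrape_renan),
                     ('wsvincent', scrape_wsvincent),
                     ('flavio', scrape_flavio)]:
    new_posts = find_new_posts(scrape(), f'posts_{name}.csv')
    if not new_posts.empty:
        print(f'New posts from {name}:')
        print(new_posts[['Title', 'Title URL']])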




Happy scraping!
