Many developer authors publish useful content on their blogs on a regular basis.
As a learning enthusiast, it is a great idea to use the skills you learned from them to keep track of what is new on their blogs.
The two most common ways of achieving this are an API and web scraping. So, first check whether the author's blog offers an API service, and if it doesn't, fall back to web scraping.
The authors I want to look up in this post are Renan Moura, William Vincent, and Flavio Copes.
As of the time of writing, none of these authors has an API on their website, so we will use web scraping to keep track of the latest posts on their blogs. So, basically, we will write a scraper that stores the data in a file, then compare that file against future scraped data to pick out the newest entries (a sketch of that comparison step closes this post).
There are several libraries for scraping websites; here I will use Python's requests/selenium, beautifulsoup, and pandas to get the job done.
Let's get started...
From Renan Moura's blog, I would like to keep track of the following post variables: category, title, title URL, published date, and updated date.
Using the requests library with its default headers, I got a "406 Not Acceptable" client error response, which suggests the server filters out requests that don't look like they come from a real browser. To get around this, we can either send the request with a browser User-Agent header, or use selenium to drive an actual browser.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://renanmf.com'

# Get a browser User-Agent string from: http://www.useragentstring.com/
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
response.raise_for_status()  # fail early if the server still rejects us
html = response.text

soup = BeautifulSoup(html, 'html.parser')
# Each post on the homepage sits in a div with class 'card-content'
article = soup.find_all("div", {'class':'card-content'})
print(len(article))
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://renanmf.com'

# Selenium 4.6+ resolves the driver automatically via Selenium Manager;
# on older versions, pass the path to your chromedriver binary instead.
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
driver.quit()  # close the browser once we have the page source

soup = BeautifulSoup(html, 'html.parser')
article = soup.find_all("div", {'class':'card-content'})
print(len(article))
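A side note: if you would rather not have a Chrome window pop up on every run, Selenium can drive Chrome headless. A minimal sketch, using standard Selenium options (nothing specific to this blog):

from selenium import webdriver

# Run Chrome without a visible window
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)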
Using the articles from either method above, we can now loop through them and extract the fields we want:
data_list = []
for art in article:
    # Pull each field out of the post card
    category = art.find("li", {'class':'meta-categories'}).text
    title_txt = art.find("h2", {'class':'entry-title'}).text
    title_link = art.find("h2", {'class':'entry-title'}).find('a')['href']
    pub_date = art.find("li", {'class':'meta-date'}).text
    updated_date = art.find("li", {'class':'meta-updated-date'}).text
    data = category, title_txt, title_link, pub_date, updated_date
    data_list.append(data)
# ------------------------
data_list_df = pd.DataFrame(data_list, columns=['Category', 'Title', 'Title URL', 'Published Date', 'Updated Date'])
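To support the store-and-compare plan from the intro, we can persist this DataFrame to disk right away; the file name below is just my choice:

data_list_df.to_csv('renanmf_posts.csv', index=False)  # hypothetical file name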
From William Vincent's blog, we will get the following post variables: title, title URL, and published date.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://wsvincent.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
# Posts here are plain <li> elements; non-post <li>s are skipped in the loop below
article = soup.find_all("li")
# ------------------------
data_list = []
for art in article:
    heading = art.find('h2')
    if heading is None:
        continue  # skip <li> elements that are not post entries
    title = heading.text
    title_link = heading.find('a')['href']
    pub_date = art.find('span', {'class':'post-meta'}).text
    data = title, title_link, pub_date
    data_list.append(data)
# ------------------------
data_list_df = pd.DataFrame(data_list, columns=['Title', 'Title URL', 'Published Date'])
Flavio Copes's blog is similar to William Vincent's above; we will get the following post variables: title, title URL, and published date.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://flaviocopes.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
# Each post stub is an <li> with class 'post-stub'
article = soup.find_all("li", {'class':'post-stub'})
# ---------------
data_list = []
for art in article:
    title = art.find('h4').text
    title_link = art.find('a')['href']
    pub_date = art.find("time", {'class':'post-stub-date'}).text
    data = title, title_link, pub_date
    data_list.append(data)
data_list_df = pd.DataFrame(data_list, columns=['Title', 'Title URL', 'Published Date'])
data_list_df
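Finally, here is the store-and-compare step promised at the start. This is a minimal sketch, not a polished implementation: the CSV path is a hypothetical name (keep one file per blog), and I am assuming the title URL is a stable unique key for a post:

import os
import pandas as pd

csv_path = 'flaviocopes_posts.csv'  # hypothetical file name; one per blog

if os.path.exists(csv_path):
    old_df = pd.read_csv(csv_path)
    # A post is new if its URL was not present in the previous snapshot
    new_posts = data_list_df[~data_list_df['Title URL'].isin(old_df['Title URL'])]
    print(new_posts)
else:
    print('First run: nothing to compare against yet.')

# Overwrite the snapshot with the latest scrape for next time
data_list_df.to_csv(csv_path, index=False)

Run this after each scrape and new_posts will hold only the entries that appeared since the last run.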
Happy scraping!