Thursday, September 2, 2021

Nairaland website Data Scraper

₦airaLand.com Data Scraper Bot

In this information age, the need and importance of extracting data from the web is becoming increasingly obvious.

Over the years, attempts have been made by developers to duplicate the Nairaland website structure using different programming languages such as PHP, Python, .NET, Perl, Ruby, C#, and Java. Unfortunately, little or no attempt has been made to scrape or extract useful data from the forum for legitimate purposes.

If you know nairaland.com, then I don't need to tell you that it is the Nigerian equivalent of Facebook or Twitter, housing abundant information related to Nigeria and its environs. So, as a data person, you know what that means!

If you want to measure the opinion of Nigerians online, don't use data from sources like Facebook or Twitter; instead, use the data from Nairaland. At the moment, nairaland.com has about 1.5 million active user accounts (90% of them Nigerians residing in the country), and more than 3 million topics on different subjects have been created.

The problem now is how to extract this data legally and freely, without breaking the site or your pocket.

Of course, you can always copy, paste, and edit content from any section of the forum. But in situations where you have to do this repeatedly, you need a way to automate the process to ease your task.

Imagine having to copy the titles of the topics that made the front page every day: you would select all the content, copy and paste it into a text editor, and edit it into a friendly format. And you would do that every single day! Wouldn't it be nice to have a script/program that does it for you with just a mouse click?


Legal Warning before Scraping a website

There are a few points that we need to go over before we start scraping.

~ Always check the website's terms and conditions before you scrape it. They usually have terms that limit how often you can scrape or what you can scrape.
~ Because your script will run much faster than a human can browse, make sure you don’t hammer their website with lots of requests. This may even be covered in the terms and conditions of the website.
~ You can get into legal trouble if you overload a website with your requests or you attempt to use it in a way that violates the terms and conditions you agreed to.
~ Websites change all the time, so your scraper will break some day. Know this: You will have to maintain your scraper if you want it to keep working.
~ Unfortunately the data you get from websites can be a mess. As with any data parsing activity, you will need to clean it up to make it useful to you.
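The "don't hammer their website" point above can be enforced in code. Below is a minimal sketch of a rate limiter that guarantees a minimum delay between consecutive requests; the class name, the 2-second default, and the injectable clock/sleep hooks are my own choices, not anything Nairaland prescribes.

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock      # injectable for testing
        self.sleep = sleep      # injectable for testing
        self._last = None       # time of the previous call

    def wait(self):
        """Block until at least min_delay seconds have passed since the last call."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()
```

Call `limiter.wait()` right before each `requests.get(...)` so your scraper never fires faster than one request every couple of seconds.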

With that out of the way, let's start scraping data from Nairaland!

Here I present a solution that allows you to scrape or extract the following datasets from Nairaland:-
1) Front Page Topics
2) Members and Guests Online
3) Section Topics and poster usernames
4) First Post content (Original Post) from thread
5) Images from thread
6) Email Addresses from thread


How to use the data

What can the scraped data be used for?
Data from Nairaland can be used for Data Science, Machine Learning, Computer Vision, etc., as follows:-

~ Text mining
~ Sentiment Analysis
~ Natural Language Processing
~ Polls and Opinions Study
~ Trends Analysis
~ Market Research
~ Automatic summarization
~ Machine translation
~ Named entity recognition
~ Relationship extraction
~ Speech recognition
~ Word embeddings
~ Topic segmentation
~ etc.
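As a tiny taste of the text-mining item above, here is a word-frequency sketch over a handful of made-up topic titles (the titles and the stop-word list are illustrative, not real scraped data):

```python
import re
from collections import Counter

# Illustrative titles -- in practice these would come from the scraper below.
titles = [
    "Naira Gains Against Dollar At Official Market",
    "Lagos Traffic: Commuters Lament Long Queues",
    "Dollar Scarcity Hits Lagos Markets",
]

STOPWORDS = {"a", "an", "the", "at", "in", "on", "of", "and", "against"}

def top_words(titles, n=3):
    """Return the n most common non-stop-words across all titles."""
    words = []
    for title in titles:
        words += [w for w in re.findall(r"[a-z']+", title.lower())
                  if w not in STOPWORDS]
    return Counter(words).most_common(n)

print(top_words(titles, n=2))
```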


Understand Nairaland Structure

The structure of the website has remained pretty much the same for some years now. See how the forum looked in 2005, 2011, 2014, and 2017.

[Screenshots: the forum homepage in 2005, 2011, 2014, and 2017]
This means a web crawler script written for Nairaland will remain functional for a long time, until the structure changes.

Also, the HTML structure uses a lot of tables. Most of the data we will be scraping is nested in HTML table structures.

requests, BeautifulSoup, and re are the primary Python modules used for this web scraping. Let's see the code:-

1) Front Page Topics
import requests
from bs4 import BeautifulSoup

url = 'https://www.nairaland.com'

# Use requests to read the page HTML...
res = requests.get(url)
html_data = res.text

# Use bs4 to parse the HTML. The contents on the front page are in tables;
# the front page topics are in a td with 'class':'featured w'...
soup = BeautifulSoup(html_data, 'html.parser')
table = soup.find_all('td', {'class':'featured w'})

front_page_topics = table[0].find_all('a')

for tp in front_page_topics[:65]:
    print(tp.text, tp['href'])
    print('---------')
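If you want the front-page topics in a friendly format rather than printed to the screen, you can write them to CSV. A sketch, assuming you have collected `(title, href)` pairs as in the loop above (the sample pairs here are made up):

```python
import csv

BASE = "https://www.nairaland.com"

def save_topics_csv(topics, path):
    """Write (title, relative-href) pairs to a CSV with absolute URLs."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url"])
        for title, href in topics:
            writer.writerow([title, BASE + href])

# Illustrative pairs -- in practice: [(tp.text, tp['href']) for tp in front_page_topics]
sample = [("Topic A", "/1234/topic-a"), ("Topic B", "/5678/topic-b")]
save_topics_csv(sample, "front_page.csv")
```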


2) Members and Guests Online
import requests
from bs4 import BeautifulSoup

url = 'https://www.nairaland.com'

# Use requests to read the page HTML...
res = requests.get(url)
html_data = res.text

# Use bs4 to parse the HTML...
soup = BeautifulSoup(html_data, 'html.parser')
table = soup.find_all('table')

# The members and guests online are in the 4th table...
print(table[3].text)
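The stats blob from that table can be reduced to actual numbers with a regular expression. A sketch, assuming the text contains counts shaped like "2,345 members and 10,872 guests" (the sample string below is made up; adjust the pattern if the live wording differs):

```python
import re

def parse_counts(stats_text):
    """Extract (members, guests) integers from the online-stats text."""
    m = re.search(r"([\d,]+)\s+members.*?([\d,]+)\s+guests", stats_text)
    if not m:
        return None
    members, guests = (int(g.replace(",", "")) for g in m.groups())
    return members, guests

# Illustrative input -- in practice pass table[3].text
print(parse_counts("Viewing this board: 2,345 members and 10,872 guests online"))
```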


3) Section Topics and poster usernames
# Let's use the programming section as a case study...
import requests
from bs4 import BeautifulSoup

url = 'https://www.nairaland.com/programming'

# Use requests to read the page HTML...
res = requests.get(url)
html_data = res.text

# Use bs4 to parse the HTML...
soup = BeautifulSoup(html_data, 'html.parser')
table = soup.find_all('table')

# The section topics are in the 3rd table...
rows = table[2].find_all('td')

# Topic links are relative (they start with '/'); skip anchors without an href...
for a in rows[0].find_all('a'):
    try:
        if a['href'].startswith('/'):
            print(a.text)
            print(a['href'])
            print('------------')
    except KeyError:
        pass
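A section holds far more topics than one page shows. From my observation (worth re-checking against the live site), Nairaland paginates sections as /programming, /programming/1, /programming/2, and so on. A small helper can build the page URLs to loop over:

```python
BASE = "https://www.nairaland.com"

def section_page_urls(section, pages):
    """Build URLs for the first `pages` pages of a section.
    Page 0 is /<section>; later pages are /<section>/<n> (assumed scheme)."""
    urls = []
    for p in range(pages):
        urls.append(f"{BASE}/{section}" if p == 0 else f"{BASE}/{section}/{p}")
    return urls

print(section_page_urls("programming", 3))
```

Loop over these URLs with the same requests/BeautifulSoup code as above, pausing between requests so you don't hammer the site, as warned earlier.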



4) First Post content (Original Post) from thread
import requests
from bs4 import BeautifulSoup

url = 'https://www.nairaland.com/6674244/youthful-woman-celebrates-80th-birthday'

# Use requests to read the page HTML...
res = requests.get(url)
html_data = res.text

# Use bs4 to parse the HTML...
soup = BeautifulSoup(html_data, 'html.parser')
table = soup.find_all('table', {'summary':'posts'})

# The first div with class 'narrow' holds the original post...
post = table[0].find_all('div', {'class':'narrow'})
print(post[0].text)


5) Images from thread
import requests
from bs4 import BeautifulSoup

url = 'https://www.nairaland.com/6674244/youthful-woman-celebrates-80th-birthday'

# Use requests to read the page HTML...
res = requests.get(url)
html_data = res.text

# Use bs4 to parse the HTML...
soup = BeautifulSoup(html_data, 'html.parser')
images = soup.find_all('img')

# Some img tags may lack a src attribute, so use .get()...
images_src = [img.get('src') for img in images if img.get('src')]
print(images_src)
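The `src` values collected above can be relative paths, so they need to be made absolute before downloading. A sketch using `urllib.parse.urljoin` (the sample srcs below are made up for illustration):

```python
from urllib.parse import urljoin

def absolute_image_urls(srcs, page_url):
    """Resolve relative src values against the page URL; absolute ones pass through."""
    return [urljoin(page_url, src) for src in srcs]

page_url = "https://www.nairaland.com/6674244/youthful-woman-celebrates-80th-birthday"
print(absolute_image_urls(["/attachments/sample_photo", "https://example.com/a.png"], page_url))
```

To actually download each image, fetch the absolute URL with `requests.get(url).content` and write the bytes to a file opened in `'wb'` mode.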


6) Email Addresses from thread
import re
import requests

url = 'https://www.nairaland.com/6631572/nairaland-fantasy-premier-league-2021'

# Use requests to read the page HTML...
res = requests.get(url)
html_data = res.text

# Match common email patterns anywhere in the raw HTML...
emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", html_data)
clean_emails = list(set(emails))  # Remove duplicates...
print(clean_emails)
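One caveat with that regex: it happily matches image filenames like `logo@2x.png`, since `png` looks like a valid TLD, and `list(set(...))` scrambles the original order. Here is a small cleaner that filters such false positives and deduplicates case-insensitively while preserving order (the extension list is my own guess at common offenders):

```python
FALSE_POSITIVE_ENDINGS = (".png", ".jpg", ".jpeg", ".gif", ".webp")

def clean_email_list(emails):
    """Drop image-name false positives and duplicates, keeping first-seen order."""
    seen, cleaned = set(), []
    for email in emails:
        low = email.lower()
        if low.endswith(FALSE_POSITIVE_ENDINGS):
            continue
        if low not in seen:
            seen.add(low)
            cleaned.append(email)
    return cleaned

print(clean_email_list(["a@b.com", "logo@2x.png", "A@B.com"]))
```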

You can download the entire code from this Jupyter notebook.

Thanks for reading, and happy scraping.
