Tuesday, February 9, 2021

Scrape world university data from Webometrics

In this post, I will work through the process of extracting data from the "Webometrics Ranking of World Universities" website, as requested by a client.

The website has a table with thousands of records representing the ranking of the world's universities.

Note that three of the columns ('University', 'Det.' and 'Country') contain hyperlinks. We would like to capture those hyperlinks as well, not just the visible text and icons.

For this reason, we cannot use the pandas.read_html(html_page) method, because it won't return the hyperlinks from those columns. Instead, we have to use the BeautifulSoup library to look up the hyperlinks in the source HTML after sending a GET request with the requests or selenium library.

At the end, we will save the data to a spreadsheet using the pandas library.

Summary
1) Send a GET request to the web pages - requests or selenium
2) Extract data from the response HTML content - BeautifulSoup
3) Format and save the data to a file - pandas
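
Before we begin, here are the imports used throughout this post (this assumes the requests, beautifulsoup4, and pandas packages are installed; selenium is optional and only needed if you prefer the browser-driven approach):

import requests
import pandas as pd
from bs4 import BeautifulSoup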

A quick lookup shows that to get to the next page, a page query string is added to the URL like so: http://webometrics.info/en/world?page=1, http://webometrics.info/en/world?page=2, http://webometrics.info/en/world?page=3, and so on. The first page is http://webometrics.info/en/world?page=0, and the last page is http://webometrics.info/en/world?page=120 at the time of writing.
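
As a quick sanity check, the full list of page URLs can be generated like this (assuming pages 0 through 120, as observed above):

page_urls = [f'http://webometrics.info/en/world?page={x}' for x in range(121)]
print(page_urls[0], page_urls[-1]) # first and last page URLs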

Now, let's extract the first page (http://webometrics.info/en/world?page=0) and then use a for loop to extract all the other pages.

1) Send a GET request to the web pages - using requests
import requests

url = 'http://webometrics.info/en/world'

response = requests.get(url)
html = response.text
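
It is good practice to confirm the request actually succeeded before parsing the response; requests provides raise_for_status() for exactly that:

response.raise_for_status() # raises an HTTPError for a 4xx/5xx status code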

Alternatively, if you want to use selenium, it will look like this.

from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
driver.get(url)

html = driver.page_source
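
Once the page source is captured, remember to close the browser so the chromedriver process does not linger in the background:

driver.quit() # closes the browser window and ends the chromedriver session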

Whichever method you adopt, you will arrive at the same result if you inspect the respective HTML tags properly. In this guide, I will use the requests version, not selenium.

Now that we have the HTML in a variable, let's extract the table. A quick approach would be to use the pandas.read_html(html_page) method, but the problem with this is that it won't give us the embedded hyperlinks, as already stated above.

uni_table = pd.read_html(html)
uni_table[0]

As you can see from the above, two columns (Det. and Country) contain NaN values, which we don't want. To overcome this challenge, we have to use BeautifulSoup to get all the details we want.

soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all("table") # this returns a list of all the tables on the page

rows = table[0].find_all('tr')
print (len(rows)) # check the length of the rows

# remove the table header...
rows.pop(0) # del rows[0]

# --------------------------
data_list = []
for r in rows:
    cells = r.find_all('td') # grab the row's cells once instead of re-querying for each column
    data_dict = {}

    data_dict['Ranking']         = cells[0].text
    data_dict['University']      = cells[1].text
    data_dict['University Link'] = cells[1].find('a')['href']
    data_dict['Details']         = 'http://webometrics.info' + cells[2].find('a')['href']
    data_dict['Country']         = cells[3].find('img')['src'] # URL of the country flag icon
    data_dict['Impact Rank']     = cells[4].text
    data_dict['Openness Rank']   = cells[5].text
    data_dict['Excellence Rank'] = cells[6].text

    data_list.append(data_dict)

data_list

This will return a list of dictionaries that we can convert to a DataFrame and subsequently save to a spreadsheet file.

cols = ['Ranking', 'University', 'University Link', 'Details', 'Country', 'Impact Rank', 'Openness Rank', 'Excellence Rank']

data_list_df = pd.DataFrame(data_list, columns=cols)
data_list_df.to_excel('University_Data.xlsx', index=False)
data_list_df
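
Note that to_excel() needs an Excel writer engine installed; for .xlsx files pandas typically relies on the openpyxl package, so if you get an ImportError, install it first:

pip install openpyxl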

Now, this is just for the first page. To extract the other pages, we will put the same script inside a for loop that iterates over the remaining page numbers (1 to 120).

i = 1
df_list = []

for x in range(1, 121):
    url = f'http://webometrics.info/en/world?page={x}'
    print('Processing...', url)

    # ------------------------------------
    response = requests.get(url)
    html = response.text

    # ------------------------------------
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all("table") # this returns a list of all the tables on the page

    rows = table[0].find_all('tr')
    print (len(rows)) # check the length of the rows

    # remove the table header...
    rows.pop(0) # del rows[0]

    # --------------------------
    data_list = []
    for r in rows:
        cells = r.find_all('td') # grab the row's cells once instead of re-querying for each column
        data_dict = {}

        data_dict['Ranking']         = cells[0].text
        data_dict['University']      = cells[1].text
        data_dict['University Link'] = cells[1].find('a')['href']
        data_dict['Details']         = 'http://webometrics.info' + cells[2].find('a')['href']
        data_dict['Country']         = cells[3].find('img')['src'] # URL of the country flag icon
        data_dict['Impact Rank']     = cells[4].text
        data_dict['Openness Rank']   = cells[5].text
        data_dict['Excellence Rank'] = cells[6].text

        data_list.append(data_dict)

    # ------------------------------------
    cols = ['Ranking', 'University', 'University Link', 'Details', 'Country', 'Impact Rank', 'Openness Rank', 'Excellence Rank']
    data_list_df = pd.DataFrame(data_list, columns=cols)
    
    data_list_df.to_excel(str(i) + '_UniData.xlsx', index=False)

    df_list.append(data_list_df)
    
    i += 1

print('Finished....')
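
Notice that df_list now holds one DataFrame per page but is never combined. As a final step, here is a minimal sketch that merges them into a single spreadsheet (the file name All_University_Data.xlsx is just an illustrative choice; this covers pages 1 to 120, so append the page-0 DataFrame from earlier if you want the complete set):

# combine all the per-page DataFrames into one and save a single file
all_pages_df = pd.concat(df_list, ignore_index=True)
all_pages_df.to_excel('All_University_Data.xlsx', index=False)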

That is it!
