In this post, I will work through the process of extracting data from the "Webometrics Ranking of World Universities" website, as requested by my client.
The website presents a table with thousands of records representing the ranking of the world's universities, as seen above.
Note that three of the columns ('University', 'Det.' and 'Country') contain hyperlinks. We would like to get those hyperlinks as well, not just the icons.
import requests

url = 'http://webometrics.info/en/world'
response = requests.get(url)
html = response.text
Alternatively, if you want to use Selenium, it will look like this.
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')  # path to the ChromeDriver executable
driver.get(url)
html = driver.page_source
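A quick aside: newer Selenium releases (4.x) no longer accept the driver path as a plain positional argument, so if the snippet above raises a TypeError, the Service-based form below should work instead. 'chromedriver.exe' is just the assumed local path to your driver.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4+: the driver path goes through a Service object.
driver = webdriver.Chrome(service=Service('chromedriver.exe'))
driver.get(url)
html = driver.page_source
driver.quit()  # close the browser once you have the page source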
Whichever method you adopt, you will arrive at the same result if you inspect the respective HTML tags properly. In this guide, I will use the version from the requests module, not Selenium.
Now that we have the HTML in a variable, let's extract the table. A quick approach is to use the pandas.read_html(html) method, but the problem with this is that it won't give us the embedded hyperlinks, as already stated above.
import pandas as pd

uni_table = pd.read_html(html)  # returns a list of DataFrames, one per table found
uni_table[0]
As you can see from the output above, two columns ('Det.' and 'Country') contain NaN values; we don't want that. To overcome this, we will use BeautifulSoup to get all the details we want.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table')  # returns a list of all the tables on the page
rows = table[0].find_all('tr')
print(len(rows))  # check the number of rows

rows.pop(0)  # remove the table header row

data_list = []
for r in rows:
    cells = r.find_all('td')
    data_dict = {}
    data_dict['Ranking'] = cells[0].text
    data_dict['University'] = cells[1].text
    data_dict['University Link'] = cells[1].find('a')['href']
    data_dict['Details'] = 'http://webometrics.info' + cells[2].find('a')['href']
    data_dict['Country'] = cells[3].find('img')['src']
    data_dict['Impact Rank'] = cells[4].text
    data_dict['Openness Rank'] = cells[5].text
    data_dict['Excellence Rank'] = cells[6].text
    data_list.append(data_dict)

data_list
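One caveat: the 'Country' value captured above is the flag icon's image path, not the country name itself. If the img tag turns out to carry the name in an alt or title attribute (an assumption you would need to verify against the live page), you could capture that instead inside the loop, falling back to the icon path:

img = cells[3].find('img')
# 'alt' here is hypothetical; check the actual markup before relying on it.
data_dict['Country'] = img.get('alt') or img['src']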
This returns a list of dictionaries that we can convert to a DataFrame and subsequently save to a spreadsheet file. (Note that pandas' to_excel needs an Excel engine such as openpyxl installed.)
cols = ['Ranking', 'University', 'University Link', 'Details', 'Country',
        'Impact Rank', 'Openness Rank', 'Excellence Rank']

data_list_df = pd.DataFrame(data_list, columns=cols)
data_list_df.to_excel('University_Data.xlsx', index=False)
data_list_df
Now, this covers just the first page. To extract the other pages, we will wrap the script in a function and use a for loop to iterate over the pages.
def scrape_page(page):
    """Scrape one page of the ranking table and return it as a DataFrame."""
    url = f'http://webometrics.info/en/world?page={page}'
    print('Processing...', url)

    response = requests.get(url)
    html = response.text

    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')  # list of all the tables on the page
    rows = table[0].find_all('tr')
    print(len(rows))  # check the length of the rows

    rows.pop(0)  # remove the table header row

    data_list = []
    for r in rows:
        cells = r.find_all('td')
        data_dict = {}
        data_dict['Ranking'] = cells[0].text
        data_dict['University'] = cells[1].text
        data_dict['University Link'] = cells[1].find('a')['href']
        data_dict['Details'] = 'http://webometrics.info' + cells[2].find('a')['href']
        data_dict['Country'] = cells[3].find('img')['src']
        data_dict['Impact Rank'] = cells[4].text
        data_dict['Openness Rank'] = cells[5].text
        data_dict['Excellence Rank'] = cells[6].text
        data_list.append(data_dict)

    cols = ['Ranking', 'University', 'University Link', 'Details', 'Country',
            'Impact Rank', 'Openness Rank', 'Excellence Rank']
    return pd.DataFrame(data_list, columns=cols)


df_list = []
for x in range(1, 121):
    page_df = scrape_page(x)
    page_df.to_excel(f'{x}_UniData.xlsx', index=False)
    df_list.append(page_df)

print('Finished....')
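The loop above leaves one DataFrame per page in df_list. As a final step (my own addition, not part of the original script), you can concatenate them into a single master spreadsheet:

# Stack the per-page DataFrames into one and save a combined workbook.
all_pages_df = pd.concat(df_list, ignore_index=True)
all_pages_df.to_excel('All_University_Data.xlsx', index=False)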
That is it!