Friday, December 23, 2016

6 Python web scraping libraries you can't afford to ignore

Hello there,
In this post I will share with you some excellent Python web scraping libraries. For the benefit of the reader who doesn't know what web scraping is, here is a quick introduction.

Web Scraping: a technique employed to extract large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database in table (spreadsheet) format (source: webharvy.com). Web Scraping is also called Screen Scraping, Web Data Extraction, Web Harvesting etc.

Different programming languages have different techniques for web data scraping. Here I will present to you some web data scraping tools as used in the Python programming language.




Here is the list:

1) Selenium:

Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but it is certainly not limited to just that. Boring web-based administration tasks can (and should!) also be automated.



2) urllib:

urllib is a package that collects several modules for working with URLs. The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.
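Since urllib ships with Python, no installation is needed. Here is a small sketch of building a GET request with query parameters and a custom header; the URL, parameters, and User-Agent string are placeholders I chose for illustration.

```python
from urllib.request import Request, urlopen
from urllib.parse import urlencode

def build_request(url, params=None, user_agent="my-scraper/0.1"):
    """Attach an encoded query string and a User-Agent header to a GET request."""
    if params:
        url = url + "?" + urlencode(params)
    # Many sites reject the default Python User-Agent, so set our own.
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://example.com/search", {"q": "python"})
# urlopen(req).read() would perform the actual download
```

Passing the `Request` object to `urlopen` returns a file-like response whose `.read()` gives the raw page bytes.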



3) Mechanize:

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.



4) Requests:

Requests is one of the most downloaded Python packages of all time, pulling in over 7,000,000 downloads every month. Requests describes itself as the only Non-GMO HTTP library for Python, safe for human consumption: HTTP for Humans. Besides, all the cool kids are doing it.
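To show how cleanly Requests handles query-string encoding, here is a sketch that builds a request without sending it; the URL, parameters, and User-Agent are placeholder values of my own.

```python
import requests

# Build a GET request but do not send it yet, so we can inspect how
# requests encodes the query parameters into the final URL.
prepared = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "python", "page": 2},
    headers={"User-Agent": "my-scraper/0.1"},
).prepare()

# prepared.url now carries the encoded query string;
# requests.Session().send(prepared) would perform the actual fetch.
```

In everyday scraping you would simply call `requests.get(url, params=..., timeout=10)` and read `.text` or `.json()` off the response; the prepared-request form just makes the encoding step visible.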



5) Splinter:

Splinter is an open source tool for testing web applications using Python. It lets you automate browser actions, such as visiting URLs and interacting with their items.



6) Scrapy:

Scrapy is a fast and powerful scraping and web crawling framework: an open source, collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.




Thank you for following.
