July 30, 2018
Have you ever taken the trouble to copy the content of a web page by hand and save that data to your local computer? Or have you ever wondered how data is extracted from millions of URLs?
If you are wondering what process lies behind this technique, this article will give you the essential information.
Web Scraping and How It Works
Web scraping is also known as data scraping or data extraction. It uses a software application to extract information from websites and web pages. The main goal of the technique is to transform unstructured data (typically in HTML format) into structured, usable data.
The work is carried out by a piece of code called a 'scraper'. To gather useful data from an HTML document, the scraper first sends a GET request over HTTP to the target website. Based on the response, it reads the HTML of that web page, which you can store on your computer, and then extracts the result you are looking for.
Web scraping can be performed in many ways and in practically any programming language. Python is a popular choice because it makes web scraping easier: its ease of use and rich ecosystem of libraries make scraping with Python very effective.
Web Scraping Vs Web Crawling
The basic difference between the two terms follows from their names. Scraping means extracting data from specified websites, whereas crawling means moving through the content of numerous websites and indexing it, as a search engine does.
Crawlers are used to build an index for the user and to surface useful website URLs by indexing them. You do not need web crawling to do web scraping, but web crawling does require a small amount of web scraping.
Introduction to Web Scraping using Python
One option for web scraping is Scrapy, an open source web crawling framework written in Python. Because it is easy to use and fast, it can be very useful for web scraping. We can also use Beautiful Soup, a library for extracting data from XML or HTML.
Let us start web scraping with the help of an example. Suppose we have a list of the 10 fastest cars in the world and would like to see the top 5 based on their views and popularity. We'll see which sports cars have the most views and followers.
We'll use Python 3 and a Python virtual environment for this example. With web scraping and Python, this is easy to achieve. Web scraping selects some of the data that you've downloaded from the web and passes it along to another process.
To start with the web scraper, you need to set up a virtual environment for Python 3. Python 3 ships with the venv module for this: running python3 -m venv venv creates the environment, and . venv/bin/activate activates it on Linux or macOS.
You'll also need to install these packages using pip.
To perform HTTP requests, the requests package has to be installed.
To handle all the HTML processing, BeautifulSoup4 has to be installed.
Both can be installed with a single command: pip install requests BeautifulSoup4
After installing these packages, create a file called cars.py and put the import statements for requests and BeautifulSoup at the top.
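The import statements themselves are not shown above. A minimal sketch of what the top of cars.py might contain, assuming the requests and BeautifulSoup4 packages are installed (the use of contextlib.closing is an assumption, included so that a response can be closed automatically once it has been read):

```python
# cars.py: imports used by the scraper sketches in this article
from contextlib import closing  # assumption: used to close responses automatically

from requests import get
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
```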
The first step is to download the web pages, and the requests package helps with this. It handles all the HTTP tasks in Python. For this example, you are only going to need the requests.get() function.
A first helper function gets the content of a particular URL by making an HTTP GET request. It returns the text content if the response is some kind of HTML or XML. If not, it returns None.
A second helper checks the response: it returns True if the response appears to be HTML and False otherwise.
A third helper prints errors to a log, which can be useful when something goes wrong.
The function simple_get() takes a single URL argument and makes a GET request to that URL. If everything goes smoothly, it returns the content of that URL as raw HTML. If there is a problem, such as the server being down or the request being denied, the function returns None.
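The three helpers described above could be sketched roughly as follows. Only the name simple_get() comes from this article; is_good_response(), log_error(), the 10-second timeout and the exact Content-Type check are assumptions made for illustration.

```python
from contextlib import closing

from requests import get
from requests.exceptions import RequestException


def is_good_response(resp):
    """Return True if the response seems to be HTML, False otherwise."""
    content_type = resp.headers.get("Content-Type", "").lower()
    return resp.status_code == 200 and "html" in content_type


def log_error(e):
    """Print the error; a real project might write to a log file instead."""
    print(e)


def simple_get(url):
    """Make an HTTP GET request to url and return the raw HTML, or None."""
    try:
        # stream=True defers the body download until resp.content is read
        with closing(get(url, stream=True, timeout=10)) as resp:
            if is_good_response(resp):
                return resp.content
            return None
    except RequestException as e:
        log_error(f"Error during request to {url}: {e}")
        return None
```

Returning None instead of raising keeps the calling code simple: the caller only has to check for None before parsing.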
After collecting the raw HTML from the URL, you can select and extract data from it using the document's structure. We will use BeautifulSoup for this purpose. BeautifulSoup parses the raw HTML and produces a structured document. To see how BeautifulSoup works, let us take a quick example of HTML.
Suppose a short HTML document containing a few <p> elements, one of them with id="car", is saved as example.html. You can then use BeautifulSoup on it as follows:
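The original listing is not available, so here is a minimal sketch of the idea. The HTML content is hypothetical and is kept in a string that stands in for the contents of example.html; with a saved file you would read it from disk instead.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of example.html
raw_html = """<!DOCTYPE html>
<html>
<head>
  <title>Example</title>
</head>
<body>
<p id="car">I am the car</p>
<p id="bike">I am the bike</p>
</body>
</html>"""

# The second argument names the parser back end to use
html = BeautifulSoup(raw_html, "html.parser")

# select('p') returns every <p> element; keep the text of the one with id="car"
car_texts = [p.text for p in html.select("p") if p["id"] == "car"]
print(car_texts)
```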
If we break down this example, we'll see that the raw HTML is passed to the BeautifulSoup constructor. The second argument supplied is 'html.parser'. BeautifulSoup accepts several back-end parsers, but html.parser is the one that comes with Python's standard library.
The select() method lets you use CSS selectors to locate elements in the document; it is called on the html object. In the given example, html.select('p') returns a list of paragraph elements. As the line if p['id'] == 'car' shows, each p exposes its HTML attributes, which can be accessed like a dictionary. For the element <p id="car">, the id attribute is equal to the string 'car'.
Now it is time to decide what to pass to the select() function. When you look at the car names in your web browser, each name appears inside an <li> tag. Generally, we look for class or id attributes, or anything else that uniquely identifies the information we want to extract.
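To illustrate how a class or id attribute narrows a selection, here is a small sketch. The markup, the class name car-name and the ids ranking and footer-links are all hypothetical; a real page will use its own names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; real pages will use their own tag, class and id names
raw_html = """
<ul id="ranking">
  <li class="car-name">Car A</li>
  <li class="car-name">Car B</li>
</ul>
<ul id="footer-links">
  <li>About</li>
</ul>
"""

html = BeautifulSoup(raw_html, "html.parser")

all_items = html.select("li")          # every <li>, including the footer one
by_class = html.select("li.car-name")  # only <li class="car-name">
by_id = html.select("#ranking li")     # <li> elements inside id="ranking"

print(len(all_items), len(by_class), len(by_id))
```

The plain 'li' selector also picks up unrelated list items, which is why a class or id is usually the safer handle.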
You can search for the fastest cars in your web browser and then examine the attributes of the page's elements. Let us take a look with Python.
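A quick way to take that look is to print the text of every <li> element. The fragment below is a hypothetical stand-in for a downloaded page; in practice raw_html would be the value returned by simple_get() for the page you are examining.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one <li> holds several names on separate lines
raw_html = "<ol><li>Car A\nCar B\nCar C</li><li>Car D</li></ol>"

html = BeautifulSoup(raw_html, "html.parser")
li_texts = [li.text for li in html.select("li")]

for i, text in enumerate(li_texts):
    # repr() makes the newline characters inside the text visible
    print(i, repr(text))
```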
In the text of the <li> elements, several names are separated by newline characters. Keeping this in mind, you can extract the names into a single list. You can use code like this to generate the list.
The function finds the list of cars: it downloads the specific page and returns a list of strings, with one car name per string.
It raises an exception if it fails to retrieve the data from the URL.
The get_names() function downloads the page, selects the <li> elements, and iterates over them. To ensure there are no duplicate names, it adds each name to a Python set, converts the set into a list, and returns it.
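A sketch of get_names() along the lines described above. To keep it self-contained, the parsing is split into a hypothetical helper, parse_names(), and the download function is passed in as a fetch argument (for example simple_get); both of these, and the url parameter, are assumptions rather than part of the original script.

```python
from bs4 import BeautifulSoup


def parse_names(raw_html):
    """Return the unique car names found in the <li> elements of raw_html."""
    html = BeautifulSoup(raw_html, "html.parser")
    names = set()  # a set silently drops duplicate names
    for li in html.select("li"):
        # One <li> may hold several names separated by newlines
        for name in li.text.split("\n"):
            name = name.strip()
            if name:
                names.add(name)
    return list(names)


def get_names(url, fetch):
    """Download url with fetch (e.g. simple_get) and return the car names."""
    raw_html = fetch(url)
    if raw_html is None:
        raise RuntimeError(f"Error retrieving contents at {url}")
    return parse_names(raw_html)
```

Because the names pass through a set, the order of the returned list is not guaranteed; sort it if you need a stable order.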
Now we have a list of names, and the last thing to do is gather their views and followers. The code for getting the number of views is similar to the code we used to get the list of names. This function takes the name of a car and picks an integer value out of the web page.
For reference, you can inspect the example page with the browser's developer tools. There you will find that the text appears in an <a> element whose href attribute contains the substring 'latest-40'. You can start with the function as:
This function accepts the name of a car and returns the number of hits on that car's page. The hits received over the last 40 days are returned as an int.
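A sketch of such a function, under stated assumptions: the names parse_hits() and get_hits(), the fetch argument (for example simple_get) and the url_template parameter are hypothetical, and the count is assumed to appear as link text that may contain thousands separators such as "1,234".

```python
from bs4 import BeautifulSoup


def parse_hits(raw_html):
    """Return the hit count found in the 'latest-40' link, or None."""
    html = BeautifulSoup(raw_html, "html.parser")
    # Keep only <a> elements whose href contains the substring 'latest-40'
    hit_links = [a for a in html.select("a") if "latest-40" in a.get("href", "")]
    if not hit_links:
        return None
    link_text = hit_links[0].text.replace(",", "").strip()
    try:
        return int(link_text)
    except ValueError:
        # The link text was not a number
        return None


def get_hits(name, fetch, url_template):
    """Fetch the page for one car and return its hits as an int, or None."""
    raw_html = fetch(url_template.format(name))
    if raw_html is None:
        return None
    return parse_hits(raw_html)
```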
The last step is to handle simple errors in the retrieval of data. Extracting properly structured data from unstructured data can be messy, so it is wise to keep track of errors during retrieval. You can also print a message showing how many cars were left out of the ranking. You can write the code as follows:
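A minimal sketch of this final step. The function names rank_cars() and print_report() are hypothetical, and get_hits here is any callable that takes a car name and returns an int or None.

```python
def rank_cars(names, get_hits, top=5):
    """Return (top_results, missing_count) for the given car names."""
    found, missing = [], []
    for name in names:
        try:
            hits = get_hits(name)
        except Exception as e:
            # Keep going: one bad page should not stop the whole report
            print(f"Error encountered while processing {name}, skipping: {e}")
            hits = None
        if hits is None:
            missing.append(name)
        else:
            found.append((hits, name))

    found.sort(reverse=True)  # most hits first
    return found[:top], len(missing)


def print_report(top_results, missing_count):
    """Print the ranking and how many cars were left out of it."""
    print("The most popular cars are:")
    for hits, name in top_results:
        print(f"{name} with {hits} views")
    print(f"No results were found for {missing_count} cars on the list")
```

Counting the cars that were skipped makes it visible when the ranking is based on incomplete data.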
Everything is done; all that is left is to run the script and look at the report it produces.
Let us quickly review what we have completed. First, we created a list of car names. Second, we iterated over the list and looked up the number of hits, or popularity, of each name individually.
Third, we finished the script by sorting the car names by their number of views. After all this, run the script and review your output.
The output lists the top 5 cars that are most popular among people. We hope you have now learned how web scraping works and how it can be used with Python.