Secondary Data Collection Techniques — Web Scraping and Crawling
With the growth of big data, the amount of data available online is enormous. Imagine having to collect and store millions of records in a single file manually, by copying and pasting; it would be exhausting, wouldn't it? Web scraping can help us collect that much data, and very quickly, because a program does the work of data collection automatically.
So we can say that web scraping is the process of automatically extracting information and data from websites. The extracted data can be saved in whatever format we want, such as plain text, CSV, or JSON.
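As a small illustration of that last point, here is a minimal sketch of saving records that a scraper has already extracted into both CSV and JSON; the records and file names are made up for demonstration:

```python
import csv
import json

# Hypothetical records that a scraper might have extracted
records = [
    {"title": "Product A", "price": 15000},
    {"title": "Product B", "price": 22500},
]

# Save as CSV
with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(records)

# Save as JSON
with open("scraped_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)
```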
With web scraping, someone can easily collect customer data and then use it to decide on an appropriate marketing strategy. We can also use web scraping for brand monitoring, such as collecting reviews and public feedback about our brand and services. Furthermore, web scraping can be used to gather other data, such as competitor data for analysis; no matter the type of business, we will always need to see how our competitors are doing, and that insight helps us continuously improve our own business.
The Differences between Web Scraping and Web Crawling
Web scraping refers to the extraction of data from a website or web page; usually the data is extracted into a new file format, for example an Excel spreadsheet or a CSV file. Web scraping can also be done manually by parsing the page's HTML or XML, although in many cases automation tools are used to extract the data.
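To make the manual-parsing idea concrete, here is a minimal sketch using Python's built-in html.parser module (the article does not say which parser was used, so this is only one possible approach); it pulls every h2 heading out of a small HTML snippet:

```python
from html.parser import HTMLParser

# A minimal parser that collects the text of every <h2> heading on a page
class HeadingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading and data.strip():
            self.headings.append(data.strip())

sample_html = "<html><body><h2>First title</h2><p>text</p><h2>Second title</h2></body></html>"
parser = HeadingParser()
parser.feed(sample_html)
print(parser.headings)  # ['First title', 'Second title']
```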
Web crawling, meanwhile, is the process of using a bot or spider to read and store all the content on a website, typically for archiving or for indexing by search engines such as Bing or Google, which then add the content to their databases.
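The sketch below shows what such a spider does in miniature, assuming a placeholder start URL: it fetches a page, stores its HTML, follows the links it finds on the same domain, and repeats up to a page limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl that stays on the starting domain and stores page HTML."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    visited, store = set(), {}

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        store[url] = html  # archive the page content, as a search engine's crawler would
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain:
                queue.append(absolute)
    return store

# pages = crawl("https://example.com")  # placeholder URL, replace with a real site
```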
So, although web scraping and web crawling both refer to forms of data extraction, they have different purposes and are applied in different ways.
Examples of Web Crawling and Web Scraping Applications for Projects
In the Hydrococo customer segmentation project that I worked on during my internship at PT Kalbe Farma, we needed location data for Alfamart and Indomaret stores in Jember and East Jakarta. This data is available from open sources; my choice fell on Google Maps. I used a Chrome extension (Instant Data Scraper) to retrieve the data, then used Google Sheets to determine the coordinates of each location (longitude and latitude), and finally did data cleansing in Python.
The next step is to go to Google Maps and search for the target location; here I used "Indomaret Jakarta Timur" as the search query, so that all Indomaret stores in East Jakarta would be covered. Then activate the Instant Data Scraper extension, and once it finishes, download or copy the results it produced.
Next, to determine the coordinates of the Alfamart and Indomaret locations, we simply copy each address and then run the Geocode feature from the Add-ons menu in Google Sheets. After that, wait and let it work; each step is shown in the pictures below. Finally, the data is ready to use: copy the converted results and then perform data cleansing in Python according to the desired format, as sketched after this paragraph.
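The article does not show the exact cleansing steps, so the sketch below is only an assumption of what they might look like; the file and column names (Name, Address, Latitude, Longitude) are hypothetical and should be adjusted to match the sheet exported from Google Sheets.

```python
import pandas as pd

# Hypothetical export from Google Sheets; adjust the file and column names to your data
df = pd.read_csv("indomaret_alfamart_raw.csv")

# Keep only the columns needed for the segmentation analysis
df = df[["Name", "Address", "Latitude", "Longitude"]]

# Basic cleansing: trim whitespace, drop duplicates and rows without coordinates
df["Name"] = df["Name"].str.strip()
df["Address"] = df["Address"].str.strip()
df = df.drop_duplicates(subset=["Name", "Address"])

# Make sure the coordinates are numeric, then drop rows where conversion failed
df["Latitude"] = pd.to_numeric(df["Latitude"], errors="coerce")
df["Longitude"] = pd.to_numeric(df["Longitude"], errors="coerce")
df = df.dropna(subset=["Latitude", "Longitude"])

df.to_csv("indomaret_alfamart_clean.csv", index=False)
```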
Based on Picture 8, we want to retrieve the data table at that link. It presents hospital bed data with the attributes Number, Class, and Total for a specific hospital.
First, the basic concept of web scraping is that we send a request to the URL and the server processes it. If the server accepts the request, it returns a response containing the data, here as a list of objects, from which we extract the fields we want, for example titles and descriptions, which can later be converted into a dataframe. Below is the data entity in JSON format (Picture 9).
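In code, that request-and-response step might look like the sketch below; the URL is only a placeholder standing in for the endpoint shown in Picture 8, so replace it before running.

```python
import json
from urllib.request import urlopen

# Placeholder URL; replace it with the actual endpoint shown in Picture 8
url = "https://example.com/api/hospital-beds"

# Send the request; the server responds with JSON text
response = urlopen(url, timeout=10)
data = json.loads(response.read().decode("utf-8"))

# Inspect the structure before extracting the fields we need
print(type(data))
print(data)
```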
The libraries used include urllib and pandas. Use a for loop with an if condition as shown in the image below (Picture 11), then append each matching object to a list, and finally convert the list into a dataframe. The full process can be seen in Picture 10 through Picture 14.
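Since the pictures are not reproduced here, the following is a hedged sketch of that loop; the endpoint URL and the JSON field names ("class", "total") are assumptions, not the actual names from the project, and should be replaced with the real ones.

```python
import json
from urllib.request import urlopen

import pandas as pd

# Placeholder endpoint; the real URL is the one shown in Picture 8
url = "https://example.com/api/hospital-beds"
data = json.loads(urlopen(url, timeout=10).read().decode("utf-8"))

# Assumed structure: a list of records, each holding a bed class and a bed count.
# Loop over the records, keep only those with the fields we need (the for/if step),
# and collect them into a list.
rows = []
for number, item in enumerate(data, start=1):
    if "class" in item and "total" in item:
        rows.append({"Number": number, "Class": item["class"], "Total": item["total"]})

# Convert the list into a dataframe with the attributes Number, Class, and Total
df = pd.DataFrame(rows)
print(df.head())
```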
That is all I can share about web scraping and crawling; thank you for reading this article.
If you want to see the repository of the web crawling and web scraping work I did for this project, click the Google Drive link below. You can download it or just view it.