Web scraping for 400,000 product IDs provided by the client
The objective of the project was to scrape data for the product IDs and links provided by the client. The scraped data had to be stored in a database, compared across various factors, and presented to the client as a clear analytical view.
- The client gave us a table containing SKUs, product IDs, or EANs (European Article Numbers).
- For repetitive tasks, a PHP program was scheduled with cron to attach links to the specific IDs given by the client.
- Each URL was then passed through PHP cURL and a proxy server so that the HTML could be fetched continuously without disruption.
- The fetched HTML was parsed with PHP Simple HTML DOM to extract the required data.
- The extracted data was added to a spreadsheet-style table in the database with a status value of 0 or 1 (0 indicates the task is not completed; 1 indicates it is completed).
- Another program was set up as a cron job to make the scraping process faster. Forking was used to spawn subprocesses from a parent process so that both programs could run simultaneously.
- If a link was dead or returned an error, the program skipped to the next link after a set amount of time. Meanwhile, a subprocess kept running and rechecked the failed link later to keep the process as efficient as possible.
- An interface was built in .NET so that we could review the progress of the scraping, along with an analytical graph for comparing the products and checking the data for flaws.
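
The status-flag workflow described in the steps above can be illustrated with a minimal sketch. It is written in Python rather than the project's PHP and is not the project's actual code. The names URL_TEMPLATE, init_db, fetch and run_pass, the jobs table layout, the example.com URL pattern and the use of SQLite are all assumptions made for illustration. The sketch also uses a thread pool in place of forked subprocesses and leaves out the proxy server and the HTML parsing step.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical URL pattern; the real one depends on the site being scraped.
URL_TEMPLATE = "https://shop.example.com/product/{product_id}"


def init_db(conn, product_ids):
    """Create the job table and queue one row per product ID with status 0."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "product_id TEXT PRIMARY KEY, url TEXT, html TEXT, "
        "status INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (product_id, url) VALUES (?, ?)",
        [(pid, URL_TEMPLATE.format(product_id=pid)) for pid in product_ids],
    )
    conn.commit()


def fetch(url, timeout=10):
    """Return the page HTML, or None if the link is dead or times out."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except Exception:
        # Any failure is treated as a dead link; the row stays at status 0.
        return None


def run_pass(conn, fetcher=fetch, workers=4):
    """Fetch all pending URLs in parallel and return how many are still pending."""
    pending = conn.execute(
        "SELECT product_id, url FROM jobs WHERE status = 0"
    ).fetchall()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetcher, [url for _, url in pending]))
    for (pid, _), html in zip(pending, pages):
        if html is not None:  # failed links keep status 0 and are retried later
            conn.execute(
                "UPDATE jobs SET html = ?, status = 1 WHERE product_id = ?",
                (html, pid),
            )
    conn.commit()
    return conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE status = 0"
    ).fetchone()[0]
```

To use it, open a connection with sqlite3.connect, call init_db once with the client's IDs, and then call run_pass from a scheduled job until it returns 0. Rows that failed in one pass stay at status 0 and are picked up again in the next pass, which mirrors the skip-and-recheck behaviour described in the list.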