What we did for the project
• Our developer manually feeds the system with the links that need to be scraped.
• However, no two websites share the same structure, so a specification for each website has to be entered into the system manually so that it can fetch data accordingly (a configuration sketch follows this list).
• Websites have to be checked manually to see whether they allow scraping (see the robots.txt check sketched after this list).
• For repeated scraping of a page, a program was written so that unchecked links, or links that returned errors, are re-crawled after a set interval (a retry sketch follows this list).
• The system can crawl and scrape multiple websites at once; in our case it had to crawl 16 websites simultaneously, according to their ranking.
• Ranking can be set in the pipeline of the system.
• Developed a parser that breaks down the scraped data (see the parsing and storage sketch after this list).
• If there is an issue with a link or with crawling itself, a notification is sent and the crawling process restarts from where it stopped.
• Forking was used to spawn subprocesses from the parent process (see the sketch after this list).
• An interface was created to review the data collected.
• The acquired data is added to a database (covered in the same parsing and storage sketch).
• Developed a program that sends mail to the people concerned once a crawling run has ended (see the notification sketch after this list).
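
Below is a minimal sketch of what a per-site specification could look like, assuming CSS selectors are used to locate the fields; the site names, URLs, and selector values are illustrative, not the project's actual configuration.

```python
# Hypothetical per-site scrape specifications: each site gets its own
# listing URL, an item selector, and field selectors for the parser.
SITE_SPECS = {
    "example-news.com": {
        "list_url": "https://example-news.com/articles",
        "item_selector": "article.post",
        "fields": {
            "title": "h2 a",
            "date": "time.published",
            "body": "div.article-body",
        },
    },
    "another-site.org": {
        "list_url": "https://another-site.org/feed",
        "item_selector": "li.entry",
        "fields": {
            "title": "a.headline",
            "date": "span.date",
            "body": "p.summary",
        },
    },
}
```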
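Whether a site permits scraping can be checked against its robots.txt. Here is a minimal sketch using Python's standard urllib.robotparser; the URL and user agent are placeholders.

```python
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Check the site's robots.txt before scraping the given URL."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetch and parse robots.txt
    return parser.can_fetch(user_agent, url)

# Example usage with a placeholder URL
print(is_allowed("https://example.com/articles/1"))
```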
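A sketch of the interval-based retry for unchecked or errored links, assuming a plain HTTP fetch and an hourly retry pass; the actual crawl logic and interval in the project may differ.

```python
import time
import urllib.request
import urllib.error

RETRY_INTERVAL = 3600  # seconds between retry passes (illustrative value)

def crawl(url: str) -> bool:
    """Attempt one fetch; return True on success, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def retry_failed(pending: list[str]) -> None:
    """Keep re-crawling unchecked or errored links after a fixed interval."""
    while pending:
        pending = [url for url in pending if not crawl(url)]
        if pending:
            time.sleep(RETRY_INTERVAL)  # wait before the next pass
```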
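A combined sketch of the parser and the database step, assuming BeautifulSoup is used to break the scraped HTML down and SQLite is used for storage; the selectors, table name, and file path are illustrative.

```python
import sqlite3
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def parse_items(html: str) -> list[tuple[str, str]]:
    """Break a scraped page into (title, body) rows using illustrative selectors."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.select("article.post"):
        title = article.select_one("h2").get_text(strip=True)
        body = article.select_one("div.article-body").get_text(strip=True)
        rows.append((title, body))
    return rows

def store(rows: list[tuple[str, str]], db_path: str = "scraped.db") -> None:
    """Append parsed rows to a local SQLite table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT, body TEXT)")
        conn.executemany("INSERT INTO articles (title, body) VALUES (?, ?)", rows)
```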
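A sketch of the forking step using Python's multiprocessing, which uses the fork start method on Linux by default: the parent spawns one child process per website and waits for all of them to finish. The site list and per-site work are placeholders.

```python
from multiprocessing import Process

def crawl_site(site: str) -> None:
    """Child-process entry point: crawl one site end to end (placeholder)."""
    print(f"crawling {site}")

def crawl_all(sites: list[str]) -> None:
    """Parent process spawns one child per site and joins them all."""
    children = [Process(target=crawl_site, args=(site,)) for site in sites]
    for child in children:
        child.start()
    for child in children:
        child.join()

if __name__ == "__main__":
    crawl_all(["site-a.com", "site-b.com", "site-c.com"])
```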
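A sketch of the completion notification using Python's smtplib; the SMTP host and addresses are placeholders, not the project's actual mail settings.

```python
import smtplib
from email.message import EmailMessage

def notify_completion(recipients: list[str], job_name: str) -> None:
    """Mail the people concerned once a crawling run has finished."""
    msg = EmailMessage()
    msg["Subject"] = f"Scraping job finished: {job_name}"
    msg["From"] = "scraper@example.com"          # placeholder sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(f"The job '{job_name}' has completed.")
    with smtplib.SMTP("smtp.example.com") as server:  # placeholder SMTP host
        server.send_message(msg)
```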