Web scraping is one of the most effective ways to resolve the “cold start” problem in big data system development. Written in Python, Scrapy is the Swiss Army knife for this task.
Workflow
With Scrapy handling the procedural plumbing, our web scraping code can focus entirely on the parsing logic. The resulting code reads somewhat like asynchronous callbacks in JavaScript.
The parsing workflow is dynamic, branching according to the actual page content. Scrapy automatically schedules the order of HTTP requests from the to-dos in its queue, subject to user-configured rate limits and concurrency settings.
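A minimal sketch of this callback style, using the quotes.toscrape.com demo site from the Scrapy tutorial (the spider name and CSS selectors are illustrative, not part of any particular project):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each author link found on the page branches into its own callback.
        for href in response.css(".author + a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_author)

        # Pagination: Scrapy queues this request and schedules it according
        # to the configured rate limits and concurrency settings.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_author(self, response):
        # A leaf of the branch: yielding data instead of a Request ends it.
        yield {
            "name": response.css("h3.author-title::text").get(),
            "born": response.css("span.author-born-date::text").get(),
        }
```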
Storage
Scrapy offers an interface for defining data types, called “Items,” which serve as the endpoint of a scraping workflow: once the parsing logic returns an Item, no further HTTP requests are spawned on that branch.
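As a sketch, an Item for the author data scraped above might look like this (the AuthorItem name and its fields are hypothetical):

```python
import scrapy


class AuthorItem(scrapy.Item):
    # Declaring fields up front gives downstream consumers a stable schema.
    name = scrapy.Field()
    born = scrapy.Field()
```

A callback would then yield `AuthorItem(name=..., born=...)` instead of a plain dict, marking the end of that branch.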
Using Items, Scrapy separates out the “last-mile” logic that stores Items in user-defined backends. In Scrapy, this is the role of a “Pipeline,” which processes Items after they are yielded. Pipelines ease the hand-off between the scraper and the data consumer, which are eventually unified into a self-serve big data system.
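A minimal Pipeline sketch that appends each Item to a JSON Lines file (the class name, file name, and priority value are illustrative, not Scrapy defaults):

```python
import json


class JsonLinesPipeline:
    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the Item and hand it back so later pipelines can run too.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

The pipeline is enabled in settings.py, where the number controls the order in which pipelines run:

```python
ITEM_PIPELINES = {
    "myproject.pipelines.JsonLinesPipeline": 300,  # "myproject" is a placeholder
}
```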
Conformity
Web scraping can sound dirty to some people because it is usually the least preferred way to retrieve data from a provider. For this reason, Scrapy ships with a built-in “robots.txt” check to make sure our scraping conforms to the data provider’s robot policy and avoids unintentionally triggered legal issues.
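In practice this amounts to one setting; the politeness knobs below are optional example values that complement it:

```python
# settings.py
ROBOTSTXT_OBEY = True        # honor the provider's robots.txt before each request

# Optional politeness settings (example values):
DOWNLOAD_DELAY = 1.0         # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # adapt the request rate to server responsiveness
```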
