Web Scraping for Big Data with Scrapy

Web scraping is an effective way to address the “cold start” problem in big data system development. Written in Python, Scrapy is the Swiss Army knife for this task.

Workflow

With Scrapy handling the procedural part, our web scraping code can focus entirely on the parsing logic. The resulting code reads much like asynchronous callbacks in JavaScript.
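A minimal sketch of this callback style (the spider name, start URL, and CSS selectors below are assumptions for illustration):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Scrapy calls this back with each downloaded response,
        # much like an asynchronous callback in JavaScript.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }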

The parsing workflow is dynamic, branching with the actual page content. Scrapy automatically schedules the pending HTTP requests in its queue, respecting the user's configured rate limits and concurrency settings.
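A sketch of such branching, with hypothetical selectors and illustrative rate settings:

import scrapy

class CatalogSpider(scrapy.Spider):
    name = "catalog"
    start_urls = ["https://example.com/catalog"]

    # User configuration on rates and limits; Scrapy's scheduler honours these.
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,                # pause between requests
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # cap parallel requests per domain
    }

    def parse(self, response):
        # Branch per product link, handled by a different callback.
        for href in response.css("a.product::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)
        # Pagination branches back into this same callback.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {"title": response.css("h1::text").get()}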

Storage

Scrapy offers an interface for defining data types, called “Items,” which serve as the endpoint of a scraping workflow: once the scraping logic yields an Item, no further HTTP requests are spawned on that branch.
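A sketch of an Item definition (the field names are assumptions for illustration):

import scrapy

class ProductItem(scrapy.Item):
    # Declared fields form the schema of each scraped record.
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

A spider callback would then yield ProductItem(title=..., price=..., url=response.url), and that branch of the workflow ends there.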

Building on Items, Scrapy separates out the “last-mile” logic that stores Items in user-defined backends: this is what Scrapy calls a “Pipeline,” a component that processes Items. Pipelines ease the hand-off between the scraper and the data consumer, which are eventually unified into a self-serve big data system.
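A minimal sketch of such a Pipeline, assuming a line-delimited JSON file as the user-defined backend (the file name and module path are hypothetical):

import json

class JsonLinesPipeline:
    def open_spider(self, spider):
        # Open the storage backend once when the spider starts.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Called for every Item the spider yields.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

The pipeline is enabled through the ITEM_PIPELINES setting, e.g. ITEM_PIPELINES = {"myproject.pipelines.JsonLinesPipeline": 300}.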

Conformity

Web scraping can sound dirty to some people because it is often a provider's least-preferred channel for data retrieval. For this reason, Scrapy integrates a “robots.txt” check to make sure our scraping conforms to the data provider's robot policy and to avoid unintentionally triggering legal issues.
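In practice this is a single project setting; with it enabled, Scrapy's RobotsTxtMiddleware fetches each site's robots.txt and drops requests the policy disallows:

# settings.py
# Respect each site's robots.txt before downloading.
ROBOTSTXT_OBEY = True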

Published by Ling YANG

Lead consultant at Studio theYANG, an independent web software consulting studio in Montreal, Canada, focused on the maintenance and support of Python and Linux systems.
