Web Scraping with Scrapy: Techniques and Best Practices

Introduction

Web scraping has become an essential tool for various purposes, including data collection, research, and analytics. One popular Python library for web scraping is Scrapy, known for its flexibility and efficiency in extracting data from websites. In this blog post, we'll explore the reasons to choose Scrapy, different techniques, and best practices to make the most out of this powerful tool.

Why Scrapy?

Scrapy is a robust and extensible framework designed for crawling and scraping websites. Here are some key reasons why Scrapy stands out:

  • Asynchronous Processing: Scrapy is built on top of the Twisted asynchronous networking library, allowing it to handle multiple requests concurrently. This leads to faster scraping and improved performance.

  • Modular Design: Scrapy follows a modular architecture, making it easy to extend and customize. It provides a clear separation between the spider logic, item processing, and storage.

  • Built-in Support for Protocols: Scrapy supports HTTP, HTTPS, and FTP protocols out of the box. It also comes with convenient handling of cookies, sessions, and user agents, as the short settings sketch below illustrates.
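
For example, a few of these knobs can be tuned in a project's `settings.py`. The project name and values below are illustrative, not recommendations:

```python
# settings.py -- illustrative values only; tune them for your own project
BOT_NAME = "bookscraper"            # hypothetical project name

# Twisted lets Scrapy keep many requests in flight at once
CONCURRENT_REQUESTS = 16            # total concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain limit
DOWNLOAD_DELAY = 0.25               # be polite to the target site

# Cookie and user-agent handling come built in
COOKIES_ENABLED = True
USER_AGENT = "bookscraper (+https://example.com/contact)"
```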

Different Techniques and Tricks for Using Scrapy

  1. Simple Spider

    Let's start with a basic example. The following Scrapy spider extracts the title and price of books from the website http://books.toscrape.com and follows pagination links.

    
    ```python
    import scrapy


    class BookSpider(scrapy.Spider):
        name = 'bookspider'
        start_urls = ["http://books.toscrape.com"]

        def parse(self, response):
            self.log(f"I just visited {response.url}")

            for article in response.css('article.product_pod'):
                yield {
                    'title': article.css("h3 > a::attr(title)").extract_first(),
                    'price': article.css(".price_color::text").extract_first()
                }

            next_page_url = response.css("li.next > a::attr(href)").get()
            if next_page_url:
                yield response.follow(url=next_page_url, callback=self.parse)
    ```

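    In a standard Scrapy project, a spider like this is usually scaffolded and run with the `scrapy` CLI. The project and spider names below are just examples:

    ```bash
    scrapy startproject bookscraper                  # create a new project (name is illustrative)
    cd bookscraper
    scrapy genspider bookspider books.toscrape.com   # scaffold the spider
    scrapy crawl bookspider                          # run it
    ```
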
  2. Data Dictionary

    To structure the scraped data, we can use a data dictionary. The following code defines a Scrapy Item class to store information about books.

    ```python
    import scrapy


    class BookDataItem(scrapy.Item):
        item_number = scrapy.Field()
        title = scrapy.Field()
        price = scrapy.Field()
        stars = scrapy.Field()
        thumbnail_path = scrapy.Field()
        detailed_book_url = scrapy.Field()
    ```

    And here's the spider using the defined Item class:

    ```python
    import scrapy
    from book_data_item.items import BookDataItem
    class BookDataItemSpider(scrapy.Spider):
        name = 'book_data_item_spider'
        allowed_domains = ['books.toscrape.com']
        count = 0

        def start_requests(self):
            start_url = 'http://books.toscrape.com/'
            yield scrapy.Request(start_url, self.parse)

        def parse(self, response):
            self.log(f"I just visited {response.url}")

            for article in response.css("article.product_pod"):

                item = BookDataItem()
                # increment the class-level counter shared across all pages
                BookDataItemSpider.count += 1

                item["item_number"] = BookDataItemSpider.count
                item["title"] = article.css("h3 > a::attr(title)").get()
                item["price"] = article.css("p.price_color::text").get()
                item["stars"] = article.css("p::attr(class)").get().split(" ")[-1]
                item["thumbnail_path"] = article.css("div > a > img::attr(src)").get()
                item["detailed_book_url"] = article.css("div > a::attr(href)").get()

                yield item

            next_page_url = response.css("li.next > a::attr(href)").get()
            if next_page_url:
                yield response.follow(url=next_page_url, callback=self.parse)
    ```

  3. Item Loader and Preprocessing

    Scrapy supports more complex preprocessing through Item Loaders. The example below shows an Item class whose fields declare input and output processors that clean the data as it is loaded.

    ```python
    import scrapy
    from scrapy.loader.processors import TakeFirst


    def remove_star_text(input_string):
        """Simple custom preprocessing method."""
        if "star-rating" in input_string[0]:
            return input_string[0].split(" ")[-1]
        return input_string


    class BookDataItemLoaderItem(scrapy.Item):
        """
        When using an ItemLoader we can define an input and an output processor
        per field. Since ItemLoader collects lists by default, we take the first
        element of each list -- the data we want -- with TakeFirst. The stars
        field also gets custom preprocessing via remove_star_text.
        """
        item_number = scrapy.Field(output_processor=TakeFirst())
        title = scrapy.Field(output_processor=TakeFirst())
        price = scrapy.Field(output_processor=TakeFirst())
        stars = scrapy.Field(input_processor=remove_star_text,
                             output_processor=TakeFirst())
        thumbnail_path = scrapy.Field(output_processor=TakeFirst())
        detailed_book_url = scrapy.Field(output_processor=TakeFirst())
        image_url = scrapy.Field(output_processor=TakeFirst())
        product_description = scrapy.Field(output_processor=TakeFirst())
    ```

    Alternatively, the processors can be declared on a dedicated `ItemLoader` subclass (for example via a `default_output_processor` attribute) rather than repeated on every field. A short sketch of a spider feeding this item through an `ItemLoader` follows.
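    This is a minimal sketch: the spider name and selectors are assumed from the earlier examples, and `BookDataItemLoaderItem` is assumed to live in your project's `items` module.

    ```python
    import scrapy
    from scrapy.loader import ItemLoader

    # assumes the item class defined above is in your project's items module
    from book_data_item.items import BookDataItemLoaderItem


    class BookLoaderSpider(scrapy.Spider):
        name = "book_loader_spider"  # hypothetical spider name
        start_urls = ["http://books.toscrape.com/"]

        def parse(self, response):
            for article in response.css("article.product_pod"):
                loader = ItemLoader(item=BookDataItemLoaderItem(), selector=article)
                loader.add_css("title", "h3 > a::attr(title)")
                loader.add_css("price", "p.price_color::text")
                loader.add_css("stars", "p::attr(class)")  # cleaned by remove_star_text
                loader.add_css("thumbnail_path", "div > a > img::attr(src)")
                loader.add_css("detailed_book_url", "div > a::attr(href)")
                yield loader.load_item()

            next_page_url = response.css("li.next > a::attr(href)").get()
            if next_page_url:
                yield response.follow(next_page_url, callback=self.parse)
    ```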

4. **Pipelines**

    Scrapy pipelines are components that post-process each item after a spider has scraped it.

    According to the Scrapy documentation:

    > After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

    Typical use cases for pipelines include the following (a minimal pipeline sketch comes after the list):

    * **Cleansing HTML data**

    * **Validating scraped data (checking that the items contain certain fields)**

    * **Checking for duplicates (and dropping them)**

    * **Storing the scraped item in a database**
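
    For illustration, here is a minimal sketch of a pipeline that drops books with no price and de-duplicates by title. The class name is hypothetical and the field names follow the earlier examples:

    ```python
    # pipelines.py -- a minimal sketch, not a definitive implementation
    from scrapy.exceptions import DropItem


    class BookCleaningPipeline:
        def __init__(self):
            self.seen_titles = set()

        def process_item(self, item, spider):
            # validate: every book must have a price
            if not item.get("price"):
                raise DropItem(f"Missing price in {item!r}")
            # drop duplicates by title
            if item["title"] in self.seen_titles:
                raise DropItem(f"Duplicate book: {item['title']}")
            self.seen_titles.add(item["title"])
            return item
    ```

    The pipeline is then enabled in `settings.py`, for example `ITEM_PIPELINES = {"book_data_item.pipelines.BookCleaningPipeline": 300}` (the module path depends on your project name).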

5. **Zyte Cloud**

    Zyte Cloud offers a scalable cloud platform for web scraping, making it easier to run and manage your Scrapy spiders. Here's a brief guide on how to deploy a project (the commands are consolidated in the sketch after this list):

    * Create a Zyte Cloud account

    * Copy your Zyte API key

    * `pip install shub`

    * `shub login`

    * `shub deploy <project-id>`
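
    Put together, the deployment boils down to a few shell commands (the project ID is a placeholder for the one shown in your Zyte dashboard):

    ```bash
    pip install shub          # install the deploy tool
    shub login                # paste your Zyte API key when prompted
    shub deploy <project-id>  # deploy the current Scrapy project
    ```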

6. **Run Locally**

    To run the spider locally, use one of the following commands, where `-o` specifies the output file:

    ```bash
    scrapy crawl <spider_name> -o output.csv
    # or
    scrapy crawl <spider_name> -o output.json
    ```

Conclusion

Scrapy provides a powerful and efficient solution for web scraping in Python. Whether you are a beginner or an experienced developer, the techniques and best practices discussed in this post can help you make the most out of Scrapy for your data extraction needs. Explore its features, experiment with different strategies, and tailor your scraping approach based on the requirements of your projects. Happy scraping!
