Distributed Grabber Development for Tmall.com

Technologies:

Celery
PostgreSQL
Auto Testing
Django

Domains:

E-commerce
Internationalization

Project Goals

The project aimed to build a geographically distributed parsing solution for Tmall, a leading Chinese e-commerce platform. The system had to parse millions of products across more than 2,000 categories and translate the product data into Russian for the target market as it was parsed.

Functional Capabilities

  • Distributed Parsing Network: The solution involved the development of a distributed parser that used a geographically distributed computer network to extract data from Tmall. A central grabbing coordinator controlled the parsing operations, distributing tasks to individual nodes for efficient processing.
  • Proxy Rotation for Distributed Parsing: Proxy rotation was implemented so that the parser could run without being blocked by Tmall's security systems. Products were fetched at intervals from different geographic locations, which made the system more robust against blocking as it scaled.
  • Real-Time Translation to Russian: One of the unique features of the grabber was the ability to perform real-time translation of product data into Russian while simultaneously parsing the content. This feature added significant value for the client, who aimed to target the Russian-speaking market with localized product information.
  • Task Distribution Among Nodes: Pocket-sized computers were procured and used as nodes in the distributed parsing system. These nodes were distributed among friends and acquaintances, who connected them to the network and participated in the parsing process. Each node parsed a specific group of products, and the extracted data was consolidated into a centralized database.
  • Scalable and Flexible Architecture: The system was designed to be scalable, allowing additional nodes to be easily added to the network to handle increased parsing loads. The flexible architecture ensured that parsing tasks could be redistributed if certain nodes went offline.
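
The task distribution and redistribution described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the `Coordinator` class, the node names, and the round-robin and least-loaded policies are hypothetical and are not taken from the project's code.

```python
from collections import defaultdict


class Coordinator:
    """Toy coordinator that tracks which node owns which product group."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.assignments = defaultdict(list)  # node name -> product groups

    def assign(self, groups):
        # Round-robin keeps the per-node load roughly even.
        for i, group in enumerate(groups):
            node = self.nodes[i % len(self.nodes)]
            self.assignments[node].append(group)

    def add_node(self, node):
        # A new node starts empty and receives work on the next assign().
        self.nodes.append(node)
        self.assignments[node] = []

    def mark_offline(self, node):
        # Hand the offline node's groups to the least-loaded survivors.
        survivors = [n for n in self.nodes if n != node]
        if not survivors:
            raise RuntimeError("no nodes left to take over the work")
        self.nodes = survivors
        for group in self.assignments.pop(node, []):
            target = min(self.nodes, key=lambda n: len(self.assignments[n]))
            self.assignments[target].append(group)


coordinator = Coordinator(["node-a", "node-b", "node-c"])
coordinator.assign([f"category-{i}" for i in range(6)])
coordinator.mark_offline("node-b")
```

After `mark_offline`, every group is still owned by exactly one of the remaining nodes, which is the property the flexible architecture relies on.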

Solution Concept

The distributed grabber was developed to efficiently handle the immense volume of products listed on Tmall. Given the scale of Tmall's platform, parsing all product categories and items from a single machine was impractical. Instead, a distributed approach was adopted, involving multiple nodes connected through a central coordinator.

The central grabbing coordinator was responsible for managing the nodes and distributing tasks to them. Each node worked independently to parse assigned product groups, rotating proxies to avoid detection and ensure continuous operation. The use of pocket-sized computers as parsing nodes allowed the distributed network to be easily scaled by adding more devices to the system.
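
A minimal sketch of per-node proxy rotation with a pause between fetches might look as follows. The `ProxyRotator` class, its parameters, and the proxy addresses are assumptions made for illustration; the source does not describe how rotation was actually implemented.

```python
import itertools
import time


class ProxyRotator:
    """Cycles through proxies and enforces a minimum delay between fetches."""

    def __init__(self, proxies, min_interval=0.0,
                 clock=time.monotonic, sleep=time.sleep):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_fetch = None

    def next_proxy(self):
        # Wait until min_interval has passed since the previous request.
        if self._last_fetch is not None:
            remaining = self._min_interval - (self._clock() - self._last_fetch)
            if remaining > 0:
                self._sleep(remaining)
        self._last_fetch = self._clock()
        return next(self._cycle)


# Hypothetical proxy addresses, for illustration only.
rotator = ProxyRotator([
    "http://proxy-1.example:8080",
    "http://proxy-2.example:8080",
    "http://proxy-3.example:8080",
])
first_four = [rotator.next_proxy() for _ in range(4)]
```

Each call returns the next proxy in the cycle, so the fourth request reuses the first proxy. A node would pass the returned address to whatever HTTP client it uses.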

The backend of the grabber was built using Python, with Django providing a robust framework for managing data and developing core functionalities. Celery was used for task management, distributing parsing jobs among the nodes, while RabbitMQ and pika handled communication between the nodes and the coordinator, ensuring smooth coordination of parsing activities. PostgreSQL was used as the database to store the parsed product information, providing a reliable and scalable storage solution.
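
To make the coordinator-to-node communication more concrete, here is a sketch of how a parsing job could be serialized into a message body before being sent through the broker. The `ParseTask` class and its fields are hypothetical; the source does not document the real message format, and the sketch uses only the standard library rather than Celery or pika.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ParseTask:
    category_id: str          # assumed field: category a node should parse
    page: int                 # assumed field: listing page within the category
    target_lang: str = "ru"   # translation target named in the project goals

    def to_body(self) -> bytes:
        # Message brokers carry opaque bytes, so encode the task as UTF-8 JSON.
        return json.dumps(asdict(self), ensure_ascii=False).encode("utf-8")

    @classmethod
    def from_body(cls, body: bytes) -> "ParseTask":
        return cls(**json.loads(body.decode("utf-8")))


task = ParseTask(category_id="example-category", page=3)
restored = ParseTask.from_body(task.to_body())
```

The round trip returns an equal object, so a node can rebuild exactly the task the coordinator published.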

Real-time translation was implemented to translate product data from Chinese to Russian, making it accessible to the targeted Russian-speaking market. This feature was integrated directly into the parsing process, ensuring that the translated data was available immediately for further processing and storage.
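
The idea of translating inside the parsing pipeline can be sketched as a generator that enriches each parsed record before it is stored. The field names and the `stub_translate` function are assumptions; the source does not say which translation service was used, so a tiny glossary lookup stands in for it here.

```python
TRANSLATABLE_FIELDS = ("title", "description")  # assumed field names


def stub_translate(text):
    # Stand-in for a real translation backend, which the source does not name.
    glossary = {"连衣裙": "платье"}
    return glossary.get(text, text)


def translate_records(records, translate):
    """Yield each parsed record with Russian fields added next to the originals."""
    for record in records:
        translated = dict(record)
        for field in TRANSLATABLE_FIELDS:
            if record.get(field):
                translated[f"{field}_ru"] = translate(record[field])
        yield translated


parsed = [{"title": "连衣裙", "price": "99.00"}]
localized = list(translate_records(parsed, stub_translate))
```

Because the translation step is passed in as a function, the stub can be replaced by a real client without changing the pipeline, and the original Chinese text is kept alongside the Russian version.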

Results

  • Efficient Parsing of Large Volumes of Data: The distributed parsing network parsed millions of product listings across more than 2,000 categories and stored the extracted data in a centralized database.
  • Real-Time Translation for Market Localization: The real-time translation feature allowed product data to be translated into Russian while being parsed, adding significant value for the client by localizing the content for the Russian-speaking audience.
  • Scalable and Cost-Effective Solution: The use of pocket-sized computers as parsing nodes allowed for a scalable and cost-effective solution. New nodes could be easily added, and the distributed nature of the system ensured that parsing could continue even if individual nodes experienced issues.
  • Resource Sharing for Optimized Parsing: Collaboration with friends and acquaintances helped optimize the parsing process by sharing the workload across multiple nodes distributed in different geographic locations, minimizing the chances of detection and blocking by Tmall's security measures.

Technologies and Architecture

  • Backend Development:
    • Python: Used for the core parsing logic, data processing, and integration of the real-time translation feature.
    • Django: Provided a stable backend framework for managing parsed data and building core features.
  • Task Management and Communication:
    • Celery: Employed for task management, allowing parsing tasks to be distributed among the nodes.
    • RabbitMQ and pika: Used for managing the communication between the central coordinator and the distributed nodes, ensuring smooth and efficient task distribution.
  • Database Management:
    • PostgreSQL: Used as the centralized database to store parsed product data, ensuring data integrity and scalability.
  • Infrastructure and System Design:
    • Distributed Nodes: Pocket-sized computers were used as distributed nodes for parsing, allowing for an easily scalable system. These nodes were connected through a central coordinator, which managed task distribution and ensured efficient parsing operations.
    • Proxy Rotation: Proxies were rotated so that the distributed nodes could parse product data from Tmall without being blocked, which kept the system running without interruption.
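
Consolidating results from many nodes into the centralized database usually means the same product can arrive more than once. The sketch below shows one way to handle that with an upsert. It is an assumption-laden illustration: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server, and the `product` table and its columns are hypothetical.

```python
import sqlite3

# INSERT ... ON CONFLICT ... DO UPDATE is supported by both SQLite (3.24+)
# and PostgreSQL; only the placeholder style differs between drivers.
UPSERT = """
INSERT INTO product (product_id, title_ru, price)
VALUES (?, ?, ?)
ON CONFLICT (product_id) DO UPDATE SET
    title_ru = excluded.title_ru,
    price = excluded.price
"""


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product ("
        "product_id TEXT PRIMARY KEY, title_ru TEXT, price TEXT)"
    )
    return conn


def consolidate(conn, rows):
    # One transaction per batch: either the whole batch lands or none of it.
    with conn:
        conn.executemany(UPSERT, rows)


conn = open_store()
consolidate(conn, [("1", "платье", "99.00"), ("2", "куртка", "250.00")])
consolidate(conn, [("1", "платье", "89.00")])  # a later re-parse of product 1
stored = conn.execute(
    "SELECT product_id, price FROM product ORDER BY product_id"
).fetchall()
```

The second batch updates the price of product 1 instead of creating a duplicate row, so repeated parses from different nodes converge on one record per product.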

Use Cases

  • Data Extraction for E-commerce Expansion: The system allowed the client to efficiently extract product data from Tmall, including detailed product descriptions, images, and pricing, enabling them to expand their e-commerce offerings with localized content for the Russian market.
  • Market Localization: The real-time translation feature provided localized product information for the Russian-speaking audience, improving accessibility and increasing the potential market reach for the extracted products.
  • Scalable Data Processing: The distributed nature of the grabber allowed the client to scale the parsing operations easily by adding new nodes, ensuring that the system could handle increased data extraction requirements as needed.

Integration and Development Process

  • Requirements Gathering: The project began with gathering requirements from the client to understand the scope of data extraction needed, the target market, and the need for real-time translation.
  • Architecture Design and Prototyping: The system architecture was designed to be distributed and scalable, with a central grabbing coordinator managing multiple parsing nodes. Prototyping helped validate the feasibility of the distributed approach.
  • Team Formation and Resource Allocation: A team of developers and system architects was formed to work on the core components of the distributed grabber. Pocket-sized computers were procured, and friends and acquaintances were engaged to help set up and run the nodes.
  • Implementation and Testing: The parsing logic was implemented in Python and the backend was developed with Django. The system was then tested to confirm that distributed parsing worked efficiently and that the real-time translation feature met the client's requirements.

Client Benefits

  • Scalable and Efficient Data Extraction: The distributed grabber provided a scalable and efficient solution for parsing a vast database of products, reducing the time needed to gather and process product information.
  • Localized Content for Target Market: The real-time translation feature allowed the client to target the Russian-speaking market effectively, providing localized content that increased accessibility and customer engagement.
  • Cost-Effective Distributed Parsing: The use of pocket-sized computers for parsing made the solution cost-effective while still ensuring that the system could handle large volumes of data.
  • Collaborative Resource Utilization: Collaboration with friends and acquaintances let the client use shared resources for parsing, which kept costs down and spread the workload across nodes in different locations.