How a team created a web scraper with a service-oriented architecture using Gearman, RabbitMQ and MongoDB.

Case study: Distributed data mining with PHP

About two years ago, our team started work on an interesting analytics project whose general technical goal was processing data from any web page. We had to get the data, process it with parsers and normalizers, and only then could we carry out analysis on the collected statistics.

As long as there isn't much data and no tests are run against real “wild” data, there are no problems with nonstandard situations either. At the beginning of the project we didn't have problems like this – but that didn't last long.

The project was growing, the amount of data was increasing, and we came to understand the problem – we had violated the first principle of S.O.L.I.D.: our classes had too many responsibilities. For example, a single method might fetch a page from the Internet and extract data from it. For simple tasks that is a good, simple solution, but as we needed to aggregate more and more information, it became a nightmare.

“Time for refactoring,” I said. And after a little discussion the solution was found: Service-Oriented Architecture (SOA). We identified the general parts of our mining mechanism:

  1. Fetching pages, data from any API and data from structured files.
  2. Parsing the fetched results into structured data.
  3. Analyzing the data and creating recommendations.

We decided to separate these into three independent services, so we ended up with a Fetch Service, a Parse Service and an Analyze Service. But how would they communicate with each other?

To resolve this question we looked at the pipeline concept. It seemed simple and good enough: “wild” data goes in as input, and the results of analysis (recommendations, reports, graphs) come out as output.

It doesn't matter what goes on inside the system; the only requirement is that it must be easy to scale. So the next decision was to build the pipeline from the three services and connect them via queues. The choice fell on RabbitMQ as the queue broker.

Figure 1: The main communication scheme of the three services.

Finally, the general architecture concept was done (illustrated in Figure 1). It was simple, easy for developers to understand and easy to scale. But one may say: “Stop! SOA is awesome, but what do you have inside each service? What allows you to scale the fetching, parsing and analysis processes?” It's a good question, and we'll try to answer it below.

Technologies used

OK, let's talk about the technologies inside each service. In this article we will mostly describe the Fetch Service, but all of the services in our system have the same architecture. Let's start with the common parts. Each service consists of four components (Figure 2).

The first one is the processing module, which contains all the program logic for processing data. It is based on Gearman workers and clients. This choice was made because Gearman is an easy way to organize distributed processing, and it is easy to scale. The clients and workers are run and controlled by Supervisor so that they form a never-stopping work loop. This is necessary because tasks are created from queue messages, so the queue has to be read permanently, and tasks should not be interrupted. If any worker or client goes down, Supervisor restarts it immediately.
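As a minimal sketch of what such a processing module looks like, here is a Gearman worker loop using the pecl/gearman extension; the function name fetch_url and the workload format are illustrative, not the project's actual code:

// Worker side of the processing module: register a function and loop forever.
// Supervisor would be configured to restart this script if it ever exits.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);

// 'fetch_url' is an illustrative function name; the workload is assumed to be a URL.
$worker->addFunction('fetch_url', function (GearmanJob $job) {
    $url  = $job->workload();
    $body = file_get_contents($url);   // real workers use a proper HTTP client
    return $body === false ? '' : $body;
});

while ($worker->work());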

The second component is the results storage, a MongoDB database. MongoDB is a good fit for storing web pages together with their meta information: web pages are documents, so it's an obvious choice. We use MongoDB [Editor’s note: see Derick Rethans‘ recent introduction] as the data storage for each of the three services. The document-oriented model and the JSON format are nice for unstructured data, and for structured data without a strict schema. One of the important tasks of the system is collecting statistics about web pages, and we have a lot of metrics, the number of which keeps increasing. With MongoDB we can easily add new metrics and data.
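As an illustration only (the database, collection and field names below are assumptions, and the example uses the legacy PECL mongo driver's MongoClient class), storing a fetched page with its metadata might look like this:

// Connect to MongoDB and pick a collection for fetched pages.
$mongo = new MongoClient('mongodb://localhost:27017');
$pages = $mongo->selectDB('fetch_service')->selectCollection('pages');

// A schema-less document: new metrics can simply be added as new fields later.
$pages->insert(array(
    'url'        => 'http://example.com/',
    'headers'    => array('Content-Type' => 'text/html'),
    'body'       => '<html>...</html>',
    'fetched_at' => new MongoDate(),
));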

Finally, the third and fourth parts of the service are queues. One queue receives requests from other services or from the system's clients, and the other queue carries notifications about completed tasks. As mentioned earlier, we use RabbitMQ for this. The solution works well, but RabbitMQ is a really powerful technology and possibly too complicated for our goals, so we are looking for alternatives that would give us a simpler solution; at the moment we are considering ZeroMQ.
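The snippet below sketches how a client might put a request onto a service's request queue using the php-amqplib library; the queue names, credentials and message format are illustrative assumptions:

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();

// One durable queue for incoming requests, one for completion notifications.
$channel->queue_declare('fetch.requests', false, true, false, false);
$channel->queue_declare('fetch.notifications', false, true, false, false);

// Ask the Fetch Service to fetch a URL.
$request = new AMQPMessage(json_encode(array('url' => 'http://example.com/')));
$channel->basic_publish($request, '', 'fetch.requests');

$channel->close();
$connection->close();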

Figure 2: Service structure.

Communication layer

So, communication is performed through the queues. It's a nice way to connect services, but it may not be obvious how to use them, so here is a typical communication scenario in the system.

There are two types of communication: between services, and between the whole system and an end client. Say, for example, the Parse Service needs new data. It sends a request to the Fetch Service's request queue and then carries on with its other tasks, because the request is asynchronous. When the Fetch Service reads the message from the request queue, it fetches the needed data from outside (a web site, file, API, etc.) and puts it into its data storage (the Mongo database). After that, the Fetch Service puts a message into the notification queue to say the operation has completed. The Parse Service receives this message and reads the fetched data from the storage. That is all there is to communication between services; all the others work like this example, illustrated in Figure 3.
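To make the flow concrete, here is a rough, simplified sketch of the Fetch Service side of that exchange, again using php-amqplib and the legacy mongo driver; all names, and the idea of using the MongoDB document id as the notification key, are our assumptions for illustration:

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();
$mongo      = new MongoClient();
$pages      = $mongo->selectDB('fetch_service')->selectCollection('pages');

// Read a request from the request queue, fetch the page, store it, then notify.
$channel->basic_consume('fetch.requests', '', false, true, false, false,
    function (AMQPMessage $message) use ($channel, $pages) {
        $request = json_decode($message->body, true);
        $doc     = array(
            'url'  => $request['url'],
            'body' => file_get_contents($request['url']),  // stand-in for the real fetch worker
        );
        $pages->insert($doc);                              // insert() adds an _id to $doc

        // Publish the storage key so the requesting service knows where to find the result.
        $notice = new AMQPMessage(json_encode(array('key' => (string) $doc['_id'])));
        $channel->basic_publish($notice, '', 'fetch.notifications');
    }
);

while (count($channel->callbacks)) {
    $channel->wait();
}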

Figure 3: Service communication.

Because the system is used by external clients, and they cannot have direct access to each service's queues and data storage, we developed a simple class to unify access to the mining system. We use it for inter-service communication too. Look at the simple example below:

// Create a node for parsing and attach it to the Fetch Service in the pipeline.
$node = new Node('parse', 'fetch');
// Send a request to the request queue of the Fetch Service.
// (The method name below is illustrative; the original listing omits the actual call.)
$node->sendRequest('http://example.com/');
// Read the notification queue of the Fetch Service; if a notification
// is received, get the result from the data storage.
if ($key = $node->getNotify()) {
    $data = $node->getResult($key);
}

In $data we now have an object with the page content and header information. How does this differ from the file_get_contents() function? In this small example, in just one way: we can process other results while the page is being fetched. But in a real deployment we get parallelized fetching, asynchronous processing of results and easy scaling to fetching thousands of pages simply by running more Gearman workers.
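A rough sketch of what that parallelism looks like on the Gearman client side (the function and callback names are hypothetical, not the project's actual client code):

// Client side: queue many fetch tasks and let the pool of workers run them in parallel.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

$client->setCompleteCallback(function (GearmanTask $task) {
    // Each result arrives asynchronously as soon as a worker finishes.
    echo 'Fetched ' . strlen($task->data()) . " bytes\n";
});

foreach ($urls as $url) {              // $urls: a list of pages to fetch
    $client->addTask('fetch_url', $url);
}

$client->runTasks();                   // blocks until all queued tasks are done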

Fetch Service

Let's talk a little more about the Fetch Service. This part of the system is the basis, the first stage in the pipeline. It solves the following problems:

  1. Requesting content from the network or a local machine
  2. Handling errors and exceptions (such as HTTP errors)
  3. Providing basic information about the results (HTTP headers, file stats)

It is a very important part of any mining or parsing system, and the SOA solution drives all the headaches away: we just say “give me data” and it hands back what we want. This approach lets us create different Gearman workers for fetching data from specific sources, such as a social API, XML, SOAP and others, while keeping the interaction interface as simple as possible.

Also remember that fetching data is just fetching data. If the fetching layer is built like our Fetch Service, many services can be created on top of it: crawlers, parsers, aggregators. And in every case, there is no need to think about working with the network.

We have talked about the queues and the data storage, so let's move on to the Gearman fetch workers. In general these are simple HTTP clients implemented with the pecl_http extension. Many PHP developers ignore this extension, but it is very convenient and worth a closer look (a minimal fetch sketch follows the list below). So we have simple code for fetching data and handling errors. However, we faced problems while creating a web crawler – two of them, in fact:

  1. Storing all of a site's pages together with the links between them
  2. Keeping track of which pages have already been viewed and which have not
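Before looking at those two problems, here is the promised minimal fetch sketch. It uses the class-based API of pecl_http v1 (HttpRequest); the URL and the way errors are reported are illustrative:

try {
    // Issue a GET request and read status, headers and body from the response.
    $request = new HttpRequest('http://example.com/', HttpRequest::METH_GET);
    $request->send();

    $result = array(
        'code'    => $request->getResponseCode(),
        'headers' => $request->getResponseHeader(),
        'body'    => $request->getResponseBody(),
    );
} catch (HttpException $e) {
    // Network-level failures (DNS, timeouts, ...) end up here and are reported upstream.
    $result = array('error' => $e->getMessage());
}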

The first problem: storing the relations makes the database very large. We use MySQL for it, and the relation tables take up gigabytes. That is OK, but some operations become hard. Of course MySQL can be tuned a great deal, but we found another solution. Since the pages and the links between them form a graph, a graph-oriented DBMS looked like a natural choice, and we picked Neo4j. Although we are only beginning to develop with it, we have optimistic expectations.
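To give an idea of what storing the link graph could look like, the sketch below posts a Cypher statement to Neo4j 2.x's transactional REST endpoint; the node label, relationship type and endpoint URL are assumptions, not the project's actual schema:

// MERGE creates the two page nodes and the link between them only if they don't exist yet.
$statement = array('statements' => array(array(
    'statement'  => 'MERGE (a:Page {url: {from}}) '
                  . 'MERGE (b:Page {url: {to}}) '
                  . 'MERGE (a)-[:LINKS_TO]->(b)',
    'parameters' => array('from' => 'http://example.com/', 'to' => 'http://example.com/about'),
)));

$ch = curl_init('http://localhost:7474/db/data/transaction/commit');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($statement));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);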

The second big problem we faced was storing information about page views. When the system is crawling a site, we need to know which pages have already been viewed and which have not. At the beginning this information was stored in MySQL, but that was painful: the database was queried very frequently, and there was a big overhead just to store one flag per URL. After some discussion, and with the help of my friend Artyom Zaytsev, a solution was found.

The general idea is as follows: store the flag as a single bit in memory and use a hash of the URL to address that bit. For the implementation, a PHP extension was written that maps files into memory and accesses them directly. The extension allows us to set the “viewed” bit for a URL hash and to read the bit's current state. It works very fast.
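Our extension is not public, but the idea itself is easy to illustrate in plain PHP with a string used as a bit array (the hash function, bitmap size and helper names are purely illustrative, and a real implementation has to deal with hash collisions and persistence):

// A string used as a bit array: one byte holds 8 flags, 2^24 bits take about 2 MB.
define('BITMAP_BITS', 1 << 24);
$bitmap = str_repeat("\0", BITMAP_BITS >> 3);

function markViewed(&$bitmap, $url) {
    $bit = crc32($url) & (BITMAP_BITS - 1);           // hash the URL into a bit position
    $bitmap[$bit >> 3] = chr(ord($bitmap[$bit >> 3]) | (1 << ($bit & 7)));
}

function wasViewed($bitmap, $url) {
    $bit = crc32($url) & (BITMAP_BITS - 1);
    return (ord($bitmap[$bit >> 3]) & (1 << ($bit & 7))) !== 0;
}

markViewed($bitmap, 'http://example.com/about');
var_dump(wasViewed($bitmap, 'http://example.com/about'));   // bool(true)
var_dump(wasViewed($bitmap, 'http://example.com/contact')); // bool(false), barring a collision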

So, I guess it's time to finish here. There are many aspects of developing mining systems that haven't been described in this article. But keep in mind that if you develop a complicated system with many components, you need to think about architecture, and first of all about the single responsibility principle. Isolate the components of the software and connect them in simple ways. You will end up with a system that is easy to scale, easy to observe and easy to use. Divide and conquer!

Special thanks to Andrew Kholmanyuk and Nikolay Karpenko for helping to correct the article.

Kirill Zorin is a developer and software architect from Russia who has been developing recommendation systems for about two years and is now working on a web-analytics project. He is also interested in the theoretical aspects of data mining. He currently lives in Crimea, having fun with coding and sun. You can contact him by email:
