Complex queries and fuzzy matches are easy with open source search and analytics tool Elasticsearch. Zachary Tong shows how to get to grips with the PHP driver.

Stretching your data with Elasticsearch


Regardless of your domain, you have data. Perhaps it is e-commerce data and product listings. Maybe you have a million clickstream entries from advertising. Or perhaps you have billions of log records that you need to analyze. Whatever it is, you have data. And you need to access that data somehow, to interrogate it and reveal trends, patterns and insights that benefit your business. Storing data is almost never a problem… but how do you query it?

Consider the three following questions:

  • Structured: Which employees are over 30?
  • Unstructured: Which employees have names starting with ‘Zac’?
  • Analytics: How many employees does each department have?

Traditional databases can answer the first question without a problem. Dates, numbers, exact values, ranges … structured search is simple for a database. Unstructured search, however, is nearly impossible with most databases. Their data structures simply do not allow them to answer unstructured, full text questions with any kind of performance. Likewise, analytics and aggregations are also very difficult. Real-time aggregations are even harder, and as data grows, you often resort to batch aggregations that run on a schedule.

Finally, what happens if you want to ask the following question?

  • “Show me the number of employees in each department who are over 30 and have names starting with ‘Zac’”

That will be painful in a database no matter how cleverly you organize your data and tweak your query. Unfortunately, these types of queries are often the most valuable to your business.
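To make the four questions concrete, here is a minimal sketch in plain PHP over a small in-memory array. The sample employees are invented for illustration, and the code uses neither a database nor Elasticsearch; it only shows what each kind of question asks for.

```php
<?php
// Hypothetical sample data, invented for this sketch.
$employees = array(
    array('name' => 'Zack Miller',   'age' => 34, 'department' => 'Engineering'),
    array('name' => 'Zachary Lee',   'age' => 28, 'department' => 'Engineering'),
    array('name' => 'Anna Schmidt',  'age' => 41, 'department' => 'Sales'),
    array('name' => 'Zacharias Roy', 'age' => 52, 'department' => 'Sales'),
    array('name' => 'Maria Gomez',   'age' => 45, 'department' => 'Sales'),
);

// Structured: which employees are over 30?
$over30 = array_filter($employees, function ($e) {
    return $e['age'] > 30;
});

// Unstructured: which employees have names starting with 'Zac'?
$isZac = function ($e) {
    return stripos($e['name'], 'Zac') === 0;
};
$zacs = array_filter($employees, $isZac);

// Analytics: how many employees does each department have?
// (array_column() requires PHP 5.5 or later.)
$perDepartment = array_count_values(array_column($employees, 'department'));

// Combined: employees per department who are over 30 and named 'Zac...'.
$combined = array_count_values(array_column(
    array_filter($over30, $isZac),
    'department'
));

print_r($perDepartment); // Engineering => 2, Sales => 3
print_r($combined);      // Engineering => 1, Sales => 1
```

Every question here is answered by walking the whole array, which is fine for five employees and hopeless for billions of records.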

Introducing Elasticsearch

Elasticsearch was born as a full text search engine, but it has grown to answer all four scenarios presented above. Elasticsearch can handle structured search just as well (and just as fast!) as unstructured search and real-time analytics.

At a high level, Elasticsearch is a distributed document store where every field is indexed and searchable in near-real-time. It speaks HTTP over a RESTful API, which means it is dead simple to interact with. It runs on one laptop just as well as on a 100-node cluster indexing literally terabytes of data.

It may seem “too good to be true”, but it is simply the result of Elasticsearch’s underlying data structure. While most databases rely on structures like B-Trees, Elasticsearch uses an inverted index. The nature of inverted indices means that all fields are indexed, and lookups are fast regardless of how much data you have.
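The idea behind an inverted index can be sketched in a few lines of plain PHP. This is a toy illustration with invented documents and a deliberately naive way of splitting text into terms; the real index inside Elasticsearch is far more sophisticated.

```php
<?php
// Hypothetical documents, keyed by document id.
$docs = array(
    1 => 'John Smith likes biking',
    2 => 'Jane Smith likes surfing',
    3 => 'John Doe likes surfing',
);

// Build the inverted index: term => list of ids of documents containing it.
$index = array();
foreach ($docs as $id => $text) {
    // Naive "analysis": lowercase the text and split it on whitespace.
    $terms = array_unique(preg_split('/\s+/', strtolower($text)));
    foreach ($terms as $term) {
        $index[$term][] = $id;
    }
}

// A lookup is a single array access, not a scan over every document.
print_r($index['smith']);   // ids 1 and 2
print_r($index['surfing']); // ids 2 and 3

// Documents containing both terms: intersect the two id lists.
$both = array_values(array_intersect($index['smith'], $index['surfing']));
print_r($both);             // id 2
```

The cost of a lookup depends on the term, not on how many documents are stored, which is the property the paragraph above describes.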

But more important than technical details, Elasticsearch is easy to get started with. You can get a cluster running in under five minutes.

Your first cluster

The best way to get familiar with Elasticsearch is to try it yourself. Let’s download Elasticsearch, install it and start a node:

$ curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.tar.gz
$ tar -xzf elasticsearch-*.tar.gz
$ cd elasticsearch-*
$ ./bin/elasticsearch -f
# -f keeps the node in the foreground, so run the check below from a second terminal
$ curl 'http://localhost:9200/?pretty'

{
  "tagline" : "You Know, for Search", 
  "ok" : true,
  "status" : 200,
  "name" : "Contrary",
  "version" : {
    "number" : "0.90.5",
    "snapshot_build" : false 
  }
}

The only requirement is that Java must be installed on your system. Once you start the node, you’ll see some startup text print to the terminal.

You’ll notice that Elasticsearch returned JSON output as the body of the HTTP response. Elasticsearch speaks JSON as its language of choice. Everything is JSON, including documents, requests and responses. This makes it very easy to interact with Elasticsearch from within PHP; JSON can easily be encoded/decoded into more useful associative arrays.
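As a small sketch of that round trip, the snippet below decodes a shortened copy of the response shown above with PHP's built-in json_decode() and encodes an array with json_encode(). The JSON string is pasted in by hand, so no running node is needed.

```php
<?php
// A shortened copy of the JSON response shown above.
$json = '{"ok": true, "status": 200, "name": "Contrary", "version": {"number": "0.90.5"}}';

// Passing true as the second argument yields associative arrays instead of objects.
$response = json_decode($json, true);

echo $response['name'], "\n";              // Contrary
echo $response['version']['number'], "\n"; // 0.90.5

// The reverse direction: an associative array becomes a JSON object.
$encoded = json_encode(array('name' => 'John Smith', 'age' => 26));
echo $encoded, "\n";                       // {"name":"John Smith","age":26}
```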

Indexing Documents with PHP

While we can interact with the node using curl (and you’ll notice that the documentation online is entirely in curl + JSON requests), we ultimately want to use PHP to query Elasticsearch from within our application.

Elasticsearch maintains an official PHP client, which we will use for the rest of the article. The PHP client is installed with Composer (see the installation directions in the client’s documentation).

Elasticsearch is a document-based system. Each document is simply a JSON object that contains the fields and values that you wish to store. The PHP client transparently converts associative arrays to JSON and back, so you never actually have to deal with raw JSON.

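// Load Composer's autoloader so the client classes can be found
require 'vendor/autoload.php';

$client = new Elasticsearch\Client();

// The document is a plain associative array
$document = array(
    'name' => 'John Smith',
    'age' => 26,
    'hobbies' => array('biking', 'surfing'),
    'employer' => array(
        'name' => 'MegaCorp',
        'size' => 49293
    )
);

// Where the document is stored: index, type and id
$params = array();
$params['body']  = $document;
$params['index'] = 'company';
$params['type']  = 'employees';
$params['id']    = 'JohnSmith';

$ret = $client->index($params);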

In the first part of the code above, we create a document that has some data. You’ll notice that $document contains a number of fields with different data types: strings, integers, arrays, and even objects. Elasticsearch will accept all of these.

Next we specify an index with a type and an id. Indices and types are basically logical namespaces to organize data. An Elasticsearch cluster can contain multiple indices, and a single index can contain multiple types. You can search across multiple indices and multiple types, at the same time, with no loss in performance. See this article for more details about the nature of an index in Elasticsearch.

Lastly, you’ll notice that when we indexed the data, we didn’t need to define a schema first. Elasticsearch will auto-detect the schema from your first document, which means you can prototype systems quickly and easily.

Search

Ok, so we’ve indexed a document. Let’s do some searching!

$params = array();
$params['index'] = 'company';
$params['type']  = 'employees';
$params['body']['query']['match']['name'] = 'John';

$results = $client->search($params);
print_r($results['hits']['hits']);

> Array
(
    [0] => Array
        (
            [_index] => company
            [_type] => employees
            [_id] => JohnSmith
            [_score] => 0.19178301
            [_source] => Array
                (
                    [name] => John Smith
                    [age] => 26
                    [hobbies] => Array
                        (
                            [0] => biking
                            [1] => surfing
                        )
                    [employer] => Array
                        (
                            [name] => MegaCorp
                            [size] => 49293
                        )
                )
        )
)

The script above performs a simple search for “John”. Elasticsearch has a number of different queries that you can choose from, but in this example we are using the Match Query.

The Match Query is a good default choice, offering a great “search box” experience with little configuration. In the above script we are searching the company index and employees type (which makes sense since that is where the document was indexed).

The Match Query is defined in the body of the search request. It will search the name field of every document for our query text. The search results come back as a big associative array. The list of matching documents (ranked by relevance score) is printed at the end. You can see that our original document was matched, and the search result contains the entirety of the original document (plus some extra metadata).
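In an application you would usually loop over the hits rather than dump them with print_r(). The sketch below does that against a hand-written stand-in for the $results array, trimmed to fields that appear in the output above, so it runs without a cluster.

```php
<?php
// Stand-in for the return value of $client->search($params),
// trimmed to fields that appear in the output above.
$results = array(
    'hits' => array(
        'hits' => array(
            array(
                '_index'  => 'company',
                '_type'   => 'employees',
                '_id'     => 'JohnSmith',
                '_score'  => 0.19178301,
                '_source' => array('name' => 'John Smith', 'age' => 26),
            ),
        ),
    ),
);

// Each hit carries metadata (_id, _score) next to the original document (_source).
$lines = array();
foreach ($results['hits']['hits'] as $hit) {
    $lines[] = sprintf(
        '%s (id %s, score %.3f)',
        $hit['_source']['name'],
        $hit['_id'],
        $hit['_score']
    );
}

echo implode("\n", $lines), "\n"; // John Smith (id JohnSmith, score 0.192)
```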

Fuzzy Searching

That example was kind of boring. Let’s try something that a database would struggle with: dealing with typos. There are many ways to deal with typos in Elasticsearch depending on your requirements, but we are going to use the Match Query again. You’ll soon learn that the Match Query is basically the Swiss Army knife of Elasticsearch.

$params = array();
$params['index'] = 'company';
$params['type']  = 'employees';
$params['body']['query']['match']['name'] = array(
    'query' => 'Jahn',
    'fuzziness' => 0.5,
);

$results = $client->search($params);

print_r($results['hits']['hits'][0]['_source']['name']);
> John Smith

The Match query can be configured to tolerate a certain amount of “fuzziness” when analyzing documents. This operation allows a number of “edits” to be made to the query.

For example, if we replace the ‘o’ in ‘John’ with an ‘a’, that counts as a single edit. The default fuzziness of 0.5 will allow one edit, and thus, the query above will match the document even though ‘Jahn’ is not stored in any document.
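The notion of an edit can be tried out with PHP's built-in levenshtein() function, which counts single-character insertions, deletions and substitutions. This is only an illustration of edit distance; it is an assumption that it resembles what Elasticsearch computes internally, and Elasticsearch may count some edits differently. The terms are written in lowercase by choice.

```php
<?php
// The term we pretend is stored in the index.
$stored = 'john';

// Number of single-character edits needed to turn each typo into the stored term.
$distances = array();
foreach (array('jahn', 'jon', 'jhon') as $typo) {
    $distances[$typo] = levenshtein($typo, $stored);
    printf("%s -> %s: %d edit(s)\n", $typo, $stored, $distances[$typo]);
}
// jahn -> john: 1 edit(s)
// jon -> john: 1 edit(s)
// jhon -> john: 2 edit(s)
```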

This may seem expensive computationally: how can you check typos without comparing every single value in the index? Luckily, Elasticsearch uses some very complicated Finite State Transducers that ensure that the operation is performant and does not, in fact, do a simple table scan.

Filtering Data

The queries we have looked at so far all calculate relevance scores. Some results are more relevant than other results, which is reflected in a higher score.

In many queries, there are certain elements that don’t need relevance scoring. For example, a number is either inside a range or it is not. There is no concept of being “more” in the range – it is just a yes/no answer.

In these situations, you should use a filter. Filters perform the role of structured search in Elasticsearch, and are very efficient since they can skip the entire scoring phase. Furthermore, Elasticsearch aggressively caches filter bitsets so that subsequent filtering is even faster. Filtering documents is simply a bitwise AND across multiple filters – an operation that is extremely fast.

$params = array();
$params['index'] = 'company';
$params['type']  = 'employees';

$query['match']['name'] = 'Smith';
$filter['range']['age']['gt'] = 20;
$params['body']['query']['filtered'] = array(
    'query' => $query,
    'filter' => $filter
);
$results = $client->search($params);

print_r($results['hits']['hits'][0]['_source']['name']);
> John Smith

In the above example, we use a Filtered Query. This is a special compound query that accepts a query and a filter. The query section is only executed on documents that match the filter, greatly reducing the number of computations required.

The query section is similar to what we’ve seen before: a Match Query looking at name. The filter section is using a Range filter and matching all documents where the age field is greater than 20.
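The bitwise AND across filters mentioned earlier can be pictured with ordinary integers. In the sketch below each bit stands for one document in a hypothetical index of eight documents; the second filter and the helper function are invented for illustration, and the real cached bitsets are an internal detail of Elasticsearch.

```php
<?php
// Bit N is set when document N matches the filter (bit 0 is the rightmost).
$ageOver20   = 0b10110110; // documents 1, 2, 4, 5, 7
$deptIsSales = 0b01100101; // documents 0, 2, 5, 6 (hypothetical second filter)

// Combining two filters is a single bitwise AND.
$both = $ageOver20 & $deptIsSales;

// Hypothetical helper: list the document ids whose bit is set.
function matchingDocs($bitset, $size)
{
    $ids = array();
    for ($doc = 0; $doc < $size; $doc++) {
        if ($bitset & (1 << $doc)) {
            $ids[] = $doc;
        }
    }
    return $ids;
}

print_r(matchingDocs($both, 8)); // documents 2 and 5
```

Once a bitset has been computed for a filter it can be reused for any later request that contains the same filter, which is why caching pays off.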

Composability

You might have noticed in the last query that we reused a Match Query inside of the Filtered Query. The Elasticsearch query DSL is entirely composable. Queries and Filters are independent components that can be nested inside of other compound queries/filters.

This makes the query DSL very powerful. You can easily tailor queries to your exact specification. Furthermore, it allows your application to build components of the query in separate locations and then stitch them all together in one location.
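One way to picture this is to build each component in its own function and stitch the pieces together at the end. The helper names below are invented for this sketch and are not part of the PHP client; the resulting array has the same shape as the Filtered Query request shown earlier.

```php
<?php
// Hypothetical helpers; each one returns an independent piece of the query DSL.
function nameQuery($text)
{
    return array('match' => array('name' => $text));
}

function olderThan($age)
{
    return array('range' => array('age' => array('gt' => $age)));
}

function filtered(array $query, array $filter)
{
    return array('filtered' => array('query' => $query, 'filter' => $filter));
}

// Stitch the components together in one place.
$params = array(
    'index' => 'company',
    'type'  => 'employees',
    'body'  => array('query' => filtered(nameQuery('Smith'), olderThan(20))),
);

// Show the request body as the JSON that would be sent.
$json = json_encode($params['body']);
echo $json, "\n";
// {"query":{"filtered":{"query":{"match":{"name":"Smith"}},"filter":{"range":{"age":{"gt":20}}}}}}
```

The $params array can then be passed to $client->search($params) exactly as in the earlier examples.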

Conclusion

Elasticsearch offers much, much more than what we discussed in this article. We’ve only scratched the surface, but hopefully it has given you a taste for the power of Elasticsearch. Luckily, Elasticsearch has been built with excellent defaults. In many cases, you can use defaults and simple queries until the point in time that you need more power. Then you can investigate a few new features and add them to your repertoire.

In this way, your query sophistication grows as you become more accustomed to the query DSL. But the system is designed to be easy from day one, which means you can begin prototyping right away, even if you only have a minimal understanding of how Elasticsearch works.

Elasticsearch is a system where the more you use it, the more uses you find. I hope this article has piqued your interest and that you will explore Elasticsearch some more in the future!

Zachary Tong is a developer at Elasticsearch and author of the official PHP client. He’s been writing articles about Elasticsearch for nearly two years, and has released several plugins to help newcomers understand what Elasticsearch is doing under the hood.
