Quickstart

intake-elasticsearch provides quick and easy access to data stored in ElasticSearch.

This plugin reads ElasticSearch query results without random access: there is only ever a single partition.

Installation

To use this plugin for intake, install with the following command:

conda install -c intake intake-elasticsearch

Usage

Ad-hoc

After installation, the functions intake.open_elasticsearch_table and intake.open_elasticsearch_seq become available. They can be used to execute queries against an ElasticSearch server and download the results as a data-frame or as a sequence of dictionaries, respectively.

Three parameters are of interest when defining a data source:

  • query: the query to execute, which can be written in either Lucene or JSON syntax; in both cases it is passed as a string.

  • qargs: further arguments to pass along with the query, such as the index(es) to consider, sorting, and any filters to apply.

  • any other keyword arguments are passed as parameters to the server connection instance (for example, host and port).

In the simplest case, this might look something like:

import intake
source = intake.open_elasticsearch_seq("*:*", host='elastic.server', port=9200,
    qargs={'index': 'mydocuments'})
result = source.read()

Where "*:*" is Lucene syntax for “match all”, so this will grab every document within the given index, as a data-frame. The host and port parameters define the connection to the ElasticSearch server.

Further parameters can be used to modify how the source works; these rarely need to be changed (see the sketch after this list):

  • scroll: a text string specifying how long the query remains live on the server

  • size: the number of entries to download in a single call; smaller values download more slowly, but may be more stable.
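
For instance, a query over a very large index might hold the scroll window open for longer and fetch bigger batches. A minimal sketch, assuming scroll and size are passed as keyword arguments alongside the connection parameters (the values are illustrative):

import intake

# scroll uses ElasticSearch duration syntax (e.g. '5m' is five minutes);
# size sets how many hits are fetched per request.
source = intake.open_elasticsearch_seq("*:*", host='elastic.server', port=9200,
    qargs={'index': 'mydocuments'}, scroll='5m', size=1000)
result = source.read()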

Creating Catalog Entries

Catalog entries must specify driver: elasticsearch_seq for the sequence-of-dictionaries version, or driver: elasticsearch_table for the data-frame version.

Aside from this, the same arguments are available as for ad-hoc usage. Note that queries are commonly multi-line, especially when using JSON syntax, so the YAML "|" literal block indicator should be used to define them within the catalog file. A full entry may look something like:

sources:
  es_data:
    driver: elasticsearch_table
    args:
      qargs:
        index: intake_test
        doc_type: entry
      query: |
        {
          "query": {
            "match": {"typeid": 1}
          },
          "sort": {
            "price": {"order": "desc"}
          },
          "_source": ["price", "typeid"]
        }
      host: intake_es

where we have specified the index and document type (each of these could have been a list), the fields to extract, the sort order, and a match term, loosely equivalent to "WHERE typeid = 1" in SQL.
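
For example, a hypothetical variant of the qargs block, targeting two indices and two document types at once (the second names are illustrative):

qargs:
  index: [intake_test, intake_test_2]
  doc_type: [entry, item]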

Using a Catalog

Assuming a catalog file 'cat.yaml' containing an entry called 'es_data', the corresponding data-frame could be fetched as follows:

import intake
cat = intake.Catalog('cat.yaml')   # load the catalog file
result = cat.es_data.read()        # execute the query and fetch all results

Since the query cannot be partitioned with this plugin, the other access methods of the data source (iteration, reading a single partition, creating a Dask data-frame) are not particularly useful here.
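
That said, these methods still work; with only one partition they simply return everything at once. A minimal sketch using the generic Intake source API:

ddf = cat.es_data.to_dask()    # Dask data-frame with exactly one partition
df = ddf.compute()             # same result as cat.es_data.read()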