How to Integrate Elasticsearch with MongoDB

Recently, I faced an unusual requirement during the implementation of a project. My task was to implement a web crawler to index the content of a few websites and save that data into an Elasticsearch index for further analysis. The tricky decision in this case lay in the fact that I had no strong reason to keep the extracted data anywhere else, since all user interaction with this data would be done through a web application that connects directly to Elasticsearch. But if the Elasticsearch index mapping changed at any time in the future, I would be forced to re-index part or all of the data, which means extracting the same data from the websites again.

Adopting a relational database to address this need seemed to me an unjustified implementation effort. It would drastically increase the time, cost and complexity of implementing and maintaining the project, just to avoid the future risk of changes in my index mapping. Dealing with database modeling, choosing a persistence framework, implementing extra tests… I feel tired just thinking about it. So, talking with my friend Paulo about this problem, he told me about the elasticsearch-river-mongodb project, an Elasticsearch plugin that propagates data changes from a MongoDB collection to an Elasticsearch index.

Using MongoDB seemed like a good idea. The data extracted from websites is not well structured and is highly likely to change frequently. A schema-free, document-oriented database fits well in this case, since it is flexible enough to accommodate changes in the data structure with minimal impact.

But how to integrate Elasticsearch with MongoDB?

Despite the fact that the elasticsearch-river-mongodb project seems to be awesome, offering filter and transformation capabilities, it is deprecated, with Elasticsearch 1.7.3 and MongoDB 3.0.0 as the most recently supported versions. You can find more information about the deprecation decision in the article “Deprecating Rivers”.

It is a shame, but all is not lost. The MongoDB team offers the mongo-connector project, which creates a pipeline from MongoDB to target systems and provides a document manager for Elasticsearch. Great! And I’m so happy with the final result of this solution that I want to share my experience with you. My intention throughout this post is to show what I found useful, what was tricky and what limitations I ran into during the implementation of this solution.

My development environment is configured with the following components:

  • Ubuntu Desktop 16.04
  • Oracle Java SE SDK 1.8.0_91
  • MongoDB Community 3.2.8
  • Elasticsearch 2.3.5
  • Mongo-Connector 2.4.1.dev0
  • Elastic2-Doc-Manager 0.1.0
  • Python 2.7.11+
  • Pip 8.1.1

I won’t explain the details of how I set up my environment, because all the products mentioned have really nice pages with instructions on how to install each one. This article will focus only on how to set up the pipeline between MongoDB and Elasticsearch.

Starting and seeding our MongoDB with some records

In order to use mongo-connector, our MongoDB server must be started as a replica set. The settings to deploy it in a production environment are slightly more complex, but for development purposes, we can start the mongod daemon with a minimal set of parameters.
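Something like the following should work for development (the replica set name and data path below are illustrative):

    mongod --replSet rs0 --dbpath /data/db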

After that, we need to open the mongo CLI client and initiate the replica set.
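For a single-node development setup, the default configuration is enough; a minimal sketch:

    $ mongo
    > rs.initiate()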

Let’s create a database called ‘blogs’ with a collection called ‘articles’ and insert our first document.
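A sketch of what that can look like in the mongo shell (the document contents are illustrative):

    > use blogs
    > db.articles.insert({
        title: "How to integrate Elasticsearch with MongoDB",
        content: "Some extracted article content.",
        indexed_at: new Date(),
        tags: ["elasticsearch", "mongodb"]
      })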

As we can see in the above snippet, we are creating a new collection of documents called articles with the following attributes:

  • title (String): the indexed article’s title.
  • content (String): the indexed article’s content.
  • indexed_at (Date): the date when the article was indexed.
  • tags (Array): an array with the article’s tags.

This structure will be useful for us to understand the basic behavior of the pipeline and how the data types are mapped between the two products.

Starting mongo-connector

Assuming that (a) both services are running on localhost and (b) they were configured to use the default TCP ports, we can start mongo-connector with default settings from the command line.
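The invocation, following the elastic2-doc-manager documentation, looks like this:

    mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager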

  • -m argument defines the host and port of MongoDB.
  • -t argument defines the host and port of Elasticsearch.
  • -d defines the document manager to be used.

Once we have started mongo-connector, everything will be synchronized.

Exploring Elasticsearch

Let’s check if an index for our ‘blogs’ database was created.
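One way to do this is with the cat indices API (assuming Elasticsearch on localhost:9200):

    curl -XGET 'http://localhost:9200/_cat/indices?v'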

The output of the above command should look similar to the listing below.
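Illustrative output for a single synchronized document (the exact counts and sizes will vary):

    health status index pri rep docs.count docs.deleted store.size pri.store.size
    yellow open   blogs   5   1          1            0      4.7kb          4.7kb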

As we can see, an index called blogs was successfully created on Elasticsearch. Now, let’s see if a document type called articles, representing our collection ‘articles’, was created.
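We can inspect the generated mapping with the following request:

    curl -XGET 'http://localhost:9200/blogs/_mapping?pretty'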

The output of the above command should resemble the following.
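A sketch of the dynamically generated mapping (field order and the exact date format string may differ):

    {
      "blogs" : {
        "mappings" : {
          "articles" : {
            "properties" : {
              "content" : { "type" : "string" },
              "indexed_at" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "tags" : { "type" : "string" },
              "title" : { "type" : "string" }
            }
          }
        }
      }
    }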

As you will notice, all attributes of our collection were mapped to similar data types in the Elasticsearch index. Now, let’s check whether the record itself was synchronized.
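A simple match-all search is enough for that:

    curl -XGET 'http://localhost:9200/blogs/articles/_search?pretty'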

The above command should generate output similar to the listing below.
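An illustrative result for our single document (the _id mirrors the MongoDB ObjectId and will differ in your environment, and the exact representation of indexed_at may vary):

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
      "hits" : {
        "total" : 1,
        "max_score" : 1.0,
        "hits" : [ {
          "_index" : "blogs",
          "_type" : "articles",
          "_id" : "57a9d3c4e4b0a1b2c3d4e5f6",
          "_score" : 1.0,
          "_source" : {
            "title" : "How to integrate Elasticsearch with MongoDB",
            "content" : "Some extracted article content.",
            "indexed_at" : "2016-08-09T10:15:30.000Z",
            "tags" : [ "elasticsearch", "mongodb" ]
          }
        } ]
      }
    }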

All right! Everything is working properly. Try updating the document in MongoDB or removing it, and you will see that the changes are propagated to the Elasticsearch index.

One step forward: configuration file

It’s possible to start the connector by passing a configuration file as a command-line argument, as below.
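With the -c option:

    mongo-connector -c config.json

A minimal config.json could look like this (a sketch based on the options documented in the project wiki):

    {
      "mainAddress": "localhost:27017",
      "docManagers": [
        {
          "docManager": "elastic2_doc_manager",
          "targetURL": "localhost:9200"
        }
      ]
    }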

All the available configuration options can be checked in detail on the official project wiki page. Highlights include the ability to define which collections to read from MongoDB, logging settings (including log rotation) and authentication credentials. It is also pretty simple to start the connector as a Linux service, and it worked smoothly on Ubuntu 16.04.

Limitations

I couldn’t find any way to perform transformations during the integration process. The configuration file allows us to define a list of fields to include or exclude, and to specify the name to be used for the Elasticsearch index, but not much more than that. Any kind of transformation must be performed on the Elasticsearch side after the synchronization, or you can shape your MongoDB document model to avoid the need for any transformation. At least in my case, the second option was not a problem.

Conclusion

If you are an orphan of elasticsearch-river-mongodb, or if you simply need to index your MongoDB documents in a seamless way, mongo-connector is a reliable solution despite its lack of transformation capabilities.