Recently, I faced an unusual requirement while implementing a project. My task was to build a web crawler to index the content of a few websites and save that data into an Elasticsearch index for further analysis. The tricky decision was that I had no strong reason to keep the extracted data anywhere else, since all user interaction with the data would happen through a web application that connects directly to Elasticsearch. However, if the Elasticsearch index mapping changed at any time in the future, I would be forced to re-index part or all of the data, which would mean extracting the same data from the websites again.
Adopting a relational database to address this need seemed to me an unjustified implementation effort. It would drastically increase the time, cost and complexity of implementing and maintaining the project, just to guard against a future risk of changes in my index mapping. Dealing with database modeling, choosing a persistence framework, implementing extra tests, … I feel tired just thinking about it. So, when I talked with my friend Paulo about this problem, he told me about the elasticsearch-river-mongodb project, an Elasticsearch plugin that propagates data changes from a MongoDB collection to an Elasticsearch index.
Using MongoDB seemed to be a good idea. The data extracted from websites is not well structured, and it is highly likely to change frequently. A schema-free, document-oriented database fits well in this case, since it is flexible enough to accommodate changes in the data structure with minimal impact.
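To make the schema-free point concrete, here is a minimal sketch in plain Python. It does not use MongoDB itself: a list of dictionaries stands in for a collection, and every field name (`url`, `title`, `tags`, `author`, ...) is a hypothetical example, not something taken from the actual project.

```python
import json

# Hypothetical crawl result from an early crawler version.
page_v1 = {"url": "https://example.com/a", "title": "First page", "body": "Some text"}

# A later crawler version extracts extra fields. The new shape can sit next
# to the old one without any schema migration.
page_v2 = {
    "url": "https://example.com/b",
    "title": "Second page",
    "body": "More text",
    "tags": ["news", "tech"],
    "author": {"name": "Jane Doe"},
}

# A plain list stands in for a MongoDB collection (assumption for illustration).
collection = [page_v1, page_v2]


def fields_in_use(docs):
    """Return the sorted union of top-level field names across all documents."""
    return sorted({key for doc in docs for key in doc})


# Both document shapes serialize to JSON, which is roughly what a document
# store persists and what an indexing pipeline would forward.
serialized = [json.dumps(doc) for doc in collection]
```

Documents with different shapes coexist in the same collection, so a change in what the crawler extracts only affects new documents and whatever code reads the new fields.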
But how do you integrate Elasticsearch with MongoDB?
Although the elasticsearch-river-mongodb project seems awesome, offering filtering and transformation capabilities, it is deprecated, with Elasticsearch 1.7.3 and MongoDB 3.0.0 as the most recent supported versions. You can find more information about the deprecation decision in the article “Deprecating Rivers”.
It is a shame, but all is not lost. The MongoDB team offers the mongo-connector project, which creates a pipeline from MongoDB to target systems and has a document manager for Elasticsearch. Great! And I’m so happy with the final result of this solution that I want to share my experience with you. My intention throughout this post is to show what I found useful, what was tricky and what limitations I found while implementing this solution.