Friday, February 16, 2018

Writing and integrating tests for logstash ruby filter script in maven build

Problem statement

In my current release of the product we had to do an Elasticsearch (ES) upgrade. The product was on ES 1.x and we had to move to ES 6.x. Since ES supports upgrading only from the previous major version, the only option we had was to re-index the data. We used the following approach:

  1. Implement an executable Java jar to directly read the index files of ES 1.x using the Lucene APIs and write each document to stdout (a minimal sketch of this reader follows the list).
  2. Use logstash to invoke the above jar and read the stdout.
  3. Create logstash filter configuration to transform the data from ES 1.x into ES 6.x compatible form.
  4. Once the data is transformed then write it to the ES 6.x instance.
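The reader itself is straightforward. A minimal sketch of step 1 is shown below; the class name and output handling are illustrative, and note that ES 1.x bundles Lucene 4.x, where FSDirectory.open() takes a File rather than a Path, so adjust for the exact Lucene version you compile against:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;
    import java.nio.file.Paths;

    public class IndexDumper {
        public static void main(String[] args) throws Exception {
            // args[0] points to one Lucene index directory, i.e. one ES 1.x shard.
            try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
                for (int i = 0; i < reader.maxDoc(); i++) {
                    // A real implementation should also skip deleted documents via the live docs bitset.
                    Document doc = reader.document(i);
                    BytesRef source = doc.getBinaryValue("_source");
                    if (source != null) {
                        // The original JSON of each document goes to stdout; logstash consumes this stream.
                        System.out.println(source.utf8ToString());
                    }
                }
            }
        }
    }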
This worked well for us except for #3. The transformation written as a logstash configuration file was neither readable nor maintainable. Editors treat the configuration file as plain text, so without any highlighting or indentation it was very hard to read. Also, we had to write transformations for almost all of our event types, which meant a long chain of if-elseif-elseif...else blocks. That not only hurt readability but would also have hurt transformation performance, since every event would have to run through many if-else checks. What we wanted was a clean, readable & maintainable transformation.

Solution

The logstash ruby filter was the way to go, so I rewrote the entire transformation as a ruby script, as follows:
  • Wrote a single ruby script file for transformation.
  • Avoided if-else statements and instead used lookups to determine which transformation needs to be applied per event type.
  • Invoked the logstash provided filters like mutate, date etc.
  • On top of this we also wrote product specific transformations.
  • Added logging to the script.

Integrating with maven build

While implementing the transformation in ruby made it more readable and maintainable, one thing was still missing: it was not testable outside logstash. One can write inline tests for a ruby filter, but they have to be executed within logstash. What I wanted was to integrate the tests with our maven build so that if there are any test failures, the build also fails.
Since I was new to ruby, I had to spend some time understanding ruby gems etc., and finally was able to run the ruby transform script both in the maven build and in logstash. This was achieved as follows:
  • Copied the logstash deployment folder locally along with our maven project.
  • In the maven pom.xml used the exec-maven-plugin to execute the ruby script.
  • For the script to run, set the paths correctly for the jars as well as the ruby gems.
  • Quickly implemented a small testing framework in ruby which would:
    • Scan for test classes.
    • Execute the test methods. Test methods were expected to have names starting with test_ e.g. test_event_types.
    • If the test fails then log the error.
    • Execute all the tests and then print summary like total tests executed, tests passed/failed etc.
    • If there is even a single test failure then fail the maven build.
  • Once integrated we wrote many tests around the transformation per event type.

Implementation

The working sample for above approach is available at: https://github.com/ajeydudhe/java-pocs/tree/master/logstash-ruby-transform

Tuesday, October 10, 2017

Simple sequential workflow using Spring framework and Spring Expression Language

Requirement

Recently, in one of my projects, there was a requirement for a simple workflow around some background processing tasks. Most of the tasks were already implemented, so we only had to implement the workflow around them.

Available solutions

There are many workflow frameworks available for Java. Even though they would have met the requirement, they seemed too heavy for something this simple. Next, there were blogs on how to use the Spring framework to create simple workflows using Spring beans. This was indeed what I was looking for. However, something was missing. Each action in the workflow can be a Spring bean with a simple interface having an execute method, as follows:
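The snippet from the original post is not reproduced here; the interface is essentially just this (the name and generics are illustrative):

    // A single step in the workflow: inputs are injected into the bean,
    // execute() produces this action's output.
    public interface Action<T> {
        T execute();
    }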

But how do we pass parameters to the actions? Most implementations pass a map between actions via the execute() method. This means an action adds its result to the map using a key and the next action reads the value from this map. It makes the actions aware of each other, or at least aware of the keys used for passing information. If the key or the data type of the value changes, the next action reading the value breaks.

Expected behavior

It would have been far better to implement an action as a normal class with its dependencies being injected using, say, Spring dependency injection. Then during workflow execution the execute() method can be invoked to get the desired output. We should be able to pass the output from one action to the next action without the actions being aware of each other.

Solution

Using the Spring Expression Language

The Spring Expression Language (SpEL) allows simple expressions to be defined and evaluated against a specified object instance. The bean XML allows such expressions to be used to inject property values. The idea was to use SpEL to define the values to be injected into a workflow action. Let's take the example of a simple action which takes a string as input and returns the reversed string as output. The action is implemented as follows:
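A sketch of such an action (the class name is illustrative):

    // Reverses the string passed in through the constructor.
    public class ReverseString implements Action<String> {

        private final String source;

        public ReverseString(final String source) {
            this.source = source;
        }

        @Override
        public String execute() {
            return new StringBuilder(source).reverse().toString();
        }
    }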

It simply takes a string value as input in the constructor and returns the reversed string from the execute() call. The instance of the above action will be created in code using the application context's getBean() method. But how do we inject the required value? For that, let's first have a look at how to define the bean for this action class.
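The bean definition looks roughly like this (the bean id and package are illustrative; the bean would typically be declared with prototype scope so that each workflow run gets a fresh instance):

    <bean id="reverseStringAction" class="my.workflow.actions.ReverseString" scope="prototype">
        <constructor-arg value="%{source}" />
    </bean>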

Above is a simple bean definition mentioning the implementation class and the parameters; in this case we are indicating the value to be passed to the constructor. Notice that the value is actually an expression starting with %{ and ending with }. We define our own expression prefix using % because Spring already uses # and $ to identify its expressions. The variable name source can be anything you want. For our example, source refers to the original input to the workflow. It can be anything from a simple string, integer or class instance to an array, set, map etc.

For our workflow, we will have a list of beans representing the actions. The workflow is responsible for creating instances of the actions and executing them, and while doing so it should be able to pass in the parameters defined by our custom expressions. Let's see how we can do this.

Using custom expression resolver

Spring allows a custom expression resolver to be registered by implementing the BeanExpressionResolver interface. Following is the relevant part of our BeanInstantiator class, which implements this interface as well as the BeanFactoryPostProcessor interface:
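The original snippet is not reproduced here; the following is a minimal sketch of the idea, with doEvaluate() shown inline, error handling omitted, and only the simple whole-value %{...} form handled (the post also supports expressions embedded inside larger strings). The EXPRESSION_EVALUATION_CONTEXT thread-local is explained further below.

    import org.springframework.beans.BeansException;
    import org.springframework.beans.factory.config.BeanExpressionContext;
    import org.springframework.beans.factory.config.BeanExpressionResolver;
    import org.springframework.beans.factory.config.BeanFactoryPostProcessor;
    import org.springframework.beans.factory.config.ConfigurableListableBeanFactory;
    import org.springframework.expression.EvaluationContext;
    import org.springframework.expression.spel.standard.SpelExpressionParser;

    public class BeanInstantiator implements BeanFactoryPostProcessor, BeanExpressionResolver {

        // Holds the evaluation context for the workflow execution running on this thread.
        private static final ThreadLocal<EvaluationContext> EXPRESSION_EVALUATION_CONTEXT = new ThreadLocal<>();

        private BeanExpressionResolver existingResolver;

        @Override
        public void postProcessBeanFactory(final ConfigurableListableBeanFactory beanFactory) throws BeansException {
            // Preserve the default resolver and register this class in its place.
            this.existingResolver = beanFactory.getBeanExpressionResolver();
            beanFactory.setBeanExpressionResolver(this);
        }

        @Override
        public Object evaluate(final String value, final BeanExpressionContext evalContext) throws BeansException {
            // Handle only our custom %{...} expressions; delegate everything else to the original resolver.
            if (value == null || !value.contains("%{")) {
                return existingResolver != null ? existingResolver.evaluate(value, evalContext) : value;
            }
            return doEvaluate(value);
        }

        private Object doEvaluate(final String value) {
            // Strip the %{ and } markers and evaluate the inner expression against the current context.
            final String expression = value.substring(value.indexOf("%{") + 2, value.lastIndexOf('}'));
            return new SpelExpressionParser().parseExpression(expression)
                                             .getValue(EXPRESSION_EVALUATION_CONTEXT.get());
        }
    }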

The postProcessBeanFactory() method belongs to the BeanFactoryPostProcessor interface. Here we register our custom resolver, which is this same class, while preserving the existing resolver.

The evaluate() method belongs to the BeanExpressionResolver interface. Whenever Spring encounters an expression, this method is called. Here we check whether the expression is our custom expression. The expression string can be enclosed in %{} e.g. %{source}, or it can be part of another string value e.g. This will have %{source} expression embedded inside. If we have a custom expression then:
  • we create instance of SpelExpressionParser.
  • parse the expression.
  • invoke the parsed expression and pass in the required data.
The evaluate() method delegates the call to the doEvaluate() method, which performs the above steps. One thing abstracted away here is the EXPRESSION_EVALUATION_CONTEXT, which is an instance of EvaluationContext. We store this value in a thread local because the evaluate() method is called internally by Spring and there is no other way for us to pass the context in.

Passing the context to Spring expression resolver

In the above snippet we have seen how to add our own expression resolver. But how do we pass the required data to the resolver? In our case we used the expression %{source}, meaning: pass the original source input to the action instance through its constructor. Following is the additional part of the same BeanInstantiator class (the previous methods are omitted for readability):
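Again a sketch rather than the exact code from the post, continuing the same BeanInstantiator class, so the EXPRESSION_EVALUATION_CONTEXT thread-local from the previous snippet is the one used here; the getAction() signature and the getters on the root object are illustrative:

    // Continuing the BeanInstantiator class above, which additionally implements
    // org.springframework.context.ApplicationContextAware. Additional imports:
    // org.springframework.context.ApplicationContext and
    // org.springframework.expression.spel.support.StandardEvaluationContext.

    private ApplicationContext applicationContext;

    @Override
    public void setApplicationContext(final ApplicationContext applicationContext) {
        this.applicationContext = applicationContext;
    }

    // Creates the action bean, making 'context', 'source' and 'output' available to %{...} expressions.
    public Object getAction(final String actionBeanId, final Object context,
                            final Object source, final Object output) {
        // Root object whose public getters back the expression variables.
        final Object rootEvalObject = new Object() {
            public Object getContext() { return context; }
            public Object getSource()  { return source; }
            public Object getOutput()  { return output; }
        };

        // Make the evaluation context visible to evaluate(), which Spring calls internally
        // while resolving our %{...} expressions inside getBean().
        EXPRESSION_EVALUATION_CONTEXT.set(new StandardEvaluationContext(rootEvalObject));
        try {
            return applicationContext.getBean(actionBeanId);
        } finally {
            EXPRESSION_EVALUATION_CONTEXT.remove();
        }
    }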

The BeanInstantiator class also implements the ApplicationContextAware interface, so the application context is passed to us in the setApplicationContext() method.

The BeanInstantiator class exposes the getAction() method. This method takes the bean id of the workflow action to be instantiated and then uses the applicationContext.getBean() method to create an instance of the action class. But how will our custom expression be resolved to inject the value specified by %{source}?

The getAction() method first creates an instance rootEvalObject of an anonymous class having the getters getContext(), getSource() and getOutput() returning the original values.

It then creates an instance of StandardEvaluationContext, passing in the rootEvalObject.

But getAction() is our method and we are just calling the applicationContext.getBean() to get the instance. Spring will internally call the evaluate() method on our class since it also implements the BeanExpressionResolver interface.

getAction() stores the StandardEvaluationContext instance in the thread-local variable EXPRESSION_EVALUATION_CONTEXT and then calls applicationContext.getBean() to create the instance. In the evaluate() method called by Spring we retrieve this context and evaluate the expression against it.

Since we have exposed the getters getContext(), getSource() and getOutput(), you can refer to them either as properties or as getters directly in expressions, e.g. %{source} or %{getSource()}.

You can add as many properties as you want, like getContext(), getOutput() etc., returning whatever values you need. You can even add methods to be used in expressions, e.g. to transform data or to fetch some configuration.

Workflow-lite implementation

The implementation for the workflow is at: https://github.com/ajeydudhe/workflow-lite

It allows one to define & execute a workflow using a UML activity diagram. It is mostly useful for implementing predefined workflows in a product.

Thursday, June 22, 2017

Elasticsearch: Extending ES using AOP

Elasticsearch provides the ability to extend its basic functionality using scripts or plug-ins. At the same time, ES has restrictions on changing or extending existing functionality such as the search actions. For example, say I need to change the result set for every query, or change the input search criteria. In such cases we can use AOP (Aspect-Oriented Programming) to extend ES functionality.
AOP is a vast topic in itself, so I will not cover it here. Basically, it allows you to control what happens before and after a method executes, change the original input parameters, change the return values etc.
For this post we will monitor the ES search parameters using AOP. We will use AspectJ with a maven project as follows:
  • Create the maven project
  • Find out the ES methods to be monitored
  • Define the Pointcuts & Advices
  • Compile the jar
  • Use Load Time Weaving
  • Start ES and monitor the queries

Source code
The source code for this example is located here.

Create the maven project
Create a simple maven jar project with the following POM (a trimmed-down sketch is shown below).
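The full pom.xml is in the linked source; the relevant parts look roughly like this (the AspectJ version matches the one used later in this post, the plugin version is illustrative):

    <dependencies>
      <dependency>
        <groupId>org.aspectj</groupId>
        <artifactId>aspectjrt</artifactId>
        <version>1.8.9</version>
      </dependency>
      <dependency>
        <groupId>org.aspectj</groupId>
        <artifactId>aspectjweaver</artifactId>
        <version>1.8.9</version>
      </dependency>
    </dependencies>

    <build>
      <plugins>
        <plugin>
          <groupId>org.codehaus.mojo</groupId>
          <artifactId>aspectj-maven-plugin</artifactId>
          <version>1.10</version>
          <configuration>
            <complianceLevel>1.8</complianceLevel>
          </configuration>
          <executions>
            <execution>
              <goals>
                <goal>compile</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>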
  • Add dependencies for aspectjrt and aspectjweaver.
  • Add the aspectj-maven-plugin so that the aspects are compiled along with the project.

ES methods to monitor
We will monitor the searches executed by ES. The actual search execution is done by the SearchService class. We will monitor the executeQueryPhase() and executeFetchPhase() methods on this class.

Define Pointcuts & Advices
The SearchServiceAspect class defines the pointcuts & advices using annotations. Following is the definition to monitor the SearchService.executeQueryPhase() method:
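The snippet is not reproduced from the post; a minimal sketch of the advice looks like this (package names for the ES types are as in ES 5.x, so verify them against your exact version):

    import org.aspectj.lang.ProceedingJoinPoint;
    import org.aspectj.lang.annotation.Around;
    import org.aspectj.lang.annotation.Aspect;
    import org.elasticsearch.action.search.SearchTask;
    import org.elasticsearch.search.internal.ShardSearchTransportRequest;

    @Aspect
    public class SearchServiceAspect {

        // Around advice for SearchService.executeQueryPhase(); its two arguments are bound to our parameters.
        @Around("execution(* org.elasticsearch.search.SearchService.executeQueryPhase(..)) && args(request, task)")
        public Object onExecuteQueryPhase(final ProceedingJoinPoint joinPoint,
                                          final ShardSearchTransportRequest request,
                                          final SearchTask task) throws Throwable {
            // Just log which shard the query targets and continue with the original execution.
            System.out.println("executeQueryPhase for shard: " + request.shardId());
            return joinPoint.proceed();
        }
    }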
  • The @Around annotation defines the around advice.
  • The advice is for the ES method SearchService.executeQueryPhase(), which takes two parameters: ShardSearchTransportRequest & SearchTask.
  • It also states that these two arguments should be passed on to our method, which will handle the execution.
  • The annotated method in our class is the one executed when SearchService.executeQueryPhase() is about to run.
  • In this method we can decide what to do, i.e. do some processing and/or continue the original execution using joinPoint.proceed().
  • For this example, we are just logging the search query information.

aop.xml
Next we define the aop.xml, which acts as input to the AspectJ load-time weaver:
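A sketch of the aop.xml (the aspect class name is the one used in this post):

    <aspectj>
      <aspects>
        <aspect name="my.elasticsearch.aspects.SearchServiceAspect" />
      </aspects>
      <weaver options="-verbose -showWeaveInfo">
        <include within="org.elasticsearch.search.SearchService" />
      </weaver>
    </aspectj>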
  • The aspect element declares that our pointcuts & advices are in the class my.elasticsearch.aspects.SearchServiceAspect.
  • The weaver options make the weaver more verbose, which helps in troubleshooting issues if any.
  • The include element tells the weaver which class in the target application is to be woven.
  • The aop.xml needs to be placed at src/main/resources/META-INF in the project.

Deploying the jars
  • Compile the project and generate the jar using command: mvn install
  • Copy the generated es5x-method-interceptor-0.0.1-SNAPSHOT.jar into the elasticsearch/lib folder as per the ES installation.

Configure ES startup
Now, we need to start ES using the aspect weaver as java agent. 
  • Copy the aspectjweaver-1.8.9.jar into elasticsearch/bin folder.
  • Open the bin/elasticsearch.bat or bin/elasticsearch.sh file (I have verified with Windows batch file)
  • Add following line before the java command execution to launch the ES
             SET ES_JAVA_OPTS=%ES_JAVA_OPTS% -Djava.security.policy=enable_aspectj_classes.policy -javaagent:aspectjweaver-1.8.9.jar
  • With the above change we are passing two additional command line parameters while starting ES:
    • javaagent: Using the aspectj weaver as java agent.
    • java.security.policy: ES 5.x runs with the Java security manager, so the aspect weaver won't work without extra permissions. Using this parameter we ask the security manager to grant the required permissions. The content of enable_aspectj_classes.policy is as follows:
                    grant {
                                      permission java.security.AllPermission;
                                    };
          • We are granting all permissions to the application. It is recommended that you work out the granular permissions needed by enabling access-level logging and then update the above file accordingly.
          • The file is placed in elasticsearch/bin folder.
      The logs from weaver won't be included in the ES logs. So you can redirect the ES command line output (if not running as service/daemon) to a file as follows:
      elasticsearch.bat > c:\es5x.log 2>&1
       In the weaver output you should see log lines indicating that our pointcuts & advices have been woven.

      Populating test data
       • We will create two indices, aop_test_01 & aop_test_02, each having a file type with documents like the following:
              {"name": "file_01", "size": 5678}
      • Populate as many entries as you want.
      • Make sure both indices have some data

      Querying the data
       Let's issue a simple search spanning both the indices:
      http://localhost:9200/aop_test_*/_search
       The log output shows that our advice is getting executed:
      • Since we are querying multiple indices resulting in multiple shards query we see both executeQueryPhase() & executeFetchPhase() advice getting called.
      • In executeQueryPhase() we are just logging the indices. We can log additional info like the query to be executed etc.
      • In the executeFetchPhase() we are also logging the docIds which the search returns.
      • You can read more on how search gets executed here.

      Summary
       With this simple POC we have hooked our custom code into ES, which allows us to customize ES behavior as needed. Currently we have just logged the input requests, but we could do more, like changing the input/output.


      Wednesday, December 9, 2015

      AngularJS: Getting started

      Introduction

      AngularJS is one of the most popular frameworks for web UI development. Following are some major features of the framework:
      • Makes DOM manipulation easy.
      • Segregates back-end (JavaScript) from UI (html).
      • Provides dependency injection.
      • Easy to develop Single Page Application (SPA) using the routing support.
       Instead of directly jumping into AngularJS features like modules, controllers, services etc., let's try to build a very basic UI using AngularJS and understand how it works.

      Sample UI - Traditional way

       Let's try to develop the following UI using the traditional approach and then using AngularJS.

      This is very basic UI displaying the administrator details like name and email address. Then there is a checkbox to toggle display of some advanced options. If checked it will show advanced options as follows:

      HTML markups

      The HTML snippet for the UI will be as follows:

       <div>
         <h2>Administrator</h2>
         <label><b>Name: </b></label><span id="lblAdminName"></span>
         <label><b>Email: </b></label><span id="lblAdminEmail"></span>
         <input id="chkAdvanceOptions" 
                type="checkbox" 
                onclick="toggleShowAdvanceOptions();">
            <label>Show Advance Options</label>
         <div id="divAdvanceOptions" style="display: none">
            <b>Advance options here...</b>
         </div>
       </div>

      Populating administrator details

      The UI can be populated using following JavaScript:

       function initializeSettings() {
          var administrator = getAdministratorDetails();
          document.getElementById('lblAdminName').innerHTML = administrator.name;
          document.getElementById('lblAdminEmail').innerHTML = administrator.email;
       }

       function toggleShowAdvanceOptions() {
          var displayAdvanceOptions = document.getElementById('chkAdvanceOptions').checked;
          document.getElementById('divAdvanceOptions').style.display
                                    = (displayAdvanceOptions ? 'block' : 'none');
       }

       function getAdministratorDetails() {
          // This will come from a REST call.
          return { name: 'Administrator', email: 'administrator@company.com' };
       }


       Now, the above UI will be populated using JavaScript as follows:
      • Fetch the administrator details from server, say, using REST calls.
      • To populate the administrator name get hold of the <span> element using the id i.e. lblAdminName e.g. document.getElementById("lblAdminName") or using jQuery.
      • Set the value on the DOM element. Same will be the case for displaying the email address.

      Checkbox handler

      While fetching and populating the administrator details seems trivial, toggling the advance options required some more work:
      • Associate a JavaScript function handler for checkbox click. In this example it is toggleShowAdvanceOptions();
      • In this handler find out if the checkbox is checked or not. Depending on this, toggle the visibility of the div displaying advance options.


      Issues with traditional approach

       One may feel that with the above approach the UI and backend are segregated, i.e. the UI is in the *.html or *.jsp file with HTML markup and the backend is in the *.js i.e. JavaScript file. But segregation does not mean just separating these out. Let's try to answer a simple question: who is responsible for how the UI looks and behaves? Obviously, *.html defines the UI in terms of look and feel while *.js provides the required data. But wait, the *.js file does not just provide the data, it actually pushes the data to the UI. OK, then who is responsible for toggling the display of the advance options? Hmmm, *.html hooks the checked event to the backend and then the backend toggles the visibility.
       To summarize, the responsibility for rendering the UI lies with both the *.html and *.js files. The reality is that we cannot avoid this situation, i.e. dynamic content needs to be handled by JavaScript. So how does AngularJS help here? In simple terms, it allows us to define the UI and its behavior in the *.html files, while the JavaScript files are responsible only for fetching data or reacting to events. The JavaScript never deals with UI elements directly, i.e. no more document.getElementById() or jQuery calls in the backend. Let's see how this is possible.

      Sample UI - AngularJS way

      With AngularJS we are not doing away with *.html or *.js files. Instead of getting directly into AngularJS concepts, let's first see how these files are defined for AngularJS based web application.

      JavaScript object

      Let's start with the JavaScript object that will be responsible for populating the data on UI.

       function SettingsController () {
        // Get this from REST service
        this.administrator = {
          name: 'Administrator',
          email: 'administrator@company.com'
       };

       this.showAdvanceOptions = false;
      }

      As seen above the JavaScript object simply has two properties:
      • administrator: This property will hold the json describing administrator name and email. It should be populated using REST call. For this example, we have hard-coded it.
      • showAdvanceOptions: This property tells if the advance options should be displayed or not. The default value is false.
      That's it. No getElementById() or any DOM manipulation code here. 
      Now, let's look at the HTML.

      HTML markups

      The HTML snippet for the UI will be as follows:

      1: <div ng-controller="myapp.settings.controller as ctrl">
      2:  <h2>Administrator</h2>
      3:  <label><b>Name: </b></label><span>{{ctrl.administrator.name}}</span>
      4:  <label><b>Email: </b></label><span>{{ctrl.administrator.email}}</span>
      5:    <input type="checkbox" ng-model="ctrl.showAdvanceOptions">
      6:       <label>Show Advance Options ({{ctrl.showAdvanceOptions}})</label>
      7:    <div ng-show="ctrl.showAdvanceOptions">
      8:       <b>Advance options here...</b>
      9:    </div>
      10:</div>

       The markup is the same as in the traditional approach except for some additional attributes. But in this case we are defining how the UI is rendered and behaves in the markup itself. Let's go through the markup one by one.
      • Line 1: 
        • On the <div> tag we are specifying which JavaScript object will be responsible for providing the data to this UI.
        • ng-controller is AngularJS specific attribute which defines the JavaScript object to be used.
        • For this discussion, assume that we have a map of name to JavaScript object. Hence, we are asking AngularJS to lookup that map with the name as 'myapp.settings.controller' and attach the JavaScript object to this UI. For now, just ignore how we are registering the object with AngularJS.
        • Also, we are using an alias 'ctrl' for that object.
      • Line 3: 
        • Here, we have used {{ctrl.administrator.name}}
        • This tells AngularJS that on the JavaScript object attached to this UI, we need to look for a property named administrator. Since, administrator value is a json we are further referring to the name property in it. 
        • The value for the administrator.name property will get populated on UI.
        • If the value changes in backend, then it will also get updated on UI.
        • Same is the case for {{ctrl.administrator.email}}
      • Line 5:
        • Here, on the checkbox, we have used the ng-model attribute.
        • The value for this attribute is the property on our JavaScript object.
        • It tells AngularJS that the value for this checkbox should be taken from the showAdvanceOptions property on the JavaScript object.
        • In addition, it tells AngularJS that when the checkbox value changes from the UI, say on a user click, the value on the JavaScript object should also be updated. This is called two-way binding in AngularJS.
      • Line 6:
        • Here we are displaying the value of the showAdvanceOptions property of the JavaScript object.
        • When the checkbox is checked then the value becomes true else false.
      • Line 7:
        • Here we are controlling the visibility of the div using the ng-show attribute of AngularJS.
        • It tells AngularJS that the div should be visible when the showAdvanceOptions property on the JavaScript object is true else it should be hidden.
        • Now, the showAdvanceOptions property is true/false depending on the state of the checkbox and hence the div will be visible if checkbox is checked else it will be hidden.

      Summary

       As seen above, with AngularJS the HTML markup defines what data needs to be rendered and how. The JavaScript object, called the controller, only needs to provide the required data to the UI. A lot happens internally within AngularJS.
       In order to implement even a simple UI using AngularJS one needs to understand controllers, services, modules etc. But this post should help explain the very basics of how a controller can be used to provide data to the UI and how the HTML markup drives the UI flow.
       In the next post we will build a simple MVC application using AngularJS and explore the controllers, services, modules etc.

      Tuesday, December 1, 2015

      Elasticsearch: What you see is NOT what you get !!!

       Recently, I troubleshot two common issues with ES queries. Queries using certain filter criteria suddenly stopped returning data, even though, looking at all the records for the type, the filter should have evaluated to true and returned data. We often forget that ES returns the original document that was indexed, but while indexing, analyzers are applied to the fields of the document. So the field value we see in the document is not exactly the same as the indexed value.

      Consider the following simple example:
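      The original post showed this as a screenshot; an equivalent pair of requests (index and type names are illustrative) would be:

        PUT /employees/employee/1
        {
          "displayName": "Ajey Dudhe"
        }

        GET /employees/employee/_search
        {
          "query": { "match_all": {} }
        }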

      The first request indexes an employee record with displayName as "Ajey Dudhe". The second request retrieves all the employee records.

      In the search response, under _source, the document is returned exactly as it was indexed. But does Elasticsearch see the value of displayName as "Ajey Dudhe"? The answer is no. In this example the default analyzers are used when the document is indexed: by default the string value is broken into tokens and lower-cased, and that is the value Elasticsearch actually sees. To have a look at the value visible to Elasticsearch we can use the fielddata_fields option while fetching the documents, as follows:
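      Again the post showed this as a screenshot; the request is essentially the same search with fielddata_fields added (field and index names as in the example above):

        GET /employees/employee/_search
        {
          "query": { "match_all": {} },
          "fielddata_fields": ["displayName"]
        }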

       In the response, notice that the fields section returned for fielddata_fields now has the values as seen by ES, i.e. tokenized and lower-cased ("ajey", "dudhe"). If the value is not indexed by ES at all, it will not appear under fields.

      Wednesday, April 22, 2015

      Elasticsearch: Taming the fielddata

      Introduction

       One of the bottlenecks in scaling up Elasticsearch is fielddata. The term fielddata does not refer directly to the data itself but to the data structures & caching done by Elasticsearch for efficient look-ups. Apart from consuming a considerable chunk of memory, fielddata also impacts queries, and the impact is clearly visible when you have millions of documents to query. One of the recommended solutions is to scale out. But even with multiple nodes, you need to make sure every single node is fine-tuned, otherwise you may have to keep adding a considerable number of nodes as the data set grows. In the following sections we discuss how to optimize fielddata usage which, in turn, should improve memory usage & query performance on a single node.

       Use doc_values = true

       The main issue with fielddata is that it consumes huge amounts of memory. Setting doc_values = true on a field in the mapping tells Elasticsearch to keep the fielddata on the file system instead of in memory. However, there is currently one limitation for string fields: only not_analyzed string fields can have this option. This means you cannot use doc_values = true on fields that have analyzers, like lower-casing, defined on them. That becomes a problem in some cases, e.g. if you need to provide case-insensitive search on a field. The usual suggested approach is a multi-fields mapping, i.e. one analyzed and one not_analyzed field. But the moment we have an analyzed field we cannot use doc_values = true on it, which means its fielddata will live in memory. To handle this we need to use a transform script: in the transform script you do the operations like converting the field to lower-case. If you need a full analyzer, find out how to invoke the analyzers explicitly from your transform script.
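      For reference, a mapping along these lines (ES 1.x syntax; index, type and field names are illustrative) enables doc values on a not_analyzed string field:

        PUT /employees
        {
          "mappings": {
            "employee": {
              "properties": {
                "displayName": {
                  "type": "string",
                  "index": "not_analyzed",
                  "doc_values": true
                }
              }
            }
          }
        }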
       While the documentation says that doc_values improve memory usage but can impact performance, in our testing we observed that aggregation query performance actually increased drastically. Maybe that was because the fielddata was now being served from the file system cache.

      Handle sorting

       Sorting is another feature where fielddata is used. When a sort is specified on a particular field, Elasticsearch needs to find the terms for each document, i.e. it needs fielddata. To avoid this we can use script-based sorting. The script can take the field name as a parameter and return the field value from the _source. Even though there is overhead in accessing the source, we observed around a 50% improvement using script-based sorting.
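      A sketch of such a sort clause, placed inside the search request body (ES 1.x script-sort syntax; the field path is illustrative):

        "sort": {
          "_script": {
            "type": "string",
            "script": "_source['file']['name']",
            "order": "asc"
          }
        }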


      Aggregation

       Aggregation is the main feature where fielddata is required. Elasticsearch provides the Scripted Metric Aggregation, but using a script did not help here. It is best to avoid aggregation queries when they are not required; for example, use a script filter if you only need to query for distinct documents.

      Friday, January 30, 2015

      Elasticsearch: Get distinct\unique values using search instead of aggregation - Part 2

       The previous post described how a custom script filter can be used to filter out duplicate documents in the result set. However, that approach does not work with multiple shards. To handle this case we need a plugin implementation to hook into Elasticsearch, where we can filter out duplicate documents fetched from multiple shards. But before that, let's have a quick look at how distributed search works.

      Distributed search

      Elasticsearch documentation covers distributed search very well. Following is the summary:
      • Distributed search is executed in two phases: Query and Fetch phase
      • Query phase
        • The coordinating node sends the search request to all the shards.
        • Each shard executes the search and sends back list of document IDs and the sort values back to the coordinating node.
          • For example, get documents where file.folder = 'c:\windows' sorted by file.name
          • In above case, say, shard 1 matches three documents with file names as notepad.exe, textpad.exe, wordpad.exe. So the return values would be
            • DocID: some_doc_id_1, Sort values: [notepad.exe]
            • DocID: some_doc_id_2, Sort values: [textpad.exe]
            • DocID: some_doc_id_3, Sort values: [wordpad.exe]
        • All shards return similar data back to the coordinating node.
        • The coordinating node now has list of fields on which sorting needs to be done and also the actual values for these fields per document.
        • This node then merges all the results into a single list.
        • The list is sorted using the sort values returned by each shard for each document.
        • The final list is prepared by applying pagination (from & size) parameters.
      • Fetch phase
        • Once the final list is prepared the coordinating node then sends the request to get the required documents from each shard using the document IDs i.e. send a multi-get request.

      Remove duplicate documents in distributed search

       As seen above, if we need to filter out duplicate documents in a distributed search then we need to hook into the QUERY phase. Once the document IDs & sort values are retrieved from all the shards, we need to remove the duplicate documents there. The steps to achieve this are as follows:
       • Apart from document IDs & sort values we also need the values of the primary keys that identify a document as unique in the result set.
       • Once we have this information we can remove the duplicate documents using the primary keys.

      Passing back the primary key values in query phase

       In the QUERY phase only a document's ID and sort values are passed back; the _source is not. So the only option for passing back the primary key values is to use the sort values. For this we can use a custom script for sorting. This custom script will be similar to the custom filter mentioned in the previous post (a rough sketch of the resulting sort clause follows the list below).
      • The script will take primary key field names as parameter.
      • The script will form a key by concatenating the values of primary key fields.
      • In the search request we will add this script based sort criteria at the end of existing sort fields, if any.
      • This way, for each document, we will get the key value during the QUERY phase itself on the coordinating node.
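      As a rough illustration (ES 1.x script sort with Groovy-style field access; the field names are made up), the extra criterion appended after the existing sort fields could look like:

        "sort": [
          { "file.name": "asc" },
          {
            "_script": {
              "type": "string",
              "script": "_source['file']['folder'] + '|' + _source['file']['name']",
              "order": "asc"
            }
          }
        ]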

      Removing duplicate documents

       Now we need to hook into the QUERY phase using the plugin approach. Thanks to the reference plugin implementation here. By looking at this sample plugin I was able to deduce that the final sorting of docs from all shards happens in the sortDocs() method of org.elasticsearch.search.controller.SearchPhaseController. This class is used deep inside ES and there is no direct way to hook into it, so I had to follow the same approach as the reference plugin implementation, i.e. implement a custom action. In the plugin, the required classes and methods were extended/overridden. Finally, the class extending SearchPhaseController overrides the sortDocs() method; in it we remove the duplicate documents from across the shards and then call the original sortDocs() method.

      Source code

      You can start with the junit tests.

      Limitations

       • This approach extends ES classes and methods which may change in the future.
       • One major limitation is that the totalHits returned by the search request will not be accurate when pagination is used. This is because each shard returns totalHits as the total number of documents matching the filter, while the actual documents returned can be fewer when pagination is used, e.g. the total matches could be 100 while the pagination criteria ask for only 10 results. In this case totalHits will be 100, and we do not know whether the remaining 90 documents are duplicates or not. This should not be an issue if totalHits can be treated as an approximate count.