An Introduction to Elasticsearch and Data Storage and Retrieval using .NET


Elasticsearch is a distributed, open-source search and analytics engine designed for scalability and performance. It belongs to the Elastic Stack, commonly referred to as ELK (Elasticsearch, Logstash, and Kibana), which is widely used for various data processing and visualization tasks.

Key Features of Elasticsearch:

  1. Distributed and Scalable:

    • Elasticsearch is inherently distributed, allowing it to scale horizontally across multiple nodes.

    • This distributed nature ensures high availability and fault tolerance.

  2. Real-time Data Processing:

    • Elasticsearch provides near real-time indexing and search capabilities.

    • Changes to the data are reflected in search results almost instantly.

  3. Full-text Search:

    • Originally designed for full-text search, Elasticsearch excels in matching and ranking textual data.

  4. Structured and Unstructured Data:

    • Elasticsearch can handle both structured and unstructured data.

    • It supports complex data structures with nested fields.

  5. RESTful API:

    • Interaction with Elasticsearch is performed through a RESTful API, making it accessible and easy to integrate with various programming languages.

  6. Aggregation and Analytics:

    • Elasticsearch includes powerful aggregation capabilities for summarizing and analyzing data.

Comparison with Relational Databases:

  1. Schema Flexibility:

    • Relational Databases (RDBMS): Require a predefined schema with fixed tables and columns.

    • Elasticsearch: Schema-free, allowing for dynamic mapping of fields. This flexibility is particularly beneficial for handling diverse and evolving data.

  2. Query Language:

    • RDBMS: Typically use SQL for querying.

    • Elasticsearch: Utilizes a JSON-based query language that is more flexible and expressive, especially for complex searches and aggregations.

  3. Scalability:

    • RDBMS: Vertical scaling (adding more resources to a single server) is common.

    • Elasticsearch: Horizontal scaling (adding more nodes to a cluster) is the norm, providing better scalability for large datasets and high query loads.

Comparison with Non-Relational Databases:

  1. Data Model:

    • Non-Relational Databases (NoSQL): Vary widely in terms of data models, including document, key-value, column-family, and graph databases.

    • Elasticsearch: Primarily a document-oriented database, storing JSON documents.

  2. Querying:

    • NoSQL: Querying mechanisms differ significantly between different types of NoSQL databases.

    • Elasticsearch: Specializes in full-text search and complex queries, making it particularly well-suited for scenarios where efficient and expressive search functionalities are critical. Its distributed architecture and extensive ecosystem contribute to its effectiveness in handling large datasets and diverse query requirements.

  3. Indexes and Sharding:

    • NoSQL: Indexing and sharding strategies vary.

    • Elasticsearch: Provides customizable indexing and sharding options, enabling efficient data distribution and retrieval.

Use Cases for Elasticsearch:

  1. Log and Event Data Analysis:

    • Elasticsearch is widely used for log and event data analysis, offering fast and efficient searches across vast amounts of log data.

  2. Search Engines:

    • Its roots in full-text search make Elasticsearch a popular choice for building search engines.

  3. Business Intelligence and Analytics:

    • Elasticsearch's aggregation capabilities make it suitable for business intelligence and analytics applications.

  4. Monitoring and Alerting:

    • It is commonly used for real-time monitoring and alerting systems.

Understanding Elasticsearch

Key Concepts:

  1. Index:

    • An index in Elasticsearch is similar to a database in a relational database management system (RDBMS). It is a collection of documents that share a common purpose.

    • Each document within an index represents a JSON object with key-value pairs.

  2. Document:

    • A document is the basic unit of data in Elasticsearch and is represented in JSON format.

    • It contains one or more fields, each with its own data type.

  3. Mapping:

    • Mapping defines the data type and other properties of each field in a document.

    • It is akin to a schema in a relational database.

Basic Elasticsearch Queries

Simple Query:

Let's consider a scenario where we have an index named "employees," and each document represents an employee:

{
  "id": 1,
  "name": "John Doe",
  "age": 35,
  "certifications": ["Java", "AWS"],
  "reportees": [2, 3]
}

To retrieve all documents:

GET /employees/_search
{
  "query": {
    "match_all": {}
  }
}

  • GET /employees/_search:
    This is an HTTP GET request sent to Elasticsearch, searching in the index named employees.

  • "query": { "match_all": {} }:
    This is the core of the request: a match_all query. It matches all documents in the specified index (employees here) without any filtering or condition.

Search by Field:

To search for employees of age 40 or above, but not more than 50:

GET /employees/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 40,
        "lte": 50
      }
    }
  }
}

  • The range query finds documents where the value of the specified field falls within a given numerical or date range.

  • Here, the field is age.

  • "gte": 40 means "greater than or equal to 40".

  • "lte": 50 means "less than or equal to 50".

  • So, it matches all documents where the age field's value is between 40 and 50, inclusive.

  • You should use a range query when you want to find documents where a field's value falls within a specific numeric, date, or IP range.

Elasticsearch Queries in .NET C#

Now, let's implement these queries in .NET C# using the NEST library. A newer NuGet package, Elastic.Clients.Elasticsearch, is also available, but we ran into compatibility issues and breaking changes with it, so we are sticking with NEST for the moment.

Setting Up NEST:

First, install the NEST NuGet package in your .NET project:

dotnet add package NEST

Retrieving All Documents:

using Nest;

var settings = new ConnectionSettings(new Uri("http://your-elasticsearch-server:9200"))
    .DefaultIndex("employees");

var client = new ElasticClient(settings);

var searchResponse = await client.SearchAsync<Employee>(s => s
    .Query(q => q.MatchAll())
);

// Process searchResponse as needed
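
The snippets in this article reference an Employee type without showing it. A minimal sketch of what it might look like follows. The property names are inferred from the queries used throughout the article, while the property types and the Skill class are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical nested object, inferred from the Skills queries in this article.
public class Skill
{
    public string Name { get; set; }   // e.g. "Java"
    public string Level { get; set; }  // e.g. "Expert"
}

// Hypothetical document model; the types are assumptions, not taken from the article.
public class Employee
{
    public int Id { get; set; }
    public string EmployeeId { get; set; }  // used by the update and delete examples
    public string Name { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int Age { get; set; }
    public string Department { get; set; }
    public List<string> Certifications { get; set; } = new List<string>();
    public List<int> Reportees { get; set; } = new List<int>();
    public List<Skill> Skills { get; set; } = new List<Skill>();
}
```

By default NEST camel-cases property names when it serializes documents, so Age is stored as age, which matches the JSON document shown earlier.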

Retrieving Employees aged 40 and above:

var searchResponse = await client.SearchAsync<Employee>(s => s
    .Query(q => q
        .Range(r => r
            .Field(f => f.Age)
            .GreaterThanOrEquals(40)
        )
    )
);

// Process searchResponse as needed
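
As a rough sketch of what "process searchResponse" could look like, the program below runs the 40-to-50 range query from the JSON example earlier and prints the results. The localhost URL and the trimmed-down Employee class are assumptions for illustration only.

```csharp
using System;
using Nest;

// Assumed local node and index name; adjust for your cluster.
var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
    .DefaultIndex("employees");
var client = new ElasticClient(settings);

// Same shape as the JSON range query shown earlier: 40 <= age <= 50.
var searchResponse = await client.SearchAsync<Employee>(s => s
    .Query(q => q
        .Range(r => r
            .Field(f => f.Age)
            .GreaterThanOrEquals(40)
            .LessThanOrEquals(50))));

if (!searchResponse.IsValid)
{
    // DebugInformation describes the request, the response, and any error.
    Console.WriteLine(searchResponse.DebugInformation);
    return;
}

Console.WriteLine($"Total hits: {searchResponse.Total}");
foreach (var employee in searchResponse.Documents)
{
    Console.WriteLine($"{employee.Name} ({employee.Age})");
}

// Minimal hypothetical model, just enough for this sketch.
public class Employee
{
    public string Name { get; set; }
    public int Age { get; set; }
}
```

Note that Documents holds only the first page of hits (ten unless Size is set), while Total reports the overall match count.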

Utilizing Search Descriptor

A SearchDescriptor in NEST lets you build a search request programmatically and pass it around as an object, which gives more flexibility when composing complex queries.

Example Using NEST:

Consider the following C# code using NEST to retrieve employees over the age of 40:

var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Range(r => r
            .Field(f => f.Age)
            .GreaterThanOrEquals(41)
        )
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);

Selective Field Retrieval

Limiting Fields in the Response:

To retrieve only specific fields, use the Source method in the Search Descriptor:

var searchDescriptor = new SearchDescriptor<Employee>()
    .Source(src => src
        .Includes(fields => fields
            .Field(f => f.Id)
            .Field(f => f.Name)
            .Field(f => f.Age)
            .Field(f => f.Reportees)
        )
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);

Handling Large Result Sets with Scroll

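Elasticsearch limits the number of results returned in a single request: a search returns 10 hits by default, and from + size paging is capped at 10,000 hits by the index.max_result_window setting unless that setting is raised. The Scroll API can be used to page through larger result sets in batches.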

Example Code with Scroll:

Here's an example of using the Scroll API to retrieve all employees:

var searchDescriptor = new SearchDescriptor<Employee>()
    .Size(1000) // Batch size
    .Scroll("5m"); // Scroll time

var allEmployees = new List<Employee>();

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);

do
{
    allEmployees.AddRange(searchResponse.Documents);

    var scrollRequest = new ScrollRequest(searchResponse.ScrollId, "5m");
    searchResponse = await client.ScrollAsync<Employee>(scrollRequest);

} while (searchResponse.IsValid && searchResponse.Documents.Any());

await client.ClearScrollAsync(new ClearScrollRequest(searchResponse.ScrollId));

Here, we set Size and Scroll on the searchDescriptor. The do-while loop then reads the current batch from searchResponse.Documents on each iteration and issues a new scrollRequest to fetch the next batch. The loop runs until a batch comes back empty. At the end, we release the scroll context by calling client.ClearScrollAsync.

Enhancing Elasticsearch Queries

Querying by Multiple Conditions:

Sometimes, you may need to combine conditions for more refined queries. For instance, to find employees aged between 30 and 40 who hold at least one of the listed certifications (a terms query matches a document if any of the given values is present):

var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Bool(b => b
            .Must(
                // Age between 30 and 40, inclusive
                m => m.Range(r => r
                    .Field(f => f.Age)
                    .GreaterThanOrEquals(30)
                    .LessThanOrEquals(40)),
                // Holds "Java" or "AWS" (any one value is enough)
                m => m.Terms(t => t
                    .Field(f => f.Certifications)
                    .Terms("Java", "AWS"))
            )
        )
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);

Another example is below, where we search for all employees in the IT department who hold a Java or AWS certification. Here, instead of calling Bool directly, we combine the queries with the && operator, which NEST translates into a bool query whose must clauses all have to match. Term and terms queries compare exact, unanalyzed values, so these examples assume Department and Certifications are mapped as keyword fields.

var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q =>
        q.Term(t => t
            .Field(f => f.Department)
            .Value("IT")) &&
        q.Terms(t => t
            .Field(f => f.Certifications)
            .Terms("Java", "AWS"))
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);

Querying by Conditions on Nested fields:

Sometimes, you may need to add conditions on nested fields. Below, we find all employees who have a skill whose Name is "Java" and whose Level is "Expert". This assumes Skills is mapped as a nested field, so that both conditions must hold on the same skill entry.

var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Nested(nst => nst
            .Path(p => p.Skills)
            // The inner lambda parameter needs its own name (nq); reusing q would not compile
            .Query(nq =>
                nq.Term(t => t
                    .Field(f => f.Skills.First().Name)
                    .Value("Java")) &&
                nq.Term(t => t
                    .Field(f => f.Skills.First().Level)
                    .Value("Expert"))
            )
        )
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
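
The nested query above only behaves as described if Skills is mapped as a nested field, and the term-level queries expect keyword fields. The following is a minimal sketch of creating an index with such a mapping through NEST. The index name, the localhost URL, and the choice to map Name and Level as keyword are assumptions, not something the article specifies.

```csharp
using System;
using System.Collections.Generic;
using Nest;

// Assumed local node; adjust for your cluster.
var client = new ElasticClient(new ConnectionSettings(new Uri("http://localhost:9200")));

var createResponse = await client.Indices.CreateAsync("employees", c => c
    .Map<Employee>(m => m
        .AutoMap()  // infer mappings for the remaining properties
        .Properties(p => p
            // keyword fields store exact values, which Term and Terms queries compare against
            .Keyword(k => k.Name(e => e.Department))
            .Keyword(k => k.Name(e => e.Certifications))
            // nested keeps each Skill entry as its own unit for Nested queries
            .Nested<Skill>(n => n
                .Name(e => e.Skills)
                .Properties(sp => sp
                    .Keyword(k => k.Name(s => s.Name))
                    .Keyword(k => k.Name(s => s.Level)))))));

Console.WriteLine(createResponse.IsValid
    ? "Index created."
    : createResponse.DebugInformation);

// Minimal hypothetical models, just enough for this sketch.
public class Skill
{
    public string Name { get; set; }
    public string Level { get; set; }
}

public class Employee
{
    public string Department { get; set; }
    public List<string> Certifications { get; set; }
    public List<Skill> Skills { get; set; }
}
```

The call fails if the index already exists, so a sketch like this would typically run once, before any documents are indexed.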

Fuzzy Search:

When dealing with potential typos or variations in the data, a fuzzy search can be useful. For instance, searching for employees with a name similar to "John":

var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Fuzzy(f => f
            .Field(p => p.Name)
            .Value("John")
            .Fuzziness(Fuzziness.Auto)
        )
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);

  • .Query(q => q.Fuzzy(...)): Builds a fuzzy query, which matches terms similar but not identical to "John".

  • .Field(p => p.Name): Specifies the document field to search, here Name.

  • .Value("John"): The search term to fuzzy match.

  • .Fuzziness(Fuzziness.Auto): Automatically determines the allowed edit distance based on the length of the search term:

    • Short terms (0 to 2 characters) require exact matches.

    • Medium terms (3 to 5 characters) allow 1 edit (insert, delete, substitute, or transpose two adjacent characters).

    • Longer terms (more than 5 characters) allow up to 2 edits.

The result is a search that tolerates spelling mistakes or minor differences from "John" in the Name field.
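
To make the length thresholds concrete, here is a small illustration of the rule described above. It is not Elasticsearch's implementation, and MaxEditsForAuto is a hypothetical helper name. It only mirrors the default AUTO thresholds of 3 and 6 characters.

```csharp
using System;

// Illustration only: mirrors the default AUTO thresholds (AUTO:3,6).
static int MaxEditsForAuto(string term)
{
    if (term.Length <= 2) return 0;  // 0 to 2 characters: must match exactly
    if (term.Length <= 5) return 1;  // 3 to 5 characters: one edit allowed
    return 2;                        // more than 5 characters: two edits allowed
}

Console.WriteLine(MaxEditsForAuto("Jo"));       // 0
Console.WriteLine(MaxEditsForAuto("John"));     // 1
Console.WriteLine(MaxEditsForAuto("Johnson"));  // 2
```

So with Fuzziness.Auto, a search for "John" tolerates one edit and would still match a stored "Jon" or "Johm".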

Suppose we want a fuzzy search on both first name and last name:

var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Bool(b => b
            .Should(
                sh => sh.Fuzzy(f => f.Field(p => p.FirstName).Value("John").Fuzziness(Fuzziness.Auto)),
                sh => sh.Fuzzy(f => f.Field(p => p.LastName).Value("Doe").Fuzziness(Fuzziness.Auto))
            )
        )
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);

We can also provide a score threshold so that only matches scoring at or above it are returned. The right value can only be found by trial and error, because scores vary between data sets and, within the same data set, between different field combinations.

For example, say you start with a threshold of 10.0 but see a lot of unnecessary matches, so you raise it to 50.0, and then no matches show up at all. You settle on 25.0, which returns only close matches.

var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Match(m => m
            .Field(f => f.Name)
            .Query("John")
            .Fuzziness(Fuzziness.Auto)
        )
    )
    .MinScore(25.0);  //match score should be at least 25.0 (decided after some trial and error)

Aggregations for Data Insights

Terms Aggregation:

Aggregations in Elasticsearch allow you to analyze and summarize data. A terms aggregation, for example, can help find the distribution of certifications among employees:

var searchDescriptor = new SearchDescriptor<Employee>()
    .Aggregations(a => a
        .Terms("certifications", t => t.Field(f => f.Certifications))
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
var certificationsAggregation = searchResponse.Aggregations.Terms("certifications");

foreach (var bucket in certificationsAggregation.Buckets)
{
    Console.WriteLine($"Certification: {bucket.Key}, Count: {bucket.DocCount}");
}

Sorting Results

Sorting by Field:

Sorting results based on a specific field, such as sorting employees by age in descending order:

var searchDescriptor = new SearchDescriptor<Employee>()
    .Sort(s => s.Descending(p => p.Age));

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);

Boosting Relevance

Boosting:

Boosting allows you to influence the relevance score of documents. For example, boosting employees with the "AWS" certification:

var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Bool(b => b
            .Should(
                // Boost is set on the term query descriptor itself
                s => s.Term(t => t.Field(f => f.Certifications).Value("AWS").Boost(2)),
                s => s.MatchAll()
            )
        )
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);

  • .Boost(2) multiplies the score contributed by the Term query on the Certifications field (value "AWS") by a factor of 2.

  • Documents with the "AWS" certification therefore score higher than, and rank above, documents matched only by the other clause (MatchAll()).

  • Boost is a way to weight or prioritize certain query clauses over others within a composed query.

Handling Textual Searches

For more complex textual searches, Elasticsearch provides a powerful full-text search capability. For example, searching for employees with a name containing "John" or "Doe":

var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Match(m => m
            .Field(f => f.Name)
            .Query("John Doe")
        )
    );

var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
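
A match query combines the analyzed terms with OR by default, which is why the query above matches a name containing either "John" or "Doe". As a sketch of the stricter variant, setting the operator to And requires both terms to be present. The localhost URL and the minimal Employee class are assumptions for illustration.

```csharp
using System;
using Nest;

// Assumed local node and index name; adjust for your cluster.
var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
    .DefaultIndex("employees");
var client = new ElasticClient(settings);

var searchResponse = await client.SearchAsync<Employee>(s => s
    .Query(q => q
        .Match(m => m
            .Field(f => f.Name)
            .Query("John Doe")
            // Require every term ("john" AND "doe") instead of the default OR
            .Operator(Operator.And))));

Console.WriteLine(searchResponse.IsValid
    ? $"Matches: {searchResponse.Total}"
    : searchResponse.DebugInformation);

// Minimal hypothetical model, just enough for this sketch.
public class Employee
{
    public string Name { get; set; }
}
```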

Adding Data

Say we want to add a new employee to the index. We call the IndexDocumentAsync() function of ElasticClient, which writes the document to the default index configured on the client.

public async Task AddAsync(Employee employee)
{
        //Add the new employee in the employee index
        var indexResponse = await client.IndexDocumentAsync(employee);

        if (indexResponse.IsValid)
        {
            Console.WriteLine("Document indexed successfully.");
        }
        else
        {
            Console.WriteLine($"Error indexing document: {indexResponse.DebugInformation}");
        }
}

Updating Data

Updating employee object

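Say we read an employee from the index, update some data, and pass it back to the index. We call the UpdateAsync() function of ElasticClient with the full Employee object as the partial document. Elasticsearch merges the fields sent in Doc into the stored document, so every field present in the request is overwritten with its new value, and any field not included in the request keeps its stored value.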

public async Task UpdateAsync(Employee employee, CancellationToken cancellationToken)
{
        //Update the employee in the employee index
        var response = await client.UpdateAsync(
                        new DocumentPath<Employee>(employee),
                        u => u.Doc(employee).Index("employee_index"),
                        cancellationToken);

        if (response.IsValid)
        {
            Console.WriteLine("Document updated successfully.");
        }
        else
        {
            throw response.OriginalException;
        }
}
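
Because Doc is merged into the stored document, the same call can also send just the fields that changed. The following is a minimal sketch under assumptions: the EmployeeAgePatch class, the document id "1", and the localhost URL are all hypothetical.

```csharp
using System;
using Nest;

// Assumed local node; adjust for your cluster.
var client = new ElasticClient(new ConnectionSettings(new Uri("http://localhost:9200")));

// Only Age is sent; every other field of the stored document is left as-is.
var response = await client.UpdateAsync<Employee, EmployeeAgePatch>(
    DocumentPath<Employee>.Id("1"),  // hypothetical document id
    u => u
        .Index("employee_index")
        .Doc(new EmployeeAgePatch { Age = 36 }));

Console.WriteLine(response.IsValid
    ? $"Update result: {response.Result}"
    : response.DebugInformation);

// Minimal hypothetical models, just enough for this sketch.
public class Employee
{
    public string Name { get; set; }
    public int Age { get; set; }
}

// Partial document that carries only the field being changed.
public class EmployeeAgePatch
{
    public int Age { get; set; }
}
```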

Updating a specific property in Employee object

Say we read an employee from the index and change only the Skills data, and we want only Skills to be updated in the index. We can call the UpdateByQueryAsync() function combined with the Script() function to update just that property, so the rest of the document is not touched. We need to provide a search criterion, here the employee id, to ensure that only the correct employee's document is updated.

public async Task UpdateSkillsAsync(Employee employee, CancellationToken cancellationToken)
{
        //Update the skills property of the employee in the employee index
        var response = await client.UpdateByQueryAsync<Employee>( u => u
                          .Query(q => q
                            .Term(t => t
                               .Field(f => f.EmployeeId)
                                 .Value(employee.EmployeeId)))
                          .Script(s => s
                            //NEST camel-cases field names by default, so the stored field is assumed to be "skills"
                            .Inline("ctx._source.skills = params.value")  //Use Source instead of Inline for newer NEST nuget versions
                              .Params(p => p.Add("value", employee.Skills)))
                          .Refresh()
                          .Index("employee_index"), cancellationToken);

        if (response.IsValid)
        {
            Console.WriteLine("Document updated successfully.");
        }
        else
        {
            throw response.OriginalException;
        }
}

Deleting Data

You can delete an employee document by calling DeleteAsync:

public async Task DeleteAsync(string employeeId, CancellationToken cancellationToken)
{
        //Delete the employee in the employee index
        var response = await client.DeleteAsync<Employee>( employeeId,
                            idx => idx.Index("employee_index"), 
                            cancellationToken);
        if (response.IsValid)
        {
            Console.WriteLine("Document deleted successfully.");
        }
        else
        {
            throw response.OriginalException;
        }
}

Conclusion: Mastering Elasticsearch Data Retrieval and Storage

Elasticsearch, when coupled with .NET and the NEST library, empowers developers to handle a wide array of scenarios for data retrieval and storage. From combining conditions in queries to leveraging aggregations for insights, sorting, boosting relevance, and performing full-text searches, Elasticsearch accommodates various use cases seamlessly.