An Introduction to Elasticsearch and Data Storage and Retrieval using .NET

Elasticsearch is a distributed, open-source search and analytics engine designed for scalability and performance. It belongs to the Elastic Stack, commonly referred to as ELK (Elasticsearch, Logstash, and Kibana), which is widely used for various data processing and visualization tasks.
Key Features of Elasticsearch:
Distributed and Scalable:
Elasticsearch is inherently distributed, allowing it to scale horizontally across multiple nodes.
This distributed nature ensures high availability and fault tolerance.
Real-time Data Processing:
Elasticsearch provides near real-time indexing and search capabilities.
Changes to the data become visible in search results almost instantly, typically within about one second under the default refresh interval.
Full-text Search:
- Originally designed for full-text search, Elasticsearch excels in matching and ranking textual data.
Structured and Unstructured Data:
Elasticsearch can handle both structured and unstructured data.
It supports complex data structures with nested fields.
RESTful API:
- Interaction with Elasticsearch is performed through a RESTful API, making it accessible and easy to integrate with various programming languages.
Aggregation and Analytics:
- Elasticsearch includes powerful aggregation capabilities for summarizing and analyzing data.
Comparison with Relational Databases:
Schema Flexibility:
Relational Databases (RDBMS): Require a predefined schema with fixed tables and columns.
Elasticsearch: Schema-free, allowing for dynamic mapping of fields. This flexibility is particularly beneficial for handling diverse and evolving data.
Query Language:
RDBMS: Typically use SQL for querying.
Elasticsearch: Utilizes a JSON-based query language that is more flexible and expressive, especially for complex searches and aggregations.
Scalability:
RDBMS: Vertical scaling (adding more resources to a single server) is common.
Elasticsearch: Horizontal scaling (adding more nodes to a cluster) is the norm, providing better scalability for large datasets and high query loads.
Comparison with Non-Relational Databases:
Data Model:
Non-Relational Databases (NoSQL): Vary widely in terms of data models, including document, key-value, column-family, and graph databases.
Elasticsearch: Primarily a document-oriented database, storing JSON documents.
Querying:
NoSQL: Querying mechanisms differ significantly between different types of NoSQL databases.
Elasticsearch: Specializes in full-text search and complex queries, making it particularly well-suited for scenarios where efficient and expressive search functionalities are critical. Its distributed architecture and extensive ecosystem contribute to its effectiveness in handling large datasets and diverse query requirements.
Indexes and Sharding:
NoSQL: Indexing and sharding strategies vary.
Elasticsearch: Provides customizable indexing and sharding options, enabling efficient data distribution and retrieval.
Use Cases for Elasticsearch:
Log and Event Data Analysis:
- Elasticsearch is widely used for log and event data analysis, offering fast and efficient searches across vast amounts of log data.
Search Engines:
- Its roots in full-text search make Elasticsearch a popular choice for building search engines.
Business Intelligence and Analytics:
- Elasticsearch's aggregation capabilities make it suitable for business intelligence and analytics applications.
Monitoring and Alerting:
- It is commonly used for real-time monitoring and alerting systems.
Understanding Elasticsearch
Key Concepts:
Index:
An index in Elasticsearch is similar to a database in a relational database management system (RDBMS). It is a collection of documents that share a common purpose.
Each document within an index represents a JSON object with key-value pairs.
Document:
A document is the basic unit of data in Elasticsearch and is represented in JSON format.
It contains one or more fields, each with its own data type.
Mapping:
Mapping defines the data type and other properties of each field in a document.
It is akin to a schema in a relational database.
Basic Elasticsearch Queries
Simple Query:
Let's consider a scenario where we have an index named "employees," and each document represents an employee:
{
  "id": 1,
  "name": "John Doe",
  "age": 35,
  "certifications": ["Java", "AWS"],
  "reportees": [2, 3]
}
To retrieve all documents:
GET /employees/_search
{
  "query": {
    "match_all": {}
  }
}
GET /employees/_search:
This is an HTTP GET request sent to Elasticsearch, searching in the index named employees.
"query": { "match_all": {} }:
This is the core of the request, a match_all query. It matches all documents in the specified index (employees here) without any filtering or condition.
Search by Field:
To search for employees of age 40 or above, but not more than 50:
GET /employees/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 40,
        "lte": 50
      }
    }
  }
}
The range query finds documents where the value of the specified field falls within a given numerical or date range.
Here, the field is age.
"gte": 40 means "greater than or equal to 40".
"lte": 50 means "less than or equal to 50".
So, it matches all documents where the age field's value is between 40 and 50, inclusive.
Use a range query when you want to find documents where a field's value falls within a specific numeric, date, or IP range.
Elasticsearch Queries in .NET C#
Now, let's implement these queries in .NET C# using the NEST library. There is also a newer NuGet package named Elastic.Clients.Elasticsearch, but it had some compatibility issues and breaking changes at the time of writing, so we are sticking with NEST for the moment.
Setting Up NEST:
First, install the NEST NuGet package in your .NET project:
dotnet add package NEST
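The snippets in this article use an Employee type without defining it. A minimal sketch of what such a class might look like, inferred from the properties the queries reference, is shown here. The Skill class and all property types are assumptions made for illustration; the original model may differ, and later snippets that use FirstName, LastName, or EmployeeId would need those properties added. By default NEST camel-cases property names when it infers field names, so Age corresponds to the age field of the JSON document.

```csharp
using System;
using System.Collections.Generic;

// Build the sample employee from the JSON document shown earlier.
var john = new Employee
{
    Id = 1,
    Name = "John Doe",
    Age = 35,
    Certifications = new List<string> { "Java", "AWS" },
    Reportees = new List<int> { 2, 3 }
};
Console.WriteLine($"{john.Name} holds {john.Certifications.Count} certifications.");
// prints "John Doe holds 2 certifications."

// Hypothetical model: property names are inferred from the queries in this article.
public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public int Age { get; set; }
    public string Department { get; set; } = "";
    public List<string> Certifications { get; set; } = new();
    public List<int> Reportees { get; set; } = new();
    public List<Skill> Skills { get; set; } = new();
}

// Hypothetical nested object used by the nested-query example.
public class Skill
{
    public string Name { get; set; } = "";
    public string Level { get; set; } = "";
}
```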
Retrieving All Documents:
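using System;
using Nest;

// Point the client at the cluster and make "employees" the default index
var settings = new ConnectionSettings(new Uri("http://your-elasticsearch-server:9200"))
    .DefaultIndex("employees");
var client = new ElasticClient(settings);

// Without an explicit Size, a search returns only the first 10 hits
var searchResponse = await client.SearchAsync<Employee>(s => s
    .Query(q => q.MatchAll())
);
// Process searchResponse as needed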
Retrieving Employees aged 40 and above:
var searchResponse = await client.SearchAsync<Employee>(s => s
    .Query(q => q
        .Range(r => r
            .Field(f => f.Age)
            .GreaterThanOrEquals(40)
        )
    )
);
// Process searchResponse as needed
Utilizing Search Descriptor
A Search Descriptor in Elasticsearch is a way to construct complex queries programmatically, allowing for more flexibility and customization.
Example Using NEST:
Consider the following C# code using NEST to retrieve employees over the age of 40:
var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Range(r => r
            .Field(f => f.Age)
            .GreaterThanOrEquals(41)
        )
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
Selective Field Retrieval
Limiting Fields in the Response:
To retrieve only specific fields, use the Source method in the Search Descriptor:
var searchDescriptor = new SearchDescriptor<Employee>()
    .Source(src => src
        .Includes(fields => fields
            .Field(f => f.Id)
            .Field(f => f.Name)
            .Field(f => f.Age)
            .Field(f => f.Reportees)
        )
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
Handling Large Result Sets with Scroll
Elasticsearch limits how many results a single search can page through: by default a request returns 10 hits, and from + size cannot exceed 10,000 (the index.max_result_window setting). The Scroll API retrieves larger result sets in batches.
Example Code with Scroll:
Here's an example of using the Scroll API to retrieve all employees:
var searchDescriptor = new SearchDescriptor<Employee>()
    .Size(1000)     // Batch size
    .Scroll("5m");  // How long to keep the scroll context alive between requests
var allEmployees = new List<Employee>();
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
do
{
    allEmployees.AddRange(searchResponse.Documents);
    var scrollRequest = new ScrollRequest(searchResponse.ScrollId, "5m");
    searchResponse = await client.ScrollAsync<Employee>(scrollRequest);
} while (searchResponse.IsValid && searchResponse.Documents.Any());
// Release the scroll context on the server
await client.ClearScrollAsync(new ClearScrollRequest(searchResponse.ScrollId));
Here, we set a Size and a Scroll on the searchDescriptor. The do-while loop then reads the current batch of documents from searchResponse.Documents on each iteration and creates a new scrollRequest to fetch the next batch. The loop runs until a batch comes back empty or a request fails. At the end, we clear the scroll by calling client.ClearScrollAsync.
Enhancing Elasticsearch Queries
Querying by Multiple Conditions:
Sometimes, you may need to combine conditions for more refined queries. For instance, to find employees aged between 30 and 40 with specific certifications:
var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Bool(b => b
            .Must(
                m => m.Range(r => r
                    .Field(f => f.Age)
                    .GreaterThanOrEquals(30)
                    .LessThanOrEquals(40)),
                m => m.Terms(t => t
                    .Field(f => f.Certifications)
                    .Terms("Java", "AWS"))
            )
        )
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
Another example is below, where we search for all employees in the IT department who hold a Java or AWS certification (a terms query matches when the field contains at least one of the listed values). Here, instead of using the Bool function, we combine the queries directly with the && operator, which NEST translates into a bool query in which both conditions must be met.
var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q =>
        q.Term(t => t
            .Field(f => f.Department)
            .Value("IT")) &&
        q.Terms(t => t
            .Field(f => f.Certifications)
            .Terms("Java", "AWS"))
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
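A terms query matches a document when the field contains at least one of the listed values, which is easy to confuse with requiring all of them. The difference can be sketched with plain in-memory collections; the sample certification list is hypothetical and no Elasticsearch call is involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var wanted = new[] { "Java", "AWS" };
var certifications = new List<string> { "AWS", "Azure" };   // hypothetical employee

// Semantics of a terms query: at least one listed value is present.
bool matchesAny = wanted.Any(certifications.Contains);
// Stricter alternative: every listed value is present.
bool matchesAll = wanted.All(certifications.Contains);

Console.WriteLine($"any: {matchesAny}, all: {matchesAll}");   // any: True, all: False
```

If every value has to be present, one term condition per value, combined so that all of them must match, expresses that instead of a single terms query.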
Querying by Conditions on Nested fields:
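Sometimes, you may need to add conditions on nested fields. Below we find all employees who have a skill entry whose Name is "Java" and whose Level is "Expert". Both conditions are evaluated against the same skill entry, which requires the Skills field to be mapped with the nested type in the index.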
var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Nested(nst => nst
            .Path(p => p.Skills)
            // Inner parameter is named nq so that it does not shadow the outer q
            .Query(nq =>
                nq.Term(t => t
                    .Field(f => f.Skills.First().Name)
                    .Value("Java")) &&
                nq.Term(t => t
                    .Field(f => f.Skills.First().Level)
                    .Value("Expert"))
            )
        )
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
Fuzzy Search:
When dealing with potential typos or variations in the data, a fuzzy search can be useful. For instance, searching for employees with a name similar to "John":
var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Fuzzy(f => f
            .Field(p => p.Name)
            .Value("John")
            .Fuzziness(Fuzziness.Auto)
        )
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
.Query(q => q.Fuzzy(...)): Builds a fuzzy query, which matches terms similar but not identical to "John".
.Field(p => p.Name): Specifies the field Name (e.g., first name) in documents to search.
.Value("John"): The search term to fuzzy match.
.Fuzziness(Fuzziness.Auto): Automatically determines fuzziness based on the search term length:
Short terms (<= 2 characters) require exact matches.
Medium terms (3 to 5 characters) allow 1 edit (insert, delete, substitute, or swap of two adjacent characters).
Longer terms (more than 5 characters) allow up to 2 edits.
The result is a search that tolerates spelling mistakes or minor differences from "John" in the Name field.
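The length-based rule can be sketched in a few lines of plain C#. This is an illustration of the idea rather than the engine's implementation: FuzzySketch and its methods are hypothetical names, the thresholds are those of the default AUTO setting, and the distance here is classic Levenshtein, whereas Elasticsearch by default also counts a swap of two adjacent characters as a single edit.

```csharp
using System;

// "john" has 4 characters, so the AUTO rule allows 1 edit.
Console.WriteLine(FuzzySketch.AllowedEdits("john"));          // 1
Console.WriteLine(FuzzySketch.EditDistance("john", "jon"));   // 1
Console.WriteLine(FuzzySketch.WouldMatch("john", "jon"));     // True
Console.WriteLine(FuzzySketch.WouldMatch("john", "jane"));    // False

public static class FuzzySketch
{
    // Edit budget for the default AUTO setting: 0-2 chars -> 0, 3-5 chars -> 1, longer -> 2.
    public static int AllowedEdits(string term) =>
        term.Length <= 2 ? 0 : term.Length <= 5 ? 1 : 2;

    // Classic Levenshtein distance (insert, delete, substitute).
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // A candidate term matches when it is within the edit budget of the search term.
    public static bool WouldMatch(string searchTerm, string indexedTerm) =>
        EditDistance(searchTerm, indexedTerm) <= AllowedEdits(searchTerm);
}
```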
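To run a fuzzy search on both first name and last name, combine two fuzzy queries in the Should clause of a Bool query, so that a document matches if either field is close to its search term: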
var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Bool(b => b
            .Should(
                sh => sh.Fuzzy(f => f.Field(p => p.FirstName).Value("John").Fuzziness(Fuzziness.Auto)),
                sh => sh.Fuzzy(f => f.Field(p => p.LastName).Value("Doe").Fuzziness(Fuzziness.Auto))
            )
        )
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
We can also provide a score threshold, so that only matches scoring at or above it are returned. The right value can only be found by trial and error, because scores vary between data sets and, within the same data set, between field combinations.
For example, say you start with a threshold of 10.0 and see a lot of unnecessary matches, so you increase it to 50.0, but then no matches show up. You then try 25.0, which returns only close matches, so you keep that value.
var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Match(m => m
            .Field(f => f.Name)
            .Query("John")
            .Fuzziness(Fuzziness.Auto)
        )
    )
    .MinScore(25.0); // Match score must be at least 25.0 (chosen after some trial and error)
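The trial-and-error process described above can be rehearsed offline before changing the query. The sketch below uses made-up names and scores, not real Elasticsearch output, and simply counts how many hits would survive each candidate threshold, since a minimum score drops every hit that scores below it.

```csharp
using System;
using System.Linq;

// Hypothetical (name, score) pairs standing in for the hits of a fuzzy match query.
var hits = new (string Name, double Score)[]
{
    ("John Doe", 41.3),
    ("Jon Snow", 27.8),
    ("Joan Park", 14.2),
    ("Jane Roe", 11.5)
};

// Try each candidate threshold and list the hits that would be kept.
foreach (var threshold in new[] { 10.0, 25.0, 50.0 })
{
    var kept = hits.Where(h => h.Score >= threshold).Select(h => h.Name).ToArray();
    Console.WriteLine($"min_score {threshold}: {kept.Length} hit(s) [{string.Join(", ", kept)}]");
}
```

With these sample scores, the lowest threshold keeps all four hits, the middle one keeps the two closest names, and the highest keeps none, which mirrors the narrowing-down described in the text.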
Aggregations for Data Insights
Terms Aggregation:
Aggregations in Elasticsearch allow you to analyze and summarize data. A terms aggregation, for example, can help find the distribution of certifications among employees:
var searchDescriptor = new SearchDescriptor<Employee>()
    .Aggregations(a => a
        .Terms("certifications", t => t.Field(f => f.Certifications))
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
var certificationsAggregation = searchResponse.Aggregations.Terms("certifications");
foreach (var bucket in certificationsAggregation.Buckets)
{
    Console.WriteLine($"Certification: {bucket.Key}, Count: {bucket.DocCount}");
}
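To make clear what the buckets contain, here is a rough in-memory equivalent of a terms aggregation written with LINQ. The sample data is hypothetical and nothing here talks to Elasticsearch; it only shows the shape of the result, one bucket per distinct value with the number of documents that contain it, largest bucket first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical certification lists standing in for three indexed employee documents.
var employees = new List<string[]>
{
    new[] { "Java", "AWS" },
    new[] { "AWS" },
    new[] { "Azure", "AWS" }
};

// One bucket per distinct value, counting each document at most once per value.
var buckets = employees
    .SelectMany(certs => certs.Distinct())
    .GroupBy(cert => cert)
    .Select(g => (Key: g.Key, DocCount: g.Count()))
    .OrderByDescending(b => b.DocCount)
    .ThenBy(b => b.Key, StringComparer.Ordinal)
    .ToList();

foreach (var bucket in buckets)
{
    Console.WriteLine($"Certification: {bucket.Key}, Count: {bucket.DocCount}");
}
// prints:
// Certification: AWS, Count: 3
// Certification: Azure, Count: 1
// Certification: Java, Count: 1
```

Unlike this sketch, the real aggregation returns only the top buckets (10 by default) unless a larger size is requested.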
Sorting Results
Sorting by Field:
Sorting results based on a specific field, such as sorting employees by age in descending order:
var searchDescriptor = new SearchDescriptor<Employee>()
    .Sort(s => s.Descending(p => p.Age));
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
Boosting Relevance
Boosting:
Boosting allows you to influence the relevance score of documents. For example, boosting employees with the "AWS" certification:
var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Bool(b => b
            .Should(
                s => s.Term(t => t.Field(f => f.Certifications).Value("AWS").Boost(2)),
                s => s.MatchAll()
            )
        )
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
The .Boost(2) function multiplies the score contributed by the Term query on the Certifications field with the value "AWS" by a factor of 2.
Because the scores of all matching Should clauses are added together, documents having the "AWS" certification receive this doubled contribution on top of the score from MatchAll(), and therefore rank above documents matched only by MatchAll().
Boost is a way to weight or prioritize certain query clauses over others within a composed query.
Handling Textual Searches
Full-Text Search:
For more complex textual searches, Elasticsearch provides a powerful full-text search capability. For example, searching for employees with a name containing "John" or "Doe":
var searchDescriptor = new SearchDescriptor<Employee>()
    .Query(q => q
        .Match(m => m
            .Field(f => f.Name)
            .Query("John Doe")
        )
    );
var searchResponse = await client.SearchAsync<Employee>(searchDescriptor);
Adding Data
Say we want to add a new employee to the employees index. We call the IndexDocumentAsync() function of ElasticClient, which writes the document to the default index configured on the client.
public async Task AddAsync(Employee employee)
{
    // Add the new employee to the default index
    var indexResponse = await client.IndexDocumentAsync(employee);
    if (indexResponse.IsValid)
    {
        Console.WriteLine("Document indexed successfully.");
    }
    else
    {
        Console.WriteLine($"Error indexing document: {indexResponse.DebugInformation}");
    }
}
Updating Data
Updating employee object
Say we read an employee from the index, change some data, and pass it back to be updated. We call the UpdateAsync() function of ElasticClient. Because the whole Employee object is sent as the update document, every field it carries is overwritten in the index.
public async Task UpdateAsync(Employee employee, CancellationToken cancellationToken)
{
    // Update the employee in the employee index
    var response = await client.UpdateAsync(
        new DocumentPath<Employee>(employee),
        u => u.Doc(employee).Index("employee_index"),
        cancellationToken);
    if (response.IsValid)
    {
        Console.WriteLine("Document updated successfully.");
    }
    else
    {
        // OriginalException can be null, so fall back to the debug information
        throw response.OriginalException ?? new InvalidOperationException(response.DebugInformation);
    }
}
Updating a specific property in Employee object
Say we read an employee from the index and change only the Skills data, and we want only the Skills data to be updated in the index. We can call the UpdateByQueryAsync() function combined with the Script() function to update just that property, so the rest of the object is not overwritten. We need to provide a search criterion, here the employee id, to make sure that only the correct employee's data is updated.
public async Task UpdateSkillsAsync(Employee employee, CancellationToken cancellationToken)
{
    // Update only the skills property of the matching employee in the employee index
    var response = await client.UpdateByQueryAsync<Employee>(u => u
        .Query(q => q
            .Term(t => t
                .Field(f => f.EmployeeId)
                .Value(employee.EmployeeId)))
        .Script(s => s
            // The field name in the script must match the name stored in _source
            .Inline("ctx._source.Skills = params.value") // Use Source instead of Inline for newer NEST NuGet versions
            .Params(p => p.Add("value", employee.Skills)))
        .Refresh()
        .Index("employee_index"), cancellationToken);
    if (response.IsValid)
    {
        Console.WriteLine("Document updated successfully.");
    }
    else
    {
        // OriginalException can be null, so fall back to the debug information
        throw response.OriginalException ?? new InvalidOperationException(response.DebugInformation);
    }
}
Deleting Data
You can delete an employee's document by calling DeleteAsync:
public async Task DeleteAsync(string employeeId, CancellationToken cancellationToken)
{
    // Delete the employee from the employee index
    var response = await client.DeleteAsync<Employee>(employeeId,
        idx => idx.Index("employee_index"),
        cancellationToken);
    if (response.IsValid)
    {
        Console.WriteLine("Document deleted successfully.");
    }
    else
    {
        // OriginalException can be null (for example when the document was not found)
        throw response.OriginalException ?? new InvalidOperationException(response.DebugInformation);
    }
}
Conclusion: Mastering Elasticsearch Data Retrieval and storage
Elasticsearch, when coupled with .NET and the NEST library, empowers developers to handle a wide array of scenarios for data retrieval and storage. From combining conditions in queries to leveraging aggregations for insights, sorting, boosting relevance, and performing full-text searches, Elasticsearch accommodates various use cases seamlessly.



