Observability with the ELK Stack


Written By Tanya Bragin, Product Lead, Elastic


In my role as a Product Lead for Observability at Elastic, I get a few different reactions when I use the term 'observability'. By far the most common reaction today is still: "What is 'observability'?" But I also increasingly hear things like: "We just kicked off an 'observability initiative', but we're still figuring out exactly how to go about it." And finally, some organizations we have been fortunate to work with already consider 'observability' an integral part of how they design and build products and services.

Given that the term is still gaining traction, I thought it would be useful to demystify how we at Elastic view 'observability', what we learned from our thought-leading customers, and how we think about it from the product perspective as we evolve our stack for operational use cases.

What is 'Observability'?

We certainly did not invent the term 'observability'. We started hearing about it from users, primarily those within the Site Reliability Engineering (SRE) community. Several sources trace the beginnings of this term back to the SRE organizations of Silicon Valley giants like Twitter. And even though the seminal Google SRE Book does not mention the term, it lays out many of the principles associated with 'observability' today.

'Observability' is not something that a vendor delivers in a box -- it is an attribute of a system you build, much like usability, high availability, and stability. The goal of designing and building an 'observable' system is to make sure that when it is run in production, the operators responsible for it can detect undesirable behaviors (e.g. service downtime, errors, slow responses) and have actionable information to pin down root cause in an effective manner (e.g. detailed event logs, granular resource usage information, and application traces). Common challenges preventing organizations from achieving this seemingly obvious goal include not collecting enough information, collecting too much information without making it actionable, and fragmenting access to that information.

The first aspect — detection of undesirable behaviors — usually starts with setting Service Level Indicators (SLIs) and Objectives (SLOs). These are the internal measures of success by which production systems are judged in observability-minded organizations. If there is a contractual obligation to fulfill these objectives, an SLI/SLO may also translate into a Service Level Agreement (SLA). The most common example of an SLI is system uptime, for which you may set an SLO of 99.9999%. System uptime is also the most common SLA exposed to external customers. However, your internal SLIs/SLOs may be a lot more granular, and monitoring and alerting on these most important factors of production system behavior is the basis of any observability initiative. This aspect of observability is also known by the term "monitoring".
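To make the arithmetic behind an availability SLO concrete, here is a minimal sketch (not tied to any Elastic tooling) that converts an SLO percentage into the downtime "error budget" it allows over a given window:

```python
# Illustrative only: translate an availability SLO into an "error budget"
# of allowed downtime, a common first step when defining SLIs/SLOs.

def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted over the window at a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100.0)

for slo in (99.9, 99.99, 99.9999):
    print(f"{slo}% over 30 days -> {allowed_downtime_minutes(slo):.2f} min of budget")
```

At 99.9999% over a 30-day window, the budget works out to roughly 2.6 seconds of downtime, which is why SLOs this strict are rare outside of marketing copy.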

The second aspect — providing operators with granular information to debug production issues quickly and efficiently — is an area where we see a lot of movement and innovation. There is quite a bit of talk about the "three pillars of observability" — metrics, logs, and application traces. There is also recognition that simply collecting all this granular data using a patchwork of tools is not necessarily actionable and often not cost effective.

'Pillars' of Observability

Let's examine these data collection aspects in more detail. The status quo we typically encounter today is to collect metrics into one system (usually a time series database or a SaaS service for resource monitoring), collect logs into a second system (unsurprisingly, often the ELK Stack in our conversations), and to use yet a third tool to instrument applications for request-level tracing. When an alert fires, indicating a breach in a service level, operators madly dart over to their systems and perform the best "swivel chair integration" they can -- looking at metrics in one browser window, manually correlating them to logs in another window, and pulling up traces (if relevant) in yet a third window.

This approach has several drawbacks. First, manually correlating different data sources that all tell the same story wastes valuable time during a service degradation or outage. Second, the cost of maintaining three different operational data stores is onerous — licensing costs, separate headcount to administer disparate tools, inconsistent machine learning capabilities in each datastore, the "headspace" needed to think through different alerting semantics — every organization I speak with struggles with all of these challenges.

There is an increasing recognition of how important it is to have all this information in a single operational store with the ability to automatically correlate it in an intuitive user interface. Nirvana for the users we talk to is to expose their operators to every piece of data relevant to the service they are supporting in a unified way, whether it be a log line emitted by the application, trace data resulting from instrumentation, or resource utilization represented by metrics in a time series. The requirements we hear stress uniform, ad-hoc access to this data regardless of the source, from search and filtering, to aggregations, to visualizations. Starting with metrics and drilling into logs and traces in a few clicks, without switching context, accelerates investigations. Similarly, numerical values extracted from structured logs look surprisingly like metrics, and visualizing the two side-by-side has tremendous value from an operational perspective.
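As a rough illustration of what uniform, ad-hoc access regardless of source can look like, the sketch below runs the same service and time-range filter against log, metric, and trace indices in a single Elasticsearch cluster. The index patterns and field names (service.name, @timestamp) are assumptions for illustration, not a prescribed schema:

```python
# A minimal sketch of the "single operational store" idea: the same time-range
# and service filter applied to logs, metrics, and traces living in one
# Elasticsearch cluster. Index patterns and field names are illustrative.
import requests

ES = "http://localhost:9200"
SHARED_FILTER = {
    "bool": {
        "filter": [
            {"term": {"service.name": "checkout"}},         # hypothetical service name
            {"range": {"@timestamp": {"gte": "now-15m"}}},   # the same window everywhere
        ]
    }
}

for index in ("filebeat-*", "metricbeat-*", "apm-*"):        # logs, metrics, traces
    resp = requests.post(f"{ES}/{index}/_search",
                         json={"query": SHARED_FILTER, "size": 5})
    hits = resp.json()["hits"]["hits"]
    print(index, "->", len(hits), "sample documents")
```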

As mentioned before, simply collecting the data may result in too much information on disk and not enough actionable intelligence when an incident occurs. Increasingly, there is an expectation that the system collecting operational data provides automatic detection of "interesting" events, traces, and anomalies in the patterns of time series. This helps operators investigating a problem zero in on the root cause faster. These anomaly detection capabilities are sometimes referred to as the "fourth pillar of observability". Detecting anomalies across uptime data, resource utilization, and logging patterns, and surfacing the most relevant traces, is an emerging requirement observability teams put forth.
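To be concrete about what "automatic detection of anomalies in the patterns of time series" means at its simplest, here is a toy rolling z-score detector. It is purely illustrative and is not how Elastic's machine learning features work:

```python
# Not Elastic's machine learning -- just a toy rolling z-score to illustrate
# flagging points in a time series that deviate strongly from the recent past.
from statistics import mean, stdev

def anomalies(series, window=30, threshold=3.0):
    """Yield (index, value) points far outside the recent baseline."""
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(series[i] - mu) / sigma > threshold:
            yield i, series[i]

cpu = [0.30 + 0.02 * (i % 5) for i in range(60)] + [0.95]  # sudden spike at the end
print(list(anomalies(cpu)))                                # -> [(60, 0.95)]
```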

Observability... and the ELK Stack?

So what does observability have to do with the Elastic Stack (or ELK Stack, as it's lovingly referred to in operational circles)?

The ELK Stack is widely known as the de facto way to centralize logs from operational systems. The assumption is that Elasticsearch (a "search engine") is a good place to put text-based logs for the purposes of free-text search. And indeed, simply searching text-based logs for the word "error" or filtering logs on a set of well-known tags is extremely powerful, and is often where most users start.
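For instance, that "search logs for the word error" starting point might look like the sketch below against the REST API, assuming logs land in an index pattern like logstash-* with message and tags fields (illustrative names, not requirements):

```python
# Sketch: free-text search for "error" plus a tag and time-range filter,
# expressed in the Elasticsearch query DSL. Index and field names are examples.
import requests

query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "error"}}],         # free-text search
            "filter": [
                {"term": {"tags": "production"}},               # well-known tag
                {"range": {"@timestamp": {"gte": "now-1h"}}},   # recent events only
            ],
        }
    },
    "sort": [{"@timestamp": "desc"}],
    "size": 20,
}

resp = requests.post("http://localhost:9200/logstash-*/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```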

However, as most ELK Stack users know, Elasticsearch as a datastore offers a lot more than an inverted index for efficient full-text search and simple filtering. It also contains a columnar store optimized for storing and operating on dense numerical time series, which is used to hold structured data extracted from parsed logs, both strings and numbers. In fact, the use case of converting logs to metrics is what initially drove us to optimize Elasticsearch for efficient storage and retrieval of numbers.

Over time, users started putting numerical time series directly into Elasticsearch, replacing legacy time series databases. Driven by this need, Elastic recently introduced Metricbeat for automated collection of metrics, the concept of automatic rollups, and other metrics-specific functionality in both the datastore and the UI. As a result, more and more users that adopted the ELK Stack for logs have also started putting metric data, such as resource utilization, into the Elastic Stack. In addition to the operational savings already mentioned above, one attractive reason for this is the lack of restrictions Elasticsearch places on the cardinality of fields eligible for numerical aggregations (a common gripe with many existing time series databases).
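As a sketch of what using Elasticsearch as a metrics store looks like in practice, the aggregation below buckets a numeric field into per-minute averages. The index pattern and field are Metricbeat-style examples, and exact parameter names (e.g. fixed_interval) vary slightly between Elasticsearch versions:

```python
# Sketch: a date_histogram with an avg sub-aggregation, i.e. "give me average
# CPU per minute for the last hour". Index pattern and field are examples.
import requests

agg_query = {
    "size": 0,  # we only want the aggregation buckets, not raw documents
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "per_minute": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
            "aggs": {"avg_cpu": {"avg": {"field": "system.cpu.total.pct"}}},
        }
    },
}

resp = requests.post("http://localhost:9200/metricbeat-*/_search", json=agg_query)
for bucket in resp.json()["aggregations"]["per_minute"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_cpu"]["value"])
```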

Similar to metrics, uptime data has long been highly valued alongside logs, as an important source of SLO/SLI alerts from an active monitor. Uptime data can reveal degradation of services, APIs, and websites, often before users feel the impact. The bonus is that uptime data is tiny in terms of storage requirements, so it delivers a lot of value for very little additional cost.
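Conceptually, an active uptime monitor boils down to something like the toy check below: probe an endpoint and record the result as a small document (Elastic's Heartbeat automates this kind of check; the endpoint, index, and field names here are placeholders):

```python
# Toy active uptime check: probe a URL, record up/down status and response
# time as a document. Endpoint, index name, and fields are illustrative.
import time
import requests

def check(url: str) -> dict:
    started = time.time()
    try:
        status = requests.get(url, timeout=5).status_code
        up = status < 500
    except requests.RequestException:
        status, up = None, False
    return {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "monitor.url": url,
        "monitor.status": "up" if up else "down",
        "http.response.status_code": status,
        "monitor.duration_ms": round((time.time() - started) * 1000, 1),
    }

doc = check("https://example.org/healthz")                  # hypothetical endpoint
requests.post("http://localhost:9200/uptime-checks/_doc", json=doc)
```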

Within the past year Elastic has also introduced Elastic APM, adding application tracing and distributed tracing capabilities to the stack. This was a natural evolution for us, as several open-source projects and prominent APM vendors were already using Elasticsearch to store and search trace data. The status quo in traditional APM tools is to keep trace data separate from logs and metrics, perpetuating operational data silos. Elastic APM offers a set of agents that collect trace data from supported languages and frameworks, supports OpenTracing, and automatically correlates this trace data with metrics and logs.
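That correlation hinges on shared identifiers. As a sketch: if APM documents and application logs both carry a trace.id field (Elastic APM writes one; propagating it into your logs depends on your logging setup), a single filter retrieves both sides of the story. The index patterns and trace id below are placeholders:

```python
# Sketch: pull every signal tied to one trace id out of trace and log indices.
# Index patterns and the trace id value are placeholders.
import requests

trace_id = "0123456789abcdef0123456789abcdef"   # placeholder trace id
query = {"query": {"term": {"trace.id": trace_id}}, "size": 50}

for index in ("apm-*", "filebeat-*"):
    resp = requests.post(f"http://localhost:9200/{index}/_search", json=query)
    print(index, "matching documents:", resp.json()["hits"]["total"])
```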

A common thread across all these data inputs is that each of them is just another index in Elasticsearch. There are no restrictions on the aggregations you run on this data, how you visualize it in Kibana, or how alerting and machine learning apply to each data source. To see this in action, check out this video.

Observable Kubernetes and the Elastic Stack

One community where the concept of observability is a very active topic of conversation is the set of users adopting Kubernetes for container orchestration. These "cloud native" users, a term popularized by the Cloud Native Computing Foundation (or CNCF), face unique challenges: a massive centralization of applications and services built on or migrated to a Kubernetes-powered container orchestration platform, coupled with the trend of splitting monolithic apps into "microservices". Tools and methods that previously provided the necessary visibility into applications running on top of this infrastructure no longer work.

Kubernetes observability deserves a separate post all on its own, so for now I will refer you to the Observable Kubernetes webinar and the Distributed Tracing with Elastic APM blog post for more information.

What's next?

In a post like this, it seems appropriate to leave the reader with a few resources to explore.

To learn more about observability best practices, I recommend starting with the above-mentioned Google SRE Book. Blog posts from companies whose livelihood depends on flawless operation of their critical apps in production are also typically very thought-provoking. For example, I find this recent post by Salesforce engineering to be a pragmatic and practical guide to iteratively improving the state of observability.

To try out Elastic Stack capabilities for your observability initiatives, spin up the latest version of our stack on the Elasticsearch Service on Elastic Cloud (a great sandbox even if you ultimately deploy self-managed), or download and install the Elastic Stack components locally. Make sure to check out the new Logs, Infrastructure, APM, and Uptime (coming soon in 6.7) UIs in Kibana, purpose-built for common observability workflows. And feel free to ping us with questions on the Discuss forums — we're there to help!
