Observability with the ELK Stack

Elastic
Creators of ELK / Elastic Stack (Elasticsearch, Logstash, Kibana, Beats & More)

Written By Tanya Bragin, Product Lead, Elastic


In my role as a Product Lead for Observability at Elastic, I get a few different reactions when I use the term 'observability'. The most common reaction by far is still: "What is 'observability'?" But I also increasingly hear things like: "We just kicked off an 'observability initiative', but we're still figuring out exactly how to go about it." And finally, some organizations we have been fortunate to work with already consider 'observability' an integral part of how they design and build products and services.

Given that the term is still gaining traction, I thought it would be useful to demystify how we at Elastic view 'observability', what we learned from our thought-leading customers, and how we think about it from the product perspective as we evolve our stack for operational use cases.

What is 'Observability'?

We certainly did not invent the term 'observability'. We started hearing about it from users, primarily those within the Site Reliability Engineering (SRE) community. Several sources trace the beginnings of this term back to SRE organizations at Silicon Valley giants like Twitter. And even though the seminal Google SRE Book does not mention the term, it lays out many of the principles associated with 'observability' today.

'Observability' is not something that a vendor delivers in a box -- it is an attribute of a system you build, much like usability, high availability, and stability. The goal of designing and building an 'observable' system is to make sure that when it is run in production, the operators responsible for it can detect undesirable behaviors (e.g. service downtime, errors, slow responses) and have actionable information to pin down the root cause effectively (e.g. detailed event logs, granular resource usage information, and application traces). Common challenges preventing organizations from achieving this seemingly obvious goal include not collecting enough information, collecting too much information without making it actionable, and fragmenting access to this information.

The first aspect, detection of undesirable behaviors, usually starts with setting Service Level Indicators (SLIs) and Service Level Objectives (SLOs). These are internal measures of success by which production systems are judged in observability-minded organizations. If there is a contractual obligation to fulfill these objectives, an SLI/SLO may also translate to a Service Level Agreement (SLA). The most common example of an SLI is system uptime, for which you may set an SLO of 99.9999%. System uptime is also the most common SLA exposed to external customers. However, your internal SLIs and SLOs may be a lot more granular, and monitoring and alerting on these most important factors of production system behavior is the basis of any observability initiative. This aspect of observability is also known by the term "monitoring".
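To make an uptime SLO concrete, it helps to translate the percentage into a downtime budget. The sketch below is illustrative arithmetic only; the function names and the 365-day window are my assumptions, not part of any Elastic product.

```python
# Illustrative arithmetic: convert an availability SLO into allowed downtime.
# The function names and the default 365-day window are assumptions.
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(slo_percent: float, window_days: float = 365.0) -> float:
    """Return the downtime budget, in seconds, for an availability SLO."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    unavailable_fraction = 1.0 - slo_percent / 100.0
    return unavailable_fraction * window_days * SECONDS_PER_DAY


def slo_met(observed_downtime_seconds: float, slo_percent: float,
            window_days: float = 365.0) -> bool:
    """True when the observed downtime fits inside the budget for the window."""
    return observed_downtime_seconds <= allowed_downtime_seconds(slo_percent, window_days)


if __name__ == "__main__":
    for slo in (99.9, 99.99, 99.9999):
        budget = allowed_downtime_seconds(slo)
        print(f"{slo}% over 365 days allows about {budget:.1f} seconds of downtime")
```

Under these assumptions, a 99.9999% SLO leaves roughly half a minute of downtime per year, while 99.9% leaves close to nine hours. That gap is why the choice of objective matters as much as the choice of indicator.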

The second aspect, providing operators with granular information to debug production issues quickly and efficiently, is an area where we see a lot of movement and innovation. There is quite a bit of talk about the "three pillars of observability": metrics, logs, and application traces. There is also recognition that simply collecting all this granular data using a patchwork of tools does not necessarily make it actionable and is often not cost effective.

'Pillars' of Observability

Let's examine these data collection aspects in more detail. The status quo we typically encounter today is to collect metrics into one system (usually a time series database or a SaaS service for resource monitoring), collect logs into a second system (unsurprisingly, often the ELK stack in our conversations), and use yet a third tool to instrument applications to provide request-level tracing. When an alert fires, indicating a breach in a service level, operators madly dart over to their systems and perform the best "swivel chair integration" they can -- looking at metrics in one browser window, manually correlating them with logs in another window, and pulling up traces (if relevant) in yet a third window.

This approach has several drawbacks. First, manual correlation of different data sources all telling the same story wastes valuable time during service degradation or outage. Second, operational costs of maintaining three different operational data stores are onerous — licensing costs, separate headcount for administrators of disparate operational tools, inconsistent machine learning capabilities in each datastore, "headspace" for thinking through different semantics for alerting — every organization I speak with struggles with all of these challenges.

There is an increasing recognition of how important it is to have all this information in a single operational store with the ability to automatically correlate this data in an intuitive user interface. Nirvana for the users we talk to is to expose their operators to every piece of data relevant to the service they are supporting in a unified way, whether it be a log line emitted by the application, trace data resulting from instrumentation, or resource utilization represented by metrics in a time series. Requirements we hear about stress uniform, ad-hoc access to this data regardless of the source, from search and filtering, to aggregations, to visualizations. Starting with metrics and drilling into logs and traces in a few clicks without switching context accelerates investigations. Similarly, numerical values extracted from structured logs look surprisingly like metrics, and visualizing both side by side has tremendous value from an operational perspective.
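To illustrate how numbers pulled out of structured logs start to look like metrics, here is a minimal sketch in plain Python. The JSON log format, the field names (`@timestamp`, `status`, `response_time_ms`), and the sample lines are all made up for the example.

```python
import json
from collections import defaultdict
from datetime import datetime

# Hypothetical structured log lines; the field names are assumptions.
SAMPLE_LOGS = [
    '{"@timestamp": "2019-03-01T12:00:05", "status": 200, "response_time_ms": 120}',
    '{"@timestamp": "2019-03-01T12:00:40", "status": 500, "response_time_ms": 900}',
    '{"@timestamp": "2019-03-01T12:01:10", "status": 200, "response_time_ms": 80}',
]


def logs_to_metrics(lines):
    """Group log events into one-minute buckets and derive simple metrics."""
    buckets = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        timestamp = datetime.fromisoformat(event["@timestamp"])
        minute = timestamp.replace(second=0, microsecond=0)
        buckets[minute].append(event)

    series = []
    for minute in sorted(buckets):
        events = buckets[minute]
        total_ms = sum(e["response_time_ms"] for e in events)
        series.append({
            "minute": minute.isoformat(),
            "requests": len(events),
            "errors": sum(1 for e in events if e["status"] >= 500),
            "avg_response_time_ms": total_ms / len(events),
        })
    return series


if __name__ == "__main__":
    for point in logs_to_metrics(SAMPLE_LOGS):
        print(point)
```

Each output row is a point in a time series (request count, error count, average latency per minute), which is exactly the shape of data you would otherwise collect with a dedicated metrics tool.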

As mentioned before, simply collecting the data may result in too much information on disk and not enough actionable intelligence when an incident occurs. Increasingly, there is an expectation that the system collecting operational data provides automatic detection of "interesting" events, traces, and anomalies in the patterns of time series. This helps operators investigating a problem zero in on the root cause faster. These anomaly detection capabilities are sometimes referred to as the "fourth pillar of observability". Detecting anomalies across uptime data, resource utilization, anomalies in logging patterns, and most relevant traces is an emerging requirement observability teams put forth.
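As a rough intuition for what "anomalies in the patterns of time series" means, consider the toy detector below. It flags points that deviate strongly from a trailing window of recent values. This is a simple z-score stand-in that I chose for illustration, not the algorithm used by Elastic machine learning, and the window size and threshold are arbitrary assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value is far from the mean of the trailing window.

    A point is flagged when it lies more than `threshold` sample standard
    deviations from the mean of the previous `window` points.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Made-up response times in milliseconds with one obvious spike.
    response_times = [100, 102, 98, 101, 99, 100, 103, 97, 400, 101]
    print(find_anomalies(response_times))
```

A production system needs to handle seasonality, trends, and multiple correlated signals, which is why this capability is usually expected from the platform rather than hand-rolled by operators.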

Observability... and the ELK Stack?

So what does observability have to do with the Elastic Stack (or ELK Stack, as it's lovingly referred to in operational circles)?

The ELK Stack is widely known as the de facto way to centralize logs from operational systems. The assumption is that Elasticsearch (a "search engine") is a good place to put text-based logs for the purposes of free-text search. And indeed, simply searching text-based logs for the word "error" or filtering logs based on a set of well-known tags is extremely powerful, and is often where most users start.
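For readers who have not seen it, a free-text search combined with a tag filter can be expressed as a `bool` query in the Elasticsearch query DSL. The sketch below only builds the request body; the helper name and the field names `message` and `tags` are assumptions about how your logs are indexed, and the `term` filter presumes `tags` is mapped as a keyword field.

```python
import json


def build_log_search(text, tags=(), size=20):
    """Build a search body: full-text match on the message, exact match on tags.

    Field names "message" and "tags" are assumptions about the index mapping.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored full-text match against the log message.
                "must": [{"match": {"message": text}}],
                # Non-scoring exact-value filters, one per tag.
                "filter": [{"term": {"tags": tag}} for tag in tags],
            }
        },
    }


if __name__ == "__main__":
    body = build_log_search("error", tags=["production", "checkout"])
    print(json.dumps(body, indent=2))
```

The resulting JSON is what you would send to the `_search` endpoint of a log index, either from Kibana's developer console or from any HTTP client.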

However, as most ELK Stack users know, Elasticsearch as a datastore offers a lot more than an inverted index for efficient full-text search and simple filtering abilities. It also contains a columnar store optimized for storing and operating on dense numerical time series. This columnar store is used to store structured data extracted from parsed logs, both string and numerical. In fact, the use case of converting logs to metrics is what initially drove us to optimize Elasticsearch for efficient storage and retrieval of numbers.
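The numerical side of Elasticsearch is typically exercised through aggregations. As a hedged sketch, the helper below builds a `date_histogram` aggregation with a nested `avg`, which buckets documents over time and averages a numeric field in each bucket. The helper name and the field `response_time_ms` are assumptions; note also that the interval parameter is named `interval` in older releases and `fixed_interval` or `calendar_interval` in later ones, so it is left configurable here.

```python
import json


def build_avg_over_time(field="response_time_ms", interval="1m",
                        interval_param="fixed_interval"):
    """Build an aggregation body averaging a numeric field per time bucket.

    `field` is a hypothetical numeric field extracted from parsed logs.
    `interval_param` should be "interval" on older Elasticsearch releases.
    """
    return {
        "size": 0,  # Return only aggregation results, not individual documents.
        "aggs": {
            "over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    interval_param: interval,
                },
                "aggs": {
                    "avg_value": {"avg": {"field": field}},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_avg_over_time(), indent=2))
```

The same request shape works whether the numbers came from parsed log lines or were indexed directly as metrics, which is the practical meaning of treating logs and metrics uniformly.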

Over time, users started putting numerical time series directly into Elasticsearch, replacing legacy time series databases. Driven by this need, Elastic recently introduced Metricbeat for automated collection of metrics, the concept of automatic rollups, and other metrics-specific functionality both in the datastore and the UI. As a result, more and more users who adopted the ELK Stack for logs have also started putting metric data, such as resource utilization, into the Elastic Stack. In addition to the operational savings mentioned above, one attractive reason for this was the lack of restrictions Elasticsearch places on the cardinality of fields eligible for numerical aggregations (a common gripe brought up when discussing many existing time series databases).

Similar to metrics, uptime data has been a highly valued type of data alongside logs, representing an important source of SLO/SLI alerts from an active monitor. Uptime data can provide information about degradation of services, APIs, and websites, oftentimes before the users feel the impact. The bonus is that uptime data is tiny in terms of storage requirements, so it delivers a lot of value for very little additional cost.

Within the past year Elastic has also introduced Elastic APM, adding application tracing and distributed tracing capabilities to the stack. This was a natural evolution for us, as several open-source projects and prominent APM vendors were already using Elasticsearch to store and search trace data. The status quo in traditional APM tools is to keep APM trace data separate from logs and metrics, perpetuating operational data silos. Elastic APM offers a set of agents for collecting trace data from supported languages and frameworks, supports OpenTracing, and automatically correlates this trace data with the metrics and logs.

A common thread across all these data inputs is that each of them is just another index in Elasticsearch. There are no restrictions on the aggregations you run on all this data, how you visualize it in Kibana, or how alerting and machine learning apply to each data source. To see this in action, check out this video.

Observable Kubernetes and the Elastic Stack

One community where the concept of observability is a very active topic of conversation is the set of users adopting Kubernetes for container orchestration. These "cloud native" users, a term popularized by the Cloud Native Computing Foundation (or CNCF), face unique challenges. They face a massive centralization of applications and services built on or migrated to a Kubernetes-powered container orchestration platform, coupled with the trend to split up monolithic apps into "microservices". Tools and methods that worked before to provide necessary visibility into applications running on top of this infrastructure no longer work.

Kubernetes observability deserves a separate post all on its own, so for now I will refer you to the Observable Kubernetes webinar and the Distributed Tracing with Elastic APM blog post for more information.

What's next?

In a post like this, it seems appropriate to leave the reader with a few resources to explore.

To learn more about observability best practices, I recommend starting with the above-mentioned Google SRE Book. Blog posts from companies whose livelihood depends on flawless operation of their critical apps in production are also typically very thought-provoking. For example, I find this recent post by Salesforce engineering to be a pragmatic and practical guide to iteratively improving the state of observability.

To try out Elastic Stack capabilities for your observability initiatives, spin up the latest version of our stack on the Elasticsearch Service on Elastic Cloud (a great sandbox even if you ultimately deploy self-managed), or download and install Elastic Stack components locally. Make sure to check out the new Logs, Infrastructure monitoring, APM, and Uptime (coming soon in 6.7) UIs in Kibana, purpose-built for common observability workflows. And feel free to ping us with questions on the Discuss forums. We're there to help!
