Observability with the ELK Stack

Elastic
Creators of ELK / Elastic Stack (Elasticsearch, Logstash, Kibana, Beats & More)

Written By Tanya Bragin, Product Lead, Elastic


In my role as a Product Lead for Observability at Elastic, I get a few different reactions when I use the term 'observability'. By far the most common reaction today is still: "What is 'observability'?" But I also increasingly hear things like: "We just kicked off an 'observability initiative', but we're still figuring out exactly how to go about it." And finally, some organizations we have been fortunate to work with already consider 'observability' an integral part of how they design and build products and services.

Given that the term is still gaining traction, I thought it would be useful to demystify how we at Elastic view 'observability', what we learned from our thought-leading customers, and how we think about it from the product perspective as we evolve our stack for operational use cases.

What is 'Observability'?

We certainly did not invent the term 'observability'. We started hearing about it from users, primarily those within the Site Reliability Engineering (SRE) community. Several sources trace the beginnings of this term back to SRE organizations at Silicon Valley giants like Twitter. And even though the seminal Google SRE Book does not mention the term, it lays out many of the principles associated with 'observability' today.

'Observability' is not something that a vendor delivers in a box -- it is an attribute of a system you build, much like usability, high availability, and stability. The goal of designing and building an 'observable' system is to make sure that when it is run in production, the operators responsible for it can detect undesirable behaviors (e.g. service downtime, errors, slow responses) and have actionable information to pin down the root cause effectively (e.g. detailed event logs, granular resource usage information, and application traces). Common challenges preventing organizations from achieving this seemingly obvious goal include not collecting enough information, collecting too much information but not making it actionable, and fragmenting access to this information.

The first aspect -- detection of undesirable behaviors -- usually starts with setting Service Level Indicators (SLIs) and Service Level Objectives (SLOs). These are internal measures of success by which production systems are judged in observability-minded organizations. If there is a contractual obligation to fulfill these objectives, an SLI/SLO may also translate into a Service Level Agreement (SLA). The most common example of an SLI is system uptime, for which you may set an SLO of 99.9999%. System uptime is also the most common SLA exposed to external customers. However, your internal SLIs/SLOs may be a lot more granular, and monitoring and alerting on these most important factors of production system behavior is the basis of any observability initiative. This aspect of observability is also known by the term "monitoring".
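To make the arithmetic behind an uptime SLO concrete, here is a minimal Python sketch that converts an availability target into the downtime it tolerates. The function name and the 30-day measurement window are illustrative assumptions, not something any particular tool prescribes.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return how much downtime an availability SLO tolerates over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    unavailable_fraction = 1 - slo_percent / 100
    return timedelta(seconds=window.total_seconds() * unavailable_fraction)


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    for target in (99.9, 99.99, 99.9999):
        budget = allowed_downtime(target, window)
        print(f"{target}% over 30 days -> {budget.total_seconds():.1f} s of downtime")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes of downtime per 30 days, while 99.9999% leaves less than three seconds, which is why the choice of objective matters as much as the indicator itself.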

The second aspect — providing operators with granular information to debug production issues quickly and efficiently — is an area where we see a lot of movement and innovation. There is quite a bit of talk about the "three pillars of observability" — metrics, logs, and application traces. There is also recognition that simply collecting all this granular data using a patchwork of tools is not necessarily actionable and often not cost effective.

'Pillars' of Observability

Let's examine these data collection aspects in more detail. The status quo we typically encounter today is to collect metrics into one system (usually a time series database or a SaaS service for resource monitoring), collect logs into a second system (unsurprisingly, often the ELK stack in our conversations), and to use yet a third tool to instrument applications to provide request level tracing. When an alert fires, indicating a breach in a service level, operators madly dart over to their systems and perform the best "swivel chair integration" they can -- looking at metrics in one browser window, manually correlating them with logs in another window, and pulling up traces (if relevant) in yet a third window.

This approach has several drawbacks. First, manual correlation of different data sources all telling the same story wastes valuable time during service degradation or outage. Second, operational costs of maintaining three different operational data stores are onerous — licensing costs, separate headcount for administrators of disparate operational tools, inconsistent machine learning capabilities in each datastore, "headspace" for thinking through different semantics for alerting — every organization I speak with struggles with all of these challenges.

There is an increasing recognition of how important it is to have all this information in a single operational store with the ability to automatically correlate this data in an intuitive user interface. Nirvana for the users we talk to is to expose their operators to every piece of data relevant to the service they are supporting in a unified way, whether it be a log line emitted by the application, trace data resulting from instrumentation, or resource utilization represented by metrics in a time series. Requirements we hear about stress uniform, ad-hoc access to this data regardless of the source, from search and filtering, to aggregations, to visualizations. Starting with metrics and drilling into logs and traces in a few clicks without switching context accelerates investigations. Similarly, numerical values extracted from structured logs look surprisingly like metrics, and visualizing both side by side has tremendous value from an operational perspective.
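The point about logs looking like metrics can be sketched in a few lines of Python. This is only an illustration: the log lines are made up, and the field names `@timestamp` and `response_ms` are assumptions about how the application structures its logs.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical structured log lines; the field names are assumptions.
LOG_LINES = [
    '{"@timestamp": "2019-02-01T10:00:05Z", "path": "/cart", "response_ms": 120}',
    '{"@timestamp": "2019-02-01T10:00:40Z", "path": "/cart", "response_ms": 180}',
    '{"@timestamp": "2019-02-01T10:01:10Z", "path": "/cart", "response_ms": 900}',
]


def logs_to_metric(lines, field="response_ms"):
    """Average a numeric log field per minute, yielding a small time series."""
    buckets = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        minute = event["@timestamp"][:16]  # truncate ISO timestamp to the minute
        buckets[minute].append(event[field])
    return {minute: mean(values) for minute, values in sorted(buckets.items())}


if __name__ == "__main__":
    for minute, avg_ms in logs_to_metric(LOG_LINES).items():
        print(minute, avg_ms)
```

Once the response time is pulled out of each event and bucketed by time, the result has the same shape as any other metric and can be charted next to CPU or memory usage.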

As mentioned before, simply collecting the data may result in too much information on disk and not enough actionable intelligence when an incident occurs. Increasingly, there is an expectation that the system collecting operational data provides automatic detection of "interesting" events, traces, and anomalies in the patterns of time series. This helps operators investigating a problem zero in on the root cause faster. These anomaly detection capabilities are sometimes referred to as the "fourth pillar of observability". Detecting anomalies in uptime data, resource utilization, and logging patterns, and surfacing the most relevant traces, is an emerging requirement that observability teams put forth.
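As a rough illustration of what "detecting anomalies in a time series" means, the sketch below flags points that sit far from the mean of a trailing window. It is a deliberately simple z-score check with assumed defaults for the window size and threshold, and it is far simpler than what a production anomaly detection feature would do.

```python
from statistics import mean, pstdev


def find_anomalies(series, window=5, threshold=3.0):
    """Flag indexes whose value deviates strongly from the trailing window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as anomalous.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    response_ms = [100, 102, 98, 101, 99, 100, 450, 101]  # made-up samples
    print(find_anomalies(response_ms))
```

Even this naive version shows the operational value: instead of scanning a chart, the operator is pointed at the one sample that broke the pattern.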

Observability... and the ELK Stack?

So what does observability have to do with the Elastic Stack (or ELK Stack, as it's lovingly referred to in operational circles)?

The ELK Stack is widely known as the de facto way to centralize logs from operational systems. The assumption is that Elasticsearch (a "search engine") is a good place to put text-based logs for the purposes of free-text search. And indeed, simply searching text-based logs for the word "error" or filtering logs based on a set of well-known tags is extremely powerful, and is often where most users start.
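For readers who have not seen it, a search like the one described above can be expressed in the Elasticsearch Query DSL as a `bool` query that combines a `match` clause with `term` filters. The Python sketch below only builds the request body; the field names `message` and `tags` are assumptions about how the logs were indexed.

```python
import json


def build_log_search(text, tags=()):
    """Build an Elasticsearch Query DSL body: free-text match plus tag filters."""
    query = {"bool": {"must": [{"match": {"message": text}}]}}
    if tags:
        # Filter clauses narrow the result set without affecting relevance scores.
        query["bool"]["filter"] = [{"term": {"tags": tag}} for tag in tags]
    return {"query": query}


if __name__ == "__main__":
    body = build_log_search("error", tags=["production", "checkout"])
    print(json.dumps(body, indent=2))
```

The resulting JSON could be sent to the `_search` endpoint of whichever index holds the logs; every tag listed must be present for a document to match.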

However, as most ELK Stack users know, Elasticsearch as a datastore offers a lot more than an inverted index for efficient full-text search and simple filtering abilities. It also contains a columnar store optimized for storing and operating on dense numerical time series. This columnar store is used to store structured data extracted from parsed logs, both string and numerical. In fact, the use case of converting logs to metrics is what initially drove us to optimize Elasticsearch for efficient storage and retrieval of numbers.
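The difference between the two structures is easier to see in a toy example. The Python sketch below is a conceptual illustration only, with made-up documents; it does not reflect how Elasticsearch or Lucene lay out data on disk.

```python
from collections import defaultdict

# Made-up parsed log events.
DOCS = [
    {"message": "disk error on node a", "bytes": 512},
    {"message": "request ok", "bytes": 2048},
    {"message": "timeout error", "bytes": 128},
]

# Inverted index: term -> ids of the documents containing it (good for search).
inverted = defaultdict(set)
for doc_id, doc in enumerate(DOCS):
    for term in doc["message"].split():
        inverted[term].add(doc_id)

# Columnar store: one dense array per field, indexed by document id
# (good for sorting and aggregating numbers).
bytes_column = [doc["bytes"] for doc in DOCS]

matching = sorted(inverted["error"])  # documents mentioning "error"
total_bytes = sum(bytes_column[i] for i in matching)
print(matching, total_bytes)
```

The search step touches only the inverted index and the aggregation step touches only the column, which is why one datastore can serve both free-text search and numerical analysis over the same events.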

Over time, users started putting numerical time series directly into Elasticsearch, replacing legacy time series databases. Driven by this need, Elastic recently introduced Metricbeat for automated collection of metrics, the concept of automatic rollups, and other metrics-specific functionality both in the datastore and the UI. As a result, a growing number of users who adopted the ELK Stack for logs have also started putting metric data, such as resource utilization, into the Elastic Stack. In addition to the operational savings already mentioned above, one attractive reason for this was the lack of restrictions Elasticsearch places on the cardinality of fields eligible for numerical aggregations (a common gripe brought up when discussing many existing time series databases).
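The idea behind a rollup, summarizing fine-grained samples into coarser time buckets so older data takes less space, can be sketched in plain Python. This illustrates the concept only; the function and its one-hour default are assumptions and do not describe how the Elasticsearch rollup feature is configured.

```python
from collections import defaultdict


def rollup(samples, bucket_seconds=3600):
    """Summarize (epoch_seconds, value) samples into fixed-size time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    cpu_samples = [(0, 10.0), (1800, 30.0), (3600, 50.0)]  # made-up data
    print(rollup(cpu_samples))
```

Keeping the minimum, maximum, average, and count per bucket preserves enough to chart long-term trends while discarding the individual samples.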

Similar to metrics, uptime data has been a highly valued type of data alongside logs, representing an important source of SLO/SLI alerts from an active monitor. Uptime data can provide information about degradation of services, APIs, and websites, oftentimes before the users feel the impact. The bonus is that uptime data is tiny in terms of storage requirements, so it delivers a lot of value for very little additional cost.
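To show how uptime checks feed an SLI, here is a minimal Python sketch that turns a list of check results into an availability percentage and compares it with an objective. The function names and the sample data are hypothetical.

```python
def availability(checks):
    """Return the percentage of successful checks (the uptime SLI)."""
    if not checks:
        raise ValueError("no checks recorded")
    return 100 * sum(1 for ok in checks if ok) / len(checks)


def breaches_slo(checks, slo_percent):
    """True when measured availability falls below the objective."""
    return availability(checks) < slo_percent


if __name__ == "__main__":
    results = [True] * 999 + [False]  # made-up outcome of 1,000 checks
    print(availability(results), breaches_slo(results, 99.99))
```

Each check result is a single boolean plus a timestamp, which is why this kind of data costs so little to store compared with logs.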

Within the past year Elastic has also introduced Elastic APM, adding application tracing and distributed tracing capabilities to the stack. This was a natural evolution for us, as several open-source projects and prominent APM vendors were already using Elasticsearch to store and search trace data. Status quo in traditional APM tools is to keep APM trace data separate from logs and metrics, perpetuating operational data silos. Elastic APM offers a set of agents for collecting trace data from supported languages and frameworks as well as supporting OpenTracing, and this trace data is automatically correlated with the metrics and logs.
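Correlating traces with logs comes down to sharing an identifier between the two. The Python sketch below illustrates the idea with made-up events; the `trace_id` field name is an assumption and not necessarily the field Elastic APM uses.

```python
from collections import defaultdict

# Hypothetical events; the shared "trace_id" field is an assumption.
SPANS = [
    {"trace_id": "abc", "name": "GET /checkout", "duration_ms": 950},
    {"trace_id": "def", "name": "GET /health", "duration_ms": 4},
]
LOGS = [
    {"trace_id": "abc", "message": "payment gateway timeout"},
    {"trace_id": "abc", "message": "retrying payment"},
    {"message": "cache warmed"},  # not every log line belongs to a trace
]


def attach_logs(spans, logs):
    """Pair each span with the log messages that carry the same trace id."""
    by_trace = defaultdict(list)
    for log in logs:
        if "trace_id" in log:
            by_trace[log["trace_id"]].append(log["message"])
    return {span["name"]: by_trace.get(span["trace_id"], []) for span in spans}


if __name__ == "__main__":
    for name, messages in attach_logs(SPANS, LOGS).items():
        print(name, messages)
```

When both data types live in the same store, this join is a query rather than a manual exercise across browser windows.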

A common thread across all these data inputs is that each of them is just another index in Elasticsearch. There are no restrictions on the aggregations you run on all this data, how you visualize it in Kibana, or how alerting and machine learning apply to each data source. To see this in action, check out this video.

Observable Kubernetes and the Elastic Stack

One community where the concept of observability is a very active topic of conversation is the set of users adopting Kubernetes for container orchestration. These "cloud native" users, a term popularized by the Cloud Native Computing Foundation (or CNCF), face unique challenges. They face a massive centralization of applications and services built on or migrated to a Kubernetes-powered container orchestration platform, coupled with the trend to split up monolithic apps into "microservices". Tools and methods that worked before to provide necessary visibility into applications running on top of this infrastructure no longer work.

Kubernetes observability deserves a separate post all on its own, so for now I will refer you to the Observable Kubernetes webinar and the Distributed Tracing with Elastic APM blog post for more information.

What's next?

In a post like this, it seems appropriate to leave the reader with a few resources to explore.

To learn more about observability best practices, I recommend starting with the above-mentioned Google SRE Book. Blog posts from companies whose livelihood depends on flawless operation of their critical apps in production are also typically very thought-provoking. For example, I find this recent post by Salesforce engineering to be a pragmatic and practical guide to iteratively improving the state of observability.

To try out Elastic Stack capabilities for your observability initiatives, spin up the latest version of our stack on the Elasticsearch Service on Elastic Cloud (a great sandbox even if you ultimately deploy self-managed), or download and install Elastic Stack components locally. Make sure to check out the new Logs, Infrastructure monitoring, APM, and Uptime (coming soon in 6.7) UIs in Kibana, purpose-built for common observability workflows. And feel free to ping us with questions on the Discuss forums -- we're there to help!
