Data Quality Engineer at FLO HEALTH

Fast data growth and the importance of data-driven decisions make the data catalog one of the core components of data management. Flo Health adheres to high standards in engineering, including data solutions. The company itself has grown rapidly in recent years, and the accompanying increase in data made it obvious that we needed a solution to issues of data ownership, quality, and discoverability, as well as data governance.

So how did we resolve these issues?

Background and data-related issues:

Data ownership and responsibilities

Most data pipelines and datasets were owned by product teams and data analysts, but some pipelines with complex calculation logic were owned by data engineers. Some assets had multiple owners, and some had none. On top of that, the responsibilities of data owners were not clearly defined. As a result, some data pipelines followed data engineering best practices, were covered with high-quality tests, and were routinely validated, while others were not, because their owners did not consider those things to be their responsibility.

Data discoverability and observability

With growing volumes of data, complex data pipelines, changing data sources, and an increasing number of people dealing with data, data discovery and observability became a challenge. It was increasingly difficult to determine the business context of data, to find the right data for analysis (because similar data was stored in multiple tables), and to understand downstream processes.

Data trustability and governance

We didn’t have a simple entry point that would let data users both find the right data for analysis and know whether that data was trustworthy (i.e., whether it had been tested, by what types of tests, and when it had last been successfully tested). There also wasn’t a centralized place to find all de-identified PII data in the storage or an automatic mechanism to identify potential privacy noncompliance ahead of time.
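The three trust questions above (has the data been tested, by which types of tests, and when did it last pass) can be answered from a plain log of test runs. The following is a minimal sketch of that idea only; the `QualityTestRun` record and the `trust_summary` helper are hypothetical names invented for illustration and do not come from any tool mentioned in this post.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class QualityTestRun:
    """One execution of a data quality test (hypothetical record shape)."""
    table: str
    test_type: str          # e.g. "not_null", "uniqueness", "freshness"
    passed: bool
    finished_at: datetime


def trust_summary(runs: list[QualityTestRun], table: str) -> dict:
    """Summarize whether a table is tested, by which test types, and when it last passed."""
    table_runs = [r for r in runs if r.table == table]
    passed_times = [r.finished_at for r in table_runs if r.passed]
    last_success: Optional[datetime] = max(passed_times) if passed_times else None
    return {
        "table": table,
        "tested": bool(table_runs),
        "test_types": sorted({r.test_type for r in table_runs}),
        "last_successful_run": last_success,
    }
```

A summary like this is the kind of information a catalog entry needs to show next to a table so that users can judge it at a glance.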

What we needed

So, we needed a single entry point for working with the data that could resolve all our data issues. Of course, we also had high technical expectations:

  • Ability to integrate with various data sources: Glue, Databricks, Looker, etc.
  • Rich data lineage
  • REST API

In addition, we wanted the tool to have a clear and simple UI and UX so that it wouldn’t hinder adoption of the data governance process. Everyone in the company, regardless of technical skillset, needed to be able to easily gather insights from data.

There’s a multitude of solutions on the market:

  • Open-source solutions: Amundsen, DataHub, Magda, Atlas, etc.
  • Proprietary solutions: Alation, Atlan, Collibra, etc.
  • Mono-cloud solutions: Google Cloud Data Catalog, Azure Data Catalog
  • Data observability platforms: Datafold, Monte Carlo

However, not all of them could meet our needs and fit a reasonable budget. You can find a high-level overview of the available tools here: github.com/Alexkuva/awesome-data-catalogs. To make the decision, our data engineers performed a comprehensive analysis of available open-source and proprietary solutions and agreed to go with Atlan.

What we can do with Atlan

  1. Collect and centralize all the company’s metadata in one place and add necessary technical and business information for each entity (e.g., table, column, dashboard).
  2. Get transparency in data pipelines with the help of automated data lineage across multiple sources.
  3. Achieve clarity on data ownership. All core tables contain the name of the responsible team owner in Atlan. People are also assigned as owners and experts.
  4. Develop a business glossary and connect data with appropriate glossary terms to help users understand the context of data assets.
  5. Integrate our data quality tests with Atlan metadata via REST API to automatically set the status of test executions on a particular table.
  6. Run data profiling functionality to collect base statistical information about the data.
  7. Query all tables directly from Atlan, save and share SQL queries, and collaborate on issues via integrated chat.
  8. Comply with security and confidentiality regulations for data user management via access policies, user groups, and roles.
  9. Auto-detect PII using provided column names and auto-glossary recommendation options.
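Item 9 relies on column names. As an illustration of the general idea only (this is not Atlan’s detection logic, which isn’t described here), a name-based check can be sketched as matching column names against a list of patterns. The patterns and the `flag_pii_columns` helper below are assumptions made up for this example.

```python
import re

# Hypothetical name patterns; a real list would be agreed with the privacy team.
PII_NAME_PATTERNS = [
    r"e?mail",
    r"phone",
    r"(first|last|full)_?name",
    r"birth",
    r"ip_?addr",
]

# One case-insensitive regex that matches if any pattern occurs in the name.
_PII_RE = re.compile(
    "|".join(f"(?:{p})" for p in PII_NAME_PATTERNS), re.IGNORECASE
)


def flag_pii_columns(columns: list[str]) -> list[str]:
    """Return the column names that look like PII, in their original order."""
    return [c for c in columns if _PII_RE.search(c)]
```

Name matching of this kind only produces candidates: it can miss PII stored under neutral column names and flag harmless columns, so the results still need human review.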

This isn’t everything Atlan can do, just the functionality that we’re using the most at Flo right now. It should be mentioned that we’ve been using Atlan for a little less than a year, and we’re still in the process of rolling out the data catalog to users. So far, we haven’t faced any bottlenecks in the Atlan functionality related to our needs. I’m excited to see how it goes.


Marina Skuryat
