Cultivating your Data Lake

1,612
Segment
Segment provides the customer data infrastructure that businesses use to put their customers first. With Segment, companies can collect, unify, and connect their first-party data to over 200 marketing, analytics, and data warehousing tools.

This post is by Lauren Reeder


All too often, we hear that businesses want to do more with their customer data. They want to be data-informed, they want to provide better customer experiences, and—most of all—they just want to understand their customers.

Getting there isn’t easy. Not only do you need to collect and store the data, you also need to identify the useful pieces and act on the insights.

At Segment, we’ve helped thousands of businesses walk the path toward becoming more data-informed. One successful technique we’ve seen time and time again is establishing a working data lake.

A data lake is a centralized repository that stores both structured and unstructured data and allows you to store massive amounts of data in a flexible, cost effective storage layer. Data lakes have become increasingly popular both because businesses have more data than ever before, and it’s never been cheaper and easier to collect and store it all.

In this post, we’ll dive into the different layers to consider when working with a data lake.

  • We’ll start with an object store, such as S3 or Google Cloud Storage, as a cheap and reliable storage layer.
  • Next is the query layer, such as Athena or BigQuery, which will allow you to explore the data in your data lake through a simple SQL interface.
  • A central piece is a metadata store, such as the AWS Glue Catalog, which connects all the metadata (its format, location, etc.) with your tools.
  • Finally, you can take advantage of a transformation layer on top, such as EMR, to run aggregations, write to new tables, or otherwise transform your data.

As heavy users of all of these tools in AWS, we’ll share some examples, tips, and recommendations for customer data in the AWS ecosystem. These same concepts also apply to other clouds and beyond.

Storage Layer: S3

If you take one idea away from this blog post, let it be this: store a raw copy of your data in S3.

It’s cheap, scalable, incredibly reliable, and plays nicely with the other tools in the AWS ecosystem. It’s very likely your entire storage bill for S3 will cost you less than a hundred dollars per month. If we look across our entire customer base, less than 1% of our customers have an S3 bill over $100/month for data collected by Segment.

That said, the simplicity of S3 can be a double-edged sword. While S3 is a great place to keep all of your data, it often requires a lot of work to collect the data, load it, and actually get to the insights you’re looking for.

There are three important factors to keep in mind when collecting and storing data on S3:

  • encoding – data files can be encoded any number of ways (CSV, JSON, Parquet, ORC), and each one can have big performance implications.
  • batch size – file size has important ramifications, both for your uploading strategy (and data freshness) and for your query times.
  • partition scheme – partitions refers to the ‘hierarchy’ for data, and the way your data is partitioned or structured can impact search performance.

Structuring data within your data lake

We’ll discuss each of these in a bit more depth, but first it’s worth understanding how data first enters your data lake.

There are a number of ways to get data into S3, such as uploading via the S3 UI or CLI. But if you’re talking customer data, it’s easy to start delivering your data to S3 via the Segment platform. The Segment platform provides the infrastructure to collect, clean, and control your first party customer data and send exactly what you need to all the tools you need it in.

Encoding

The encoding of your files has a significant impact on the performance of your queries and data analysis. For large workloads, you’ll want to use a binary format like Parquet or ORC (we’re beginning to support these natively. If you’d like beta access, please get in touch!).

To understand why, consider what a machine has to do to read JSON vs Parquet.

When looking at JSON, the data looks something like this:

{ 'userId': 'user-1', 'name': 'Lauren', 'company': 'Segment' }
{ 'userId': 'user-2', 'name': 'Parsa', 'company': 'Segment }
{ 'userId': 'user-3', 'company': 'Microsoft', 'name': 'Satya' }
{ 'userId': 'user-4', 'name': 'Elon', 'company': 'Tesla' }

Here, we must parse not only the whole message, but each key individually, and each value. Because each JSON object might have a different schema (and is totally unordered), we have to do roughly the same work for each row.

Additionally, even if we are just picking out companies, or names, we have to parse all of our data. There’s no ‘shortcut’ where we can jump to the middle of a given row.

Contrast that with Parquet, and we see a much different schema. In Parquet, we’ve pre-defined the schema ahead of time, and we end up storing columns of data together. Below is an example of the previous JSON document transformed in Parquet format. You can see the users are stored together on the right, as they are all in the same column.

See users stored together on the right

A reader doesn’t have to parse out and keep a complicated in-memory representation of the object, nor does it have to read entire lines to pick out one field. Instead it can quickly jump to the section of the files it needs and parse out the relevant columns.

Instead of just taking my word for it, below are a few concrete benchmarks which query both JSON and Parquet.

In each of the four scenarios, we can see big gains from using Parquet.

As you can see, the data we need to query in each instance is limited for Parquet. For JSON, we need to query the full body of each JSON event every time.

Batch Size

Batch size, or the amount of data in each file, is tricky to tune. Having too large of a batch means that you will have to re-upload or re-process a lot of data in the case of a hiccup or machine failure. Having a bunch of files which are too small means that your query times will likely be much longer.

Batch size is also tied to encoding, which we discussed above. Certain formats like Parquet and ORC are ‘splittable’, where files can be split and re-combined at runtime. JSON and CSV can be splittable under certain conditions, but typically cannot be split for faster processing.

Generally, we try and target files with sizes ranging from 256 MB to 1 GB. We find this gives the best overall performance mix.

Partitioning

When you start to have more than 1GB of data in each batch, it’s important to think about how a data set is split, or partitioned. Each partition contains only a subset of the data. This increases performance by reducing the amount of data that must be scanned when querying with a tool like Athena or processing data with EMR. For example, a common way to partition data is by date.

Querying

Finally, it’s worth understanding that just having your data in S3 doesn’t really directly help you do any of the things we talked about at the beginning of the post. It’s like having a hard drive, but no CPU.

There are many ways to examine this data — you could download it all, write some code, or try loading it into some other database.

But the easiest is just to write SQL. That’s where Athena comes in.

Query Layer: Athena 🔎

Once you’ve got your data into S3, the best way to start exploring what you’ve collected is through Athena.

Athena is a query engine managed by AWS that allows you to use SQL to query any data you have in S3, and works with most of the common file formats for structured data such as Parquet, JSON, CSV, etc.

In order to get started with Athena, you just need to provide the location of the data, its format, and the specific pieces you care about. Segment events in particular have a specific format, which we can leverage when creating tables for easier analysis.

Setup

Below is an example to set up a table schema in Athena, which we’ll use to look at how many messages we’ve received by type:

CREATE EXTERNAL TABLE IF NOT EXISTS segment_logs.eventlogs (
anonymousid                 string                  ,  # pick columns you care about!
context                     map<string,string>      ,  # using a map for nested JSON
messageid                   string                  ,   
timestamp                   Timestamp               ,   
type                        string                  ,   
userid                      string                  ,   
traits                      map<string,string>      ,  
event                       string                   
)
PARTITIONED BY (sourceid string)               # partition by the axes you expect to query often, sourceid here is associated with each source of data
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://your-s3-bucket/segment-logs'    # location of your data in S3

In addition to creating the table, you will need to add the specific partitions:

ALTER TABLE eventlogs ADD
    PARTITION (sourceid='source1') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source1/'  # sourceid here is associated with each source of data
    PARTITION (sourceid='source2') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source2/'
    PARTITION (sourceid='source3') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source3/'
    PARTITION (sourceid='source4') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source4/'

There are many ways to partition your data. Here, we’ve partitioned by source for each customer. This works for us when we’re looking at a specific customer, but if you’re looking across all customers over time, you may want to partition by date instead.

Query time!

Let’s answer a simple question from the table above. Let’s say we want to know how many messages of each type we saw for a given data source in the past day - we can simply run some SQL to find out from the table we just created in Athena:

  select  type, count(messageid) 
    from  eventlogs 
   where  sourceid='source1' 
     and  date_trunc('day', timestamp) = current_date 
group by  1 
order by  2 desc

For all queries, the cost of Athena is tightly related to how you partition your data and its format. It is also driven by how much data is scanned ($5 per TB).

When scanning JSON, you will be scanning the entire record every time due to how it’s structured (see above for an example). Alternatively, you can set up Parquet for a subset of your data containing only the columns you care about, which is great for limiting your table scans and therefore limiting cost. This is also why Parquet can be so much faster—it has direct access to specific columns without scanning the full JSON.

Metadata: AWS Glue 🗺

Staying current

One challenge with Athena is keeping your tables up to date as you add new data to S3. Athena doesn’t know where your new data is stored, so you need to either update or create new tables, similar to the query above, in order to point Athena in the right direction. Luckily there are tools to help manage your schema and keep the tables up to date.

The AWS Glue Catalog is a central location in which to store and populate table metadata across all your tools in AWS, including Athena. You can populate the catalog either using out of the box crawlers to scan your data, or directly populate the catalog via the Glue API or via Hive. You can see how these all fit together in the diagram below.

Once this is populated with your metadata, Athena and EMR can reference the Glue Catalog for the location, type, and more when querying or otherwise accessing data in S3.

From: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html

Compute Layer: EMR

Moving beyond one-off queries and exploratory analysis, if you want to modify or transform your data, a tool like EMR (Elastic Map Reduce) gives you the power to not only read data but transform it and write into new tables. You may need to write if you want to transform the format of your data from JSON to Parquet, or if you want aggregate % of users completed the signup flow the past month and write it to another table for future use.

Operating EMR

EMR provides managed Hadoop on top of EC2 (AWS’s standard compute instances). Some code and config is required - internally we use Spark and Hive heavily on top of EMR. Hive provides a SQL interface over your data and Spark is a data processing framework that supports many different languages such as Python, Scala, and Java. We’ll walk through an example and more in-depth explanation of each below.

Pattern-wise, managing data with EMR is similar to how Athena operates. You need to tell it where your data is and its format. You can do this each time you need to run a job or take advantage of a central metastore like the AWS Glue Catalog mentioned earlier.

Building on our earlier example, let’s use EMR to find out how many messages of each type we received not only over the past day, but for every day over the past year. This requires going through way more data than we did with Athena, which means we should make a few optimizations to help speed this up.

Data Pre-processing

The first optimization we should make is to transform our data from JSON to Parquet. This will allow us to significantly cut down on the amount of data we need to scan for the final query, as shown previously!

For this JSON to Parquet file format transformation, we’ll want to use Hive, then turn to Spark for the aggregation steps.

Hive is a data warehousing system with a SQL interface for processing large amounts of data and has been around since 2010. Hive really shines when you need to do heavy reads and writes on a ton of data at once, which is exactly what we need when converting all our historical data from JSON into Parquet.

Below is an example of how we would execute this JSON to Parquet transformation.

First, we create the destination table with the final Parquet format we want, which we can do via Hive.

CREATE EXTERNAL TABLE `test_parquet`(
  anonymousid                 string                  ,  
  context                     map<string,string>      ,  
  messageid                   string                  ,   
  timestamp                   Timestamp               ,   
  type                        string                  ,   
  userid                      string                  ,   
  traits                      map<string,string>      ,  
  event                       string                   
)
PARTITIONED BY (dt string)  -- dt will be the prefix on your output files, i.e. s3://your-data-lake/parquet/dt=1432432423/object1.gz
STORED AS PARQUET  -- specify the format you want here
location 's3://your-data-lake/parquet/';

Then we simply need to read from the original JSON table and insert into the newly created Parquet table:

INSERT INTO test_parquet partition (dt) 
SELECT anonymousid, context, messageId, `timestamp`, `type`, userid, traits, event 
FROM test_json;

To actually run this step, we will need to create an EMR job to put some compute behind it. You can do this, by submitting a job to EMR via the UI:

Or, by submitting a job via the CLI:

# EMR CLI example job, with lots of config!
aws emr add-steps \
  --cluster-id j-xxxxx \
  --steps Type=spark, Name=SparkWordCountApp, \ 
  Args=[
    --deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/], \ 
  ActionOnFailure=CONTINUE

Aggregations

Now that we have our data in Parquet format, we can take advantage of Spark to sum how many messages of each type we received and write the results into a final table for future reference.

Spark is useful to run computations or aggregations over your data. It supports languages beyond SQL such as Python, R, Scala, Java, etc. which have more complex logic and libraries available. It also has in memory caching, so intermediate data doesn’t write to disk.

Below is a Python example of a Spark job to do this aggregation of messageid by type.

from datetime import datetime, timezone, timedelta
from pyspark.sql.functions import col, when, count, desc


# S3 buckets for reading Segment logs and writing aggregation output
read_bucket_prefix = 's3://bizops-data-lake/segment-logs/source1'
write_bucket_prefix = "s3://bizops-data-lake-development/tmp/segment-logs-source1"

# get datestamp for prior year
today = datetime.now()
last_year_partition = datetime.strftime(today - timedelta(years=today.weekday(), years=1), '%Y-%m-%d')
last_year_ds = datetime.strptime(last_year_partition, '%Y-%m-%d')


"""
  obtain all logs partitions of the year
  sample filenames:
  [
    's3://bizops-data-lake/segment-logs/source1/1558915200000/', 
    's3://bizops-data-lake/segment-logs/source1/1559001600000/', 
    's3://bizops-data-lake/segment-logs/source1/1559088000000/',
    ...
  ]
"""
read_partitions = []
for day in range(365):
    next_ds = last_year_ds + timedelta(days=day)
    ts_partition = int(1000*next_ds.replace(tzinfo=timezone.utc).timestamp())
    read_year_partitions.append("{}/{}/".format(read_bucket_prefix, ts_partition))

# bucket partition for aggregation output
# sample 's3://bizops-data-lake-development/tmp/segment-logs-source1/week_ds=2019-05-27/'
write_year_ds = "{}/week_start_ds={}/".format(write_bucket_prefix, last_year_partition)

# read logs of last year, from pre-processing step. Faster with parquet!
df = spark.read.parquet(read_year_partitions)

# aggregate by message type
agg_df = df.select("type", "messageid").groupBy("type").agg(
  count(messageid).alias("message_count"),
)

# writing Spark output dataframe to final S3 bucket in parquet format
agg_df.write.parquet(path=write_year_ds, compression='snappy', mode='overwrite')

It is this last step, agg_df.write.parquet, that takes the updated aggregations that are stored in an intermediate format, a DataFrame, and writes these aggregations to a new bucket in Parquet format.

Conclusion

All in, there is a robust ecosystem of tools that are available to get value out of the large amounts of data that can be amassed in a data lake.

It all starts with getting your data into S3. This gives you an incredibly cheap, reliable place to store all your data.

From S3, it’s then easy to query your data with Athena. Athena is perfect for exploratory analysis, with a simple UI that allows you to write SQL queries against any of the data you have in S3. Parquet can help cut down on the amount of data you need to query and save on costs!

AWS Glue makes querying your S3 data even easier, as it serves as the central metastore for what data is where. It is already integrated with both Athena and EMR, and has convenient crawlers that can help map out your data types and locations.

Finally, EMR helps take your data lake to the next level, with the ability to transform, aggregate, and create new rollups of your data with the flexibility of Spark, Hive, and more. It can be more complex to manage, but its data manipulation capabilities are second to none.

At Segment, we help enable a seamless integration with these same systems. Our S3 destination enables customers to have a fresh copy of all their customer and event data in their own AWS accounts.

We’re working on making this even easier by expanding the file format options, as well as integrating with the AWS Glue metastore, so you always have an up to date schema to keep up to date with your latest data. Drop us a line if you want to be part of the beta!

Special thanks to Parsa Shabani, Calvin French-Owen, and Udit Mehta

Segment
Segment provides the customer data infrastructure that businesses use to put their customers first. With Segment, companies can collect, unify, and connect their first-party data to over 200 marketing, analytics, and data warehousing tools.
Tools mentioned in article
Open jobs at Segment
Engineering Manager, Tooling
San Francisco, CA

Overview: 

The Segment Tooling team is looking for a technical manager who is passionate about creating high quality, modern solutions to delivering complex engineering efforts and joining our focus to define the future of Customer Data Platforms.
 
Our Tooling team focuses on our ability to deploy and build at scale, to create best practices across our engineering organization for security, compliance, and enables our highly autonomous engineering teams to execute safely with the latest technologies.
 
We often bring new technologies to the business, and apply them at scale to our hardest challenges. We’ve recently brought in Buildkite for CI/CD and are developing significant technology around it, we invest in solutions like Atlantis, and also create in-house tools, in partnership with other engineering and security teams.
 
Segment operates at a large scale and fast pace, and we enable and accelerate that.
If you are technically adept, enjoy creating tools and systems for engineering, and are passionate about management of a small high impact team, this may be the role for you!

Who We Are @ Segment Engineering:

What we do on the Tooling Team:

  • We identify limitations of our current engineering practices and develop solutions to improve them.
  • We drive the future of CI/CD with technologies like BuildKite, Atlantis, and many other technologies we create and curate.
  • We work with teams like Security to create tools that can help transform our security and compliance posture as we grow.
  • We work with our SRE & other engineering teams to deploy technologies at scale and create tooling and automation to enable many engineers to use them successfully.
  • We have a culture of exceptional testing, automated end-to-end testing, shadow deploys, game days, and high observability.
  • We partner with our customers and exceptional embedded product leadership to ensure the work we do has impact with our customers and our engineering community.

Who we are looking for:

  • You will be a hands on technical manager for the team of 3 engineers, providing technical guidance and team support.
  • You lead by example and care for your teams.
  • You establish credibility through the quality of your own and your team's technical execution and ownership.
  • You have a rare ability to collaborate with and influence diverse engineering teams to improve the reliability, scalability, and durability of their services.
  • You mentor and develop each employee to reach their full potential.
  • You give engineers the tools, confidence, and motivation to make decisions independently that lead to their personal and professional success.
  • You know how to manage multiple projects and balance support & development for teams like Segment tooling.

Requirements:

  • 3+ years of experience as Technical Lead or Manager of a healthy and high-performing engineering team.
  • Previous operational responsibility for business critical systems, in a collaborative environment.
  • Strong sense of urgency and ownership over critical problem areas.
  • CS, EE, or CE degree and/or a demonstrable, solid understanding of CS fundamentals.
  • Solid grasp of Linux systems and networking concepts.
  • Demonstrably effective communicator — in spoken and written forms.

Bonus:

  • Experience with tooling teams in startups.
  • Experience with modern platforms for code build, deploy, and container operations.
 
We encourage you to apply if this role excites you - even if you think you may not meet all of the qualifications. At Segment we live by four values: karma, drive, tribe and focus. We are always looking for outstanding individuals with diverse backgrounds and perspectives who embody these values. To learn more about life at Segment and our commitment to diversity, equity, and inclusion, visit our LinkedIn page. We’re excited to meet you!
 
Segment is an equal opportunity employer. We believe that everyone should receive equal consideration and treatment. Recruitment, hiring, placements, transfers, and promotions will happen based on qualifications for the positions being filled regardless of sex, gender identity, race, religious creed, color, national origin ancestry, age, physical disability, pregnancy, mental disability, or medical condition.
Engineering Manager, Tooling
Vancouver, BC

Overview: 

The Segment Tooling team is looking for a technical manager who is passionate about creating high quality, modern solutions to delivering complex engineering efforts and joining our focus to define the future of Customer Data Platforms.
 
Our Tooling team focuses on our ability to deploy and build at scale, to create best practices across our engineering organization for security, compliance, and enables our highly autonomous engineering teams to execute safely with the latest technologies.
 
We often bring new technologies to the business, and apply them at scale to our hardest challenges. We’ve recently brought in Buildkite for CI/CD and are developing significant technology around it, we invest in solutions like Atlantis, and also create in-house tools, in partnership with other engineering and security teams.
 
Segment operates at a large scale and fast pace, and we enable and accelerate that.
If you are technically adept, enjoy creating tools and systems for engineering, and are passionate about management of a small high impact team, this may be the role for you!

Who We Are @ Segment Engineering:

What we do on the Tooling Team:

  • We identify limitations of our current engineering practices and develop solutions to improve them.
  • We drive the future of CI/CD with technologies like BuildKite, Atlantis, and many other technologies we create and curate.
  • We work with teams like Security to create tools that can help transform our security and compliance posture as we grow.
  • We work with our SRE & other engineering teams to deploy technologies at scale and create tooling and automation to enable many engineers to use them successfully.
  • We have a culture of exceptional testing, automated end-to-end testing, shadow deploys, game days, and high observability.
  • We partner with our customers and exceptional embedded product leadership to ensure the work we do has impact with our customers and our engineering community.

Who we are looking for:

  • You will be a hands on technical manager for the team of 3 engineers, providing technical guidance and team support.
  • You lead by example and care for your teams.
  • You establish credibility through the quality of your own and your team's technical execution and ownership.
  • You have a rare ability to collaborate with and influence diverse engineering teams to improve the reliability, scalability, and durability of their services.
  • You mentor and develop each employee to reach their full potential.
  • You give engineers the tools, confidence, and motivation to make decisions independently that lead to their personal and professional success.
  • You know how to manage multiple projects and balance support & development for teams like Segment tooling.

Requirements:

  • 3+ years of experience as Technical Lead or Manager of a healthy and high-performing engineering team.
  • Previous operational responsibility for business critical systems, in a collaborative environment.
  • Strong sense of urgency and ownership over critical problem areas.
  • CS, EE, or CE degree and/or a demonstrable, solid understanding of CS fundamentals.
  • Solid grasp of Linux systems and networking concepts.
  • Demonstrably effective communicator — in spoken and written forms.

Bonus:

  • Experience with tooling teams in startups.
  • Experience with modern platforms for code build, deploy, and container operations.
 
We encourage you to apply if this role excites you - even if you think you may not meet all of the qualifications. At Segment we live by four values: karma, drive, tribe and focus. We are always looking for outstanding individuals with diverse backgrounds and perspectives who embody these values. To learn more about life at Segment and our commitment to diversity, equity, and inclusion, visit our LinkedIn page. We’re excited to meet you!
 
Segment is an equal opportunity employer. We believe that everyone should receive equal consideration and treatment. Recruitment, hiring, placements, transfers, and promotions will happen based on qualifications for the positions being filled regardless of sex, gender identity, race, religious creed, color, national origin ancestry, age, physical disability, pregnancy, mental disability, or medical condition.
Engineering Manager, Tooling
, US

Overview: 

The Segment Tooling team is looking for a technical manager who is passionate about creating high quality, modern solutions to delivering complex engineering efforts and joining our focus to define the future of Customer Data Platforms.
 
Our Tooling team focuses on our ability to deploy and build at scale, to create best practices across our engineering organization for security, compliance, and enables our highly autonomous engineering teams to execute safely with the latest technologies.
 
We often bring new technologies to the business, and apply them at scale to our hardest challenges. We’ve recently brought in Buildkite for CI/CD and are developing significant technology around it, we invest in solutions like Atlantis, and also create in-house tools, in partnership with other engineering and security teams.
 
Segment operates at a large scale and fast pace, and we enable and accelerate that.
If you are technically adept, enjoy creating tools and systems for engineering, and are passionate about management of a small high impact team, this may be the role for you!

Who We Are @ Segment Engineering:

What we do on the Tooling Team:

  • We identify limitations of our current engineering practices and develop solutions to improve them.
  • We drive the future of CI/CD with technologies like BuildKite, Atlantis, and many other technologies we create and curate.
  • We work with teams like Security to create tools that can help transform our security and compliance posture as we grow.
  • We work with our SRE & other engineering teams to deploy technologies at scale and create tooling and automation to enable many engineers to use them successfully.
  • We have a culture of exceptional testing, automated end-to-end testing, shadow deploys, game days, and high observability.
  • We partner with our customers and exceptional embedded product leadership to ensure the work we do has impact with our customers and our engineering community.

Who we are looking for:

  • You will be a hands on technical manager for the team of 3 engineers, providing technical guidance and team support.
  • You lead by example and care for your teams.
  • You establish credibility through the quality of your own and your team's technical execution and ownership.
  • You have a rare ability to collaborate with and influence diverse engineering teams to improve the reliability, scalability, and durability of their services.
  • You mentor and develop each employee to reach their full potential.
  • You give engineers the tools, confidence, and motivation to make decisions independently that lead to their personal and professional success.
  • You know how to manage multiple projects and balance support & development for teams like Segment tooling.

Requirements:

  • 3+ years of experience as Technical Lead or Manager of a healthy and high-performing engineering team.
  • Previous operational responsibility for business critical systems, in a collaborative environment.
  • Strong sense of urgency and ownership over critical problem areas.
  • CS, EE, or CE degree and/or a demonstrable, solid understanding of CS fundamentals.
  • Solid grasp of Linux systems and networking concepts.
  • Demonstrably effective communicator — in spoken and written forms.

Bonus:

  • Experience with tooling teams in startups.
  • Experience with modern platforms for code build, deploy, and container operations.
 
We encourage you to apply if this role excites you - even if you think you may not meet all of the qualifications. At Segment we live by four values: karma, drive, tribe and focus. We are always looking for outstanding individuals with diverse backgrounds and perspectives who embody these values. To learn more about life at Segment and our commitment to diversity, equity, and inclusion, visit our LinkedIn page. We’re excited to meet you!
 
Segment is an equal opportunity employer. We believe that everyone should receive equal consideration and treatment. Recruitment, hiring, placements, transfers, and promotions will happen based on qualifications for the positions being filled regardless of sex, gender identity, race, religious creed, color, national origin ancestry, age, physical disability, pregnancy, mental disability, or medical condition.
Senior Fullstack Engineer, Integrations
, US

Overview:

 
At Segment, we believe companies should be able to send their data wherever they want, whenever they want, with no fuss. We make this easy with a single platform that collects, stores, filters, transforms, and sends data to hundreds of business tools with the flip of a switch. Our goal is to make it easy to understand, extract value, and protect the integrity of data. We are creating a world where engineers spend their time working on their core product, letting us take care of the complexities of processing their customer data reliably at scale. We’re in the running to take over the entire customer data ecosystem, and we need the best people to take the market.
 

Who we are:

We develop our product across a number of teams based in San Francisco, Vancouver and New York each with a particular product feature focus area. From collecting data through analytics.js, to building powerful tools for data governance, to implementing algorithms that can handle complex billing scenarios at scale, to visualizing our customers’ data flow through our pipelines, to making complex implementations manageable for our large enterprise customers, to optimizing Sign Up conversion, our product teams are focused on creating fantastic user experiences. We're looking for talented engineers that are passionate about building world-class experiences that delight our customers.
 

What We Do:

 

Who We're Looking For:

You can turn complex business requirements into working software that our customers love to use.
  • You're proud of the code you write, but you're also pragmatic.
  • You know when it is time to refactor, and when it's time to ship.
  • You're focused, driven and can get challenging projects across the finish line.
  • You're empathetic, patient and love to help your teammates grow.
  • You have experience running apps in production and take software engineering practices seriously. You write meaningful tests and understand the value of great logging, proper monitoring and error tracking.

Projects You Could Work On:

  • We’re rethinking the integrations model completely, and re-imagining how customer data can flow in and out of where it’s most useful.
  • We’re owning Segment’s most valuable integrations to the largest, most complex tools.
  • We’re making it possible for customers and partners to build on Segment with Functions, Dev Center, and more.

Requirements:

  • You can write client side JavaScript or TypeScript using modern language features.
  • You have experience with a backend language of choice: Golang, Java, Ruby, Python, etc.
  • You have some familiarity with Golang or are excited to learn it.
  • Minimum of 3 years of industry experience in engineering or some cool projects on GitHub you think we'll love to check out.
  • You provide a deep understanding of the complexities involved in writing large single-page applications.
  • You show evidence of exposure to architectural patterns of high-scale web application (e.g., well-designed APIs, high volume data pipelines, efficient algorithms).
  • You value teamwork and agree with the statement that “a team is a group of people who own each others’ decisions”.
 
We encourage you to apply if this role excites you - even if you think you may not meet all requirements listed above. At Segment we live by four values: karma, drive, tribe and focus. We are always looking for outstanding individuals with diverse backgrounds and perspectives who embody these values. To learn more about life at Segment and our commitment to diversity, equity, and inclusion, visit our LinkedIn page. We’re excited to meet you!
 
Segment is an equal opportunity employer. We believe that everyone should receive equal consideration and treatment. Recruitment, hiring, placements, transfers, and promotions will happen based on qualifications for the positions being filled regardless of sex, gender identity, race, religious creed, color, national origin ancestry, age, physical disability, pregnancy, mental disability, or medical condition. 
Verified by
CEO and Co-founder
Sr. Manager, Acquisition
Engineering Manager
You may also like