Apache Spark vs InfluxDB: What are the differences?

Introduction

In this article, we will discuss the key differences between Apache Spark and InfluxDB.

  1. Scalability and Workloads: Apache Spark is designed for processing and analyzing large-scale distributed datasets, making it suitable for big data processing and complex analytics. On the other hand, InfluxDB is a time series database that is optimized for handling time-stamped or time-series data. It excels in storing and retrieving high volumes of time-series data efficiently.

  2. Data Model: Apache Spark supports a flexible data model that allows the processing of structured, semi-structured, and unstructured data. It can handle a wide range of data formats and structures. In contrast, InfluxDB has a specific data model focused on time-series data. It provides efficient storage and retrieval of time-series data along with built-in support for downsampling and data retention policies.

  3. Real-time Data Processing: Apache Spark provides real-time stream processing capabilities through its Spark Streaming module. It can process continuous streams of data in real-time and apply transformations on the fly. InfluxDB, on the other hand, is optimized for high-speed ingestion and querying of time-series data, making it ideal for real-time monitoring and analytics scenarios.

  4. Analytics Capabilities: Apache Spark offers a rich set of built-in analytics and machine learning libraries. It provides a wide range of algorithms and tools for data exploration, statistical analysis, and machine learning. InfluxDB, on the other hand, primarily focuses on efficient storage and retrieval of time-series data. While it doesn't provide built-in analytics capabilities like Spark, it can be integrated with other tools for performing analytics on the stored time-series data.

  5. Data Processing Paradigm: Apache Spark supports various data processing paradigms, including batch processing, interactive queries, streaming, and machine learning. It provides a unified programming model for all these paradigms. InfluxDB, on the other hand, is primarily focused on time-series data processing and doesn't support other paradigms like batch processing or machine learning out of the box.

  6. Ecosystem and Integration: Apache Spark has a vibrant and extensive ecosystem with support for various connectors, libraries, and tools. It integrates with other big data technologies such as Hadoop, HBase, and Kafka. InfluxDB's ecosystem, while not as extensive as Spark's, includes integrations with popular tools like Grafana for data visualization and Kapacitor for real-time data processing.

In summary, Apache Spark is a versatile big data processing platform with support for various data types and processing paradigms, while InfluxDB is a specialized time-series database optimized for efficient storage and retrieval of time-series data.
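
The downsampling and retention policies mentioned in item 2 can be illustrated without any database. The following is a minimal pure-Python sketch of the two ideas, not InfluxDB's implementation; the function names `downsample` and `apply_retention` and the sample readings are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def apply_retention(points, now, max_age_seconds):
    """Keep only points that are at most max_age_seconds old."""
    return [(ts, value) for ts, value in points if now - ts <= max_age_seconds]


# Hypothetical readings: (seconds since start, sensor value).
readings = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0), (120, 70.0)]
per_minute = downsample(readings, 60)
recent = apply_retention(readings, now=120, max_age_seconds=60)
```

Here `per_minute` holds one averaged value per 60-second bucket, and `recent` keeps only the readings from the last 60 seconds. A time-series database applies the same two operations continuously and at much larger scale.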

Advice on InfluxDB and Apache Spark
Needs advice on Hadoop, InfluxDB, and Kafka

I have a lot of data that's currently sitting in a MariaDB database: a lot of tables that weigh 200 GB with indexes. Most of the large tables have a date column which is always filtered, but there are usually 4-6 additional columns that are filtered and used for statistics. I'm trying to figure out the best tool for storing and analyzing large amounts of data, preferably self-hosted or otherwise cheap. The current problem I'm running into is speed. Even with pretty good indexes, loading a large dataset is pretty slow.

Replies (1)
Recommends Druid

Druid could be an amazing solution for your use case. My understanding and assumption is that you are looking to export your data from MariaDB for an analytical workload. Druid can be used as a time-series database as well as a data warehouse, and it can be scaled horizontally as your data grows. It's pretty easy to set up in any environment (cloud, Kubernetes, or a self-hosted *nix system). Some important features make it a perfect solution for your use case:

  1. It supports streaming ingestion (Kafka, Kinesis) as well as batch ingestion (files from local and cloud storage, or databases such as MySQL and Postgres). In your case that is MariaDB, which uses the same drivers as MySQL.
  2. It is a columnar database, so you can query just the fields you need, which makes your queries faster.
  3. Druid intelligently partitions data based on time, so time-based queries are significantly faster than in traditional databases.
  4. You scale up or down by just adding or removing servers, and Druid automatically rebalances. Its fault-tolerant architecture routes around server failures.
  5. It provides an amazing centralized UI to manage data sources, queries, and tasks.
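
To see why time-based partitioning helps a query that always filters on a date column, here is a small pure-Python sketch. It only illustrates the general idea and is not how Druid is implemented; the row layout and the names `partition_by_day` and `sum_column` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def partition_by_day(rows):
    """Group rows into one partition per calendar day."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["day"]].append(row)
    return partitions


def sum_column(partitions, start, end, column):
    """Sum one column over [start, end], reading only matching partitions."""
    total = 0
    scanned = 0
    for day, rows in partitions.items():
        if start <= day <= end:
            scanned += 1
            # Only the requested column is read, as in a columnar layout.
            total += sum(row[column] for row in rows)
    return total, scanned


rows = [
    {"day": date(2024, 1, 1), "amount": 5},
    {"day": date(2024, 1, 1), "amount": 7},
    {"day": date(2024, 1, 2), "amount": 11},
    {"day": date(2024, 1, 3), "amount": 13},
]
partitions = partition_by_day(rows)
total, scanned = sum_column(partitions, date(2024, 1, 2), date(2024, 1, 3), "amount")
```

The query touches two of the three daily partitions and never reads the rows for the first day, which is the effect a date filter has on a time-partitioned store.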

Nilesh Akhade
Technical Architect at Self Employed · 5 upvotes · 522.4K views

We have a Kafka topic containing events of type A and type B. We need to perform an inner join on both types of events using a common field (the primary key). The joined events are to be inserted into Elasticsearch.

In the usual case, type A and type B events with the same key arrive within 15 minutes of each other. But in some cases they may be far apart, let's say 6 hours. Sometimes an event of one of the types never arrives.

In all cases, we should be able to find joined events instantly after they are joined, and not-joined events within 15 minutes.

Replies (2)
Recommends Elasticsearch

The first solution that came to me is to use upsert to update Elasticsearch:

  1. Use the primary-key as ES document id
  2. Upsert the records to ES as soon as you receive them. As you are using upsert, the 2nd record of the same primary-key will not overwrite the 1st one, but will be merged with it.

Cons: The load on ES will be higher, due to upsert.
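
The upsert approach above can be sketched as a request body for the Elasticsearch bulk API, using `update` actions with `doc_as_upsert`. This is a minimal sketch that only builds the payload and sends nothing; the index name `joined-events`, the field name `primary_key`, and the function name are assumptions for illustration.

```python
import json


def bulk_upsert_payload(index, events, key_field="primary_key"):
    """Build newline-delimited JSON for a _bulk request of upserts."""
    lines = []
    for event in events:
        # The primary key becomes the document id, so A and B share a document.
        action = {"update": {"_index": index, "_id": event[key_field]}}
        # doc_as_upsert: create the document if missing, otherwise merge fields.
        body = {"doc": event, "doc_as_upsert": True}
        lines.append(json.dumps(action))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"  # bulk bodies end with a newline


events = [
    {"primary_key": "42", "a_field": 1},  # hypothetical type A event
    {"primary_key": "42", "b_field": 2},  # hypothetical type B event
]
payload = bulk_upsert_payload("joined-events", events)
```

Because both events target document id `42`, the second update adds `b_field` to the document created by the first instead of replacing it.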

To use Flink:

  1. Create a KeyedStream by calling keyBy on the primary-key
  2. In the ProcessFunction, save the first record in a State. At the same time, create a Timer for 15 minutes in the future
  3. When the 2nd record comes, read the 1st record from the State, merge those two, and send out the result, and clear the State and the Timer if it has not fired
  4. When the Timer fires, read the 1st record from the State and send out as the output record.
  5. Have a 2nd Timer of 6 hours (or more) if you are not using Windowing to clean up the State

Pro: this fits well if you already have Flink ingesting this stream. Otherwise, I would just go with the 1st solution.
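
The state-and-timer steps above can be simulated in plain Python to check the logic before writing a Flink job. This sketch does not use the Flink API; it assumes events arrive in timestamp order, that each key has at most one event of each type, and that events are `(timestamp_seconds, key, payload)` tuples, a shape chosen here for illustration.

```python
EMIT_AFTER = 15 * 60        # emit an unmatched event after 15 minutes
EXPIRE_AFTER = 6 * 60 * 60  # drop its state after 6 hours


def join_events(events):
    """Return (emit_time, key, payload, joined) tuples for ordered events."""
    state = {}   # key -> first event's payload plus its two timer deadlines
    output = []

    def fire_timers(now):
        for key in list(state):
            entry = state[key]
            # First timer: emit the lone event, but keep it for a late match.
            if not entry["emitted"] and entry["emit_at"] <= now:
                output.append((entry["emit_at"], key, entry["payload"], False))
                entry["emitted"] = True
            # Second timer: clean up the state.
            if entry["expire_at"] <= now:
                del state[key]

    for ts, key, payload in events:
        fire_timers(ts)
        if key in state:
            # Second record: merge with the first and clear state and timers.
            first = state.pop(key)
            output.append((ts, key, {**first["payload"], **payload}, True))
        else:
            state[key] = {
                "payload": payload,
                "emit_at": ts + EMIT_AFTER,
                "expire_at": ts + EXPIRE_AFTER,
                "emitted": False,
            }
    fire_timers(float("inf"))  # flush remaining timers at end of input
    return output


result = join_events([
    (0, "k1", {"a": 1}),
    (60, "k1", {"b": 2}),     # matches within a minute
    (100, "k2", {"a": 3}),
    (3700, "k2", {"b": 4}),   # matches after an hour
    (5000, "k3", {"a": 5}),   # never matched
])
```

In this run `k1` is emitted once as joined, `k2` is emitted unjoined at the 15-minute mark and again as joined when its partner arrives, and `k3` is emitted unjoined only.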

Akshaya Rawat
Senior Specialist Platform at Publicis Sapient · 3 upvotes · 365.9K views
Recommends Apache Spark

Please refer to the "Structured Streaming" feature of Spark, in particular "Stream-stream Joins" at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins . In short, you need to "define watermark delays on both inputs" and "define a constraint on time across the two inputs".
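
A rough PySpark sketch of those two steps follows. It assumes two streaming DataFrames whose columns are named `a_key`/`a_time` and `b_key`/`b_time` (hypothetical names), and it uses the 6-hour worst case from the question for both the watermark delay and the time constraint. PySpark is imported inside the function, so the condition builder can be checked without a Spark installation.

```python
def join_condition(max_gap):
    """SQL condition: same key, and B within max_gap of A in either direction."""
    return (
        "a_key = b_key AND "
        f"b_time >= a_time - interval {max_gap} AND "
        f"b_time <= a_time + interval {max_gap}"
    )


def join_streams(events_a, events_b, watermark_delay="6 hours", max_gap="6 hours"):
    """Inner-join two streaming DataFrames with watermarks and a time bound."""
    from pyspark.sql.functions import expr  # lazy import; requires PySpark

    # Watermarks tell Spark how late events may be, so old state can be dropped.
    a = events_a.withWatermark("a_time", watermark_delay)
    b = events_b.withWatermark("b_time", watermark_delay)
    # The time constraint bounds how long each side must be buffered.
    return a.join(b, expr(join_condition(max_gap)))


condition = join_condition("6 hours")
```

Note that an inner join emits only matched pairs, so on its own it does not cover the requirement to surface not-joined events within 15 minutes.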

Needs advice on InfluxDB, MongoDB, and TimescaleDB

We are building an IoT service with heavy write throughput and fewer reads (we need to downsample records). We want good reliability when it comes to data, and we want policy-based data retention.

So we are looking for the best underlying DB for ingesting a lot of data and querying it easily.

Replies (3)
Yaron Lavi
Recommends PostgreSQL

We had a similar challenge. We started with DynamoDB, Timescale, and even InfluxDB and Mongo, and eventually settled on PostgreSQL. Assuming the inbound data pipeline is queued (for example, Kinesis/Kafka -> S3 -> and some Lambda functions), PostgreSQL gave us better performance by far.

Recommends Druid

Druid is amazing for this use case and is a cloud-native solution that can be deployed on any cloud infrastructure or on Kubernetes.

  • Easy to scale horizontally
  • Column-oriented database
  • SQL to query data
  • Streaming and batch ingestion
  • Native search indexes

It can work as a time-series database or a data warehouse, and it has time-optimized partitioning.

Ankit Malik
Software Developer at CloudCover · 3 upvotes · 324.5K views
Recommends Google BigQuery

If you want a serverless solution with a lot of storage and SQL-like querying, Google BigQuery is the best solution for that.

Decisions about InfluxDB and Apache Spark
Benoit Larroque
Principal Engineer at Sqreen · 2 upvotes · 134.6K views

I chose TimescaleDB to be the backend system of our production monitoring system. We needed to be able to keep track of multiple high-cardinality dimensions.

The drawback of this decision is that our monitoring system is a bit more ad hoc than it used to be (with New Relic Insights).

We are combining this with Grafana for display and Telegraf for data collection.

Pros of InfluxDB
  • Time-series data analysis (58)
  • Easy setup, no dependencies (30)
  • Fast, scalable & open source (24)
  • Open source (21)
  • Real-time analytics (20)
  • Continuous Query support (6)
  • Easy Query Language (5)
  • HTTP API (4)
  • Out-of-the-box, automatic Retention Policy (4)
  • Offers Enterprise version (1)
  • Free Open Source version (1)

Pros of Apache Spark
  • Open-source (61)
  • Fast and Flexible (48)
  • One platform for every big data problem (8)
  • Great for distributed SQL like applications (8)
  • Easy to install and to use (6)
  • Works well for most data science use cases (3)
  • Interactive Query (2)
  • Machine learning library, streaming in real time (2)
  • In-memory computation (2)


Cons of InfluxDB
  • Instability (4)
  • Proprietary query language (1)
  • HA or Clustering is only in paid version (1)

Cons of Apache Spark
  • Speed (4)


What is InfluxDB?

InfluxDB is a scalable datastore for metrics, events, and real-time analytics. It has a built-in HTTP API so you don't have to write any server side code to get up and running. InfluxDB is designed to be scalable, simple to install and manage, and fast to get data in and out.
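
As an illustration of the built-in HTTP API, the sketch below formats one point in InfluxDB's line protocol and builds (but does not send) a write request. It assumes a 1.x-style `/write?db=...` endpoint on `localhost:8086`; newer versions use a different path and authentication. The helper names, the database name `metrics`, and the sample point are hypothetical, and the formatter handles only numeric float fields with no special characters to escape.

```python
from urllib.parse import urlencode
from urllib.request import Request


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp."""
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={value}" for key, value in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def build_write_request(base_url, database, lines):
    """Build a POST request for a 1.x-style /write endpoint without sending it."""
    url = f"{base_url}/write?{urlencode({'db': database})}"
    return Request(url, data="\n".join(lines).encode("utf-8"), method="POST")


line = to_line_protocol("cpu", {"host": "server01"}, {"usage": 0.64}, 1609459200000000000)
request = build_write_request("http://localhost:8086", "metrics", [line])
```

Passing `request` to `urllib.request.urlopen` would perform the write against a running server; that step is left out so the sketch has no side effects.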

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.


What are some alternatives to InfluxDB and Apache Spark?
TimescaleDB
TimescaleDB: An open-source database built for analyzing time-series data with the power and convenience of SQL — on premise, at the edge, or in the cloud.
Redis
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams.
MongoDB
MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.
Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine capable of storing data and searching it in near real time. Elasticsearch, Kibana, Beats and Logstash are the Elastic Stack (sometimes called the ELK Stack).
Prometheus
Prometheus is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.