It’s time to realize Apache Kafka’s full potential, spanning past and present

Michael Drogalis - 05/10/2018

Kafka has a broad sweet spot, one that grows naturally with your use cases and with the organization that runs it. You can run it as a simple message bus. It can drive reactive microservices. In its most sophisticated form, Kafka promises to be the central nervous system (CNS) for a business, turning the database inside out.

We can go further. Kafka’s power inspires an ideal system, where all data is stored in Kafka, forever. In such a setting, topics spanning live and historical data can be seamlessly consumed (and queried) as a single source of truth, or replayed in one pass without gluing together auxiliary storage. With a CNS built on event streams that keep our data indefinitely, where else could Kafka take us?

Far too often, reality falls short of this vision. In practice, we subject our data to a memory cliff: the tradeoffs we make in operating Kafka mean that older data stops being streamable. To balance cost against operational feasibility, we ask Kafka to forget older data through retention policies. That data is either thrown away or cast off to another storage medium.

Three strategies for Kafka retention: throw away, ship offline and pay for scale

If instead we aim to avoid the memory cliff and keep all our data streaming and transparently replayable, our only recourse is to embrace the heavy cost burden of scaling Kafka.

What makes scaling Kafka expensive?
  • One size fits all for disk types: No way to choose faster, more expensive disks for newer data and cheaper disks for historical data.
  • Operational complexity: Must manage capacity constraints and partitions’ on-disk placement so adequate space exists for new records.
  • Resource competition: Reading & replaying historical streams can negatively affect throughput and latency of live production services.
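To make the cost pressure concrete, here is a rough back-of-the-envelope sketch of how disk needs grow with retention. The ingest rate and replication factor are hypothetical numbers picked for illustration, and the formula ignores compression and index overhead.

```java
class RetentionSizing {
    /**
     * Approximate cluster-wide disk bytes needed to retain a stream:
     * ingest rate x retention window x replication factor.
     * Ignores compression and index overhead (a simplifying assumption).
     */
    static long requiredBytes(long ingestBytesPerSec, long retentionDays, int replicationFactor) {
        long secondsRetained = retentionDays * 24L * 60L * 60L;
        return ingestBytesPerSec * secondsRetained * replicationFactor;
    }

    public static void main(String[] args) {
        long ingest = 10_000_000L; // hypothetical: 10 MB/s of incoming records
        int replication = 3;       // hypothetical replication factor

        // Compare a one-week retention window with a full year.
        for (long days : new long[] {7, 365}) {
            double terabytes = requiredBytes(ingest, days, replication) / 1e12;
            System.out.printf("%d days -> %.1f TB%n", days, terabytes);
        }
    }
}
```

With these made-up inputs, a week of retention needs roughly 18 TB of broker disk, while a year needs roughly 946 TB, all of it on the same class of disk.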

But what if we didn’t have to compromise?

Pyrostore: streaming storage that never forgets.

Today, I’m excited to publicly announce Pyrostore: a new streaming storage product that complements Kafka with inexpensive, virtually limitless storage.

  • Wield Kafka’s streaming abstractions across live & historical data.
  • Replay your entire dataset seamlessly, losslessly, on demand.
  • Never settle for deleting or demobilizing data.

Pyrostore minimizes costs & risks because we designed it from the ground up to use cloud object stores like Amazon S3 and Google Cloud Storage. Our novel storage strategy combines content-addressing with tree-based indexing to offer Kafka’s semantics backed by these commodity services.
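To give a feel for what combining content-addressing with tree-based indexing can look like, here is a toy sketch. It is not Pyrostore’s implementation: the class and method names are hypothetical, an in-memory map stands in for the object store, and a sorted map stands in for the tree index.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class ContentAddressedLog {
    // Stand-in for an object store bucket: content key -> blob.
    private final Map<String, byte[]> objectStore = new HashMap<>();
    // Ordered index: base offset of a segment -> content key of that segment.
    private final TreeMap<Long, String> index = new TreeMap<>();

    /** Stores a segment under the SHA-256 of its bytes and indexes it by base offset. */
    String putSegment(long baseOffset, byte[] segment) {
        String key = sha256Hex(segment);
        objectStore.putIfAbsent(key, segment); // identical content is stored only once
        index.put(baseOffset, key);
        return key;
    }

    /** Returns the segment with the greatest base offset not exceeding the given offset, or null. */
    byte[] segmentFor(long offset) {
        Map.Entry<Long, String> entry = index.floorEntry(offset);
        return entry == null ? null : objectStore.get(entry.getValue());
    }

    int storedObjects() {
        return objectStore.size();
    }

    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every Java platform
        }
    }

    public static void main(String[] args) {
        ContentAddressedLog log = new ContentAddressedLog();
        String key = log.putSegment(0L, "records 0..99".getBytes());
        log.putSegment(100L, "records 100..199".getBytes());
        System.out.println("segment 0 key: " + key);
        System.out.println("offset 150 lives in: " + new String(log.segmentFor(150L)));
    }
}
```

The appeal of this shape is that blobs are immutable and named by their content, which suits object stores well, while the ordered index keeps offset lookups cheap.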

These underpinnings furnish operational advantages including resource scalability, high availability and cross-region replication. Meanwhile, read scalability for consumers can dramatically improve, because load hits cloud storage instead of Kafka.

Pyrostore's unified data flow creates a continuum between Kafka, commodity storage and your applications

Pyrostore integrates with existing Kafka clusters and applications with a minimal footprint. Simply start the Archiver Docker container and point it at an existing Kafka topic. Pyrostore will archive the topic’s records into cloud storage in an efficient, open format.

Inside your consuming applications, integration is dead simple:

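Properties props = new Properties();
// Bucket that holds the archived topic data.
props.put(PyrostoreConsumerConfig.S3_BUCKET, "s3://my-bucket");
// The usual Kafka deserializer settings apply unchanged.
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, IntegerDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

// Implements Kafka's Consumer interface, so it can stand in for a KafkaConsumer.
Consumer<Integer, String> consumer = new PyrostoreKafkaS3Consumer<>(props);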

A single source of truth. Full spectrum access.

Pyrostore’s consumer reads records directly out of cloud storage, and it’s intelligent enough to cross its reads back into Kafka when records are not yet available in the archive. Moreover, Pyrostore’s consumer implements the Kafka Consumer interface in its entirety, so it plays well with the existing Kafka ecosystem of tools.
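The archive-first, Kafka-fallback read path can be pictured with a small sketch. This is only an illustration of the idea, not Pyrostore’s code: RecordSource and TieredReader are hypothetical names, and plain maps stand in for the archive and the live topic.

```java
import java.util.Map;
import java.util.Optional;

class TieredReader {
    /** Minimal read interface; a hypothetical stand-in, not a Kafka or Pyrostore type. */
    interface RecordSource {
        Optional<String> read(long offset);
    }

    private final RecordSource archive;
    private final RecordSource live;

    TieredReader(RecordSource archive, RecordSource live) {
        this.archive = archive;
        this.live = live;
    }

    /** Prefers the archive; falls back to the live log when the record is not archived yet. */
    Optional<String> read(long offset) {
        Optional<String> archived = archive.read(offset);
        return archived.isPresent() ? archived : live.read(offset);
    }

    public static void main(String[] args) {
        // Offsets 0 and 1 are already archived; offset 2 exists only in the live log.
        Map<Long, String> archived = Map.of(0L, "a", 1L, "b");
        Map<Long, String> liveLog = Map.of(1L, "b", 2L, "c");

        TieredReader reader = new TieredReader(
                offset -> Optional.ofNullable(archived.get(offset)),
                offset -> Optional.ofNullable(liveLog.get(offset)));

        for (long offset = 0; offset < 4; offset++) {
            System.out.println(offset + " -> " + reader.read(offset).orElse("<none>"));
        }
    }
}
```

From the application’s point of view there is a single stream of offsets; which tier served a given record is an internal detail.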

Pyrostore’s approach to data storage has the additional benefit that Kafka records can be leveraged by Amazon’s fleet of services. The approach is designed specifically with Amazon’s serverless SQL tool, Amazon Athena, in mind.

As a complement to KSQL’s continuous query style, Athena offers highly parallel, ad-hoc query power. Plug into other services, like Amazon QuickSight for BI and Amazon SageMaker for machine learning, all operating over one data set.

What about Kafka Connect?

Kafka Connect with cloud storage is great for some use cases. But it misses valuable opportunities.

In particular, it lacks a direct mechanism to deliver historical data to consumers or Kafka Streams. Manually flowing data back through Kafka is technically complex, and it can quickly exceed capacities for throughput & disk space.

Time to actualize Kafka’s potential …

We’re excited to see more teams join the ranks of those already running Pyrostore. For organizations streaming at company scale (or aiming to), it’s now practical to command a streaming central nervous system that maximizes data’s past, present & future value. Stay tuned for more posts about how Pyrostore works and the additional opportunities it unlocks.

If you’re interested in using Pyrostore, let’s get in touch!

- Mike & the Pyrostore team