Hyperpartitioning Apache Kafka: Embracing Conway’s Law with Pyrostore

Michael Drogalis - 05/21/2018

The shelf life of even the best architecture is limited by its ability to evolve alongside the business. Apache Kafka has seen massive adoption by companies not only as a product, but also as a design philosophy.

So, is Kafka well-equipped to cultivate evolution?

Organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations. — Conway's law

On one hand, Kafka’s immutable log enables adaptation through its ability to replay the past. Adding new functionality that requires a different dimension of your data? Great: rewind and restate the data. Hydrating a new service? Same deal: revisit history, in order.

Conducted by Kafka, our services grow and mature with internal autonomy because their consumption of data is decoupled, conceptually, from the data’s producer.

But, in practice, a latent dependency does exist between producers and consumers. When we create a topic, we must make an intrinsic choice about how to partition its data.

Cemented in time, this partition strategy must be accepted by all future consumers, whether or not it’s an optimal access pattern for them. This inflexibility makes evolution particularly awkward because partitioning plays two roles in Kafka: It’s both a unit of parallelism and a segmentation strategy.

Kafka partitioning as temporal coupling

Ideally, consumers could receive streams that contain only the data they’re interested in, irrespective of any temporal coupling with established partitions. Without this expressivity, we can only place the burden of these needs on the consuming service layer, for example to read back all the data in every partition, filtering out only relevant information.

This isn’t evolution. It’s shifting (and increasing) complexity within the system. At scale, this pressure threatens to constrain the organization at large.

We can solve this problem with a new streaming primitive.

Introducing hyperpartitioning

Pyrostore complements Kafka’s classic partitions with a familiar, yet more expressive counterpart.

Hyperpartitions are user-defined segmentations of records, which maintain order with respect to the source topic. Access them just like classic partitions, with full support for Kafka consumer semantics.

Check out the Pyrostore announcement post to learn more about what it delivers.

Compared to the classic partitioning strategy, hyperpartitioning:

  • Supports partition identification by numerous types, not just integers
  • Dynamically provisions new partitions when unique keys are activated
  • Sustains orders of magnitude more partitions
  • Serves multiple, independent hyperpartition strategies per topic

Expressive consumers, backed by serverless scalability

Hyperpartitions are backed by commodity storage, care of Pyrostore, rather than stored in Kafka. This means their read and write needs are off-loaded away from your brokers. Further, new hyperpartitioned topics can be spun-up directly from your Pyrostore-managed archive, for example by parallelizing the work across Amazon S3.

Hyperpartitioning data flow

Example: hyperpartitioning geography

As an example, let’s say that you have a topic of user location events, originally partitioned by user ID ranges. Suppose you have an application that’s interested in streams based on the user’s location. Hyperpartitioning allows you to construct a log-ordered partition per Geohash, many thousands in total, so that you can access a stream containing only location data you’re interested in:

// Set up our custom hyperpartitioner.
public class GeoHashingHyperPartitioner implements StringHyperPartition<Integer, Map> {
    private static final int PRECISION = 4;
    
    @Override
    public String hyperPartition(ConsumerRecord<Integer, Map> record) {
        double latitude = record.value().get("latitude");
        double longitude = record.value().get("longitude");
    
        return com.example.Geohash.fromCoordinates(latitude, longitude, PRECISION);
    }
}

By limiting the precision of the Geohash, we discretize each event’s coordinates, which means our hyperpartitions group events by geographical proximity.

In this example, our consumers gain a new access pattern: live and historical events, grouped spatially and ordered by event timestamp.

Hyperpartitioning data flow
// Instantiate a target hyperpartition keyed to the Geohash for South Bay, SF.
TopicHyperPartition<String> hyperPartition = new TopicHyperPartition<>("9q9jh");
// Create the consumer; the third type argument matches the hyperpartition type.
PyrostoreHyperPartitionConsumer<Integer, Map, String> consumer = new PyrostoreHyperPartitionConsumer(props);
// Map our hyperpartition to a TopicPartition, so that it's type-compatible
// with the standard Consumer interface implemented by the hyperpartition consumer.
TopicPartition partition = consumer.hyperPartitionToTopicPartition(hyperPartition);
consumer.assign(Arrays.asList(partition));
// Access records in this geography
ConsumerRecords<String, Map> eventsFromSouthBaySF = consumer.poll(500);
processEvents(eventsFromSouthBaySF);
// Commit our progress for this hyperpartition.
consumer.commit();

Having hyperpartitioned by geographical proximity, the consumer is able to consume only the events that it cares about. This alternative layout ensures optimal playback of the events along these dimensions, with Kafka-like commit and ordering semantics.

SQL + hyperpartitioning

Pyrostore also seamlessly enables SQL access to hyperpartitioned topics, via Amazon Athena. This opens the door to data exploration, analytics dashboards, topic-to-topic joins and other pull-oriented applications.

-- List the geographies with the most events in the past ten months.
SELECT key as geog, COUNT(*) as num_events
    FROM "geohashed_user_events"
    WHERE timestamp > 1500000000000
    GROUP BY geog
    ORDER BY num_events DESC
    LIMIT 50;

As with the consumer, hyperpartitioned data is efficiently queried by Athena along chosen hyperpartitions. In our example use case, given a query for a specific Geohash, Pyrostore's optimizations allow Athena to only scan data matching the supplied geography.

Embracing Conway’s law

To the extent that an organization is not completely flexible in its communication structure, that organization will stamp out an image of itself in every design it produces. — M. Conway in How Do Committees Invent?

Demobilized data reflects a demobilized organization. When teams can access the information they need for insights and can capitalize on new dimensions of their data’s value, this expressive flexibility is mirrored by the organization at large.

Watch how Bobby Calderwood puts Conway's law in the context of event sourcing and enterprise-scale Kafka.

It follows that this scalability of access can extend outside a business – to customers and the public. In today’s privacy-aware world, this presents an opportunity to espouse responsible personal data use, backed by a system that’s designed for it.

Our next posts will detail how Pyrostore and hyperpartitioning intersect with regulations like the GDPR. You can catch those posts by signing up to our new newsletter. If you’d like to talk more about using Pyrostore, ping us with the form below!

- Mike & the Pyrostore team