
How to use Apache Kafka to handle large applications with 100 million users

Author: Not bald programmer

In the world of big data and high-traffic applications, dealing with a large number of users at the same time is a huge challenge. Many of the world's most popular applications, serving more than 100 million users, rely on robust, scalable architectures to manage the flood of data and requests. A key player in these architectures is Apache Kafka, a distributed event streaming platform known for its high throughput, reliability, and scalability. In this article, we'll explore how large applications can use Apache Kafka to handle 100 million users, focusing on the architecture and features that make this possible.

Apache Kafka is one of the most popular platforms for managing large amounts of data and can serve applications with millions of users. It is used by many companies, including LinkedIn, Uber, and Netflix. For example:

LinkedIn uses Kafka for message exchange, activity tracking, and log metrics processing, processing 7 trillion messages per day across more than 100 Kafka clusters.

Uber uses Kafka to exchange data between users and drivers.

Netflix uses Kafka to track the activity of more than 230 million subscribers, including viewing history and movie likes and dislikes.

PhonePe uses Apache Kafka to manage approximately 100 billion events per day. Kafka is used to build real-time streaming applications and data pipelines that process data and move it between systems. Kafka also operates as a distributed system, can scale to handle many applications, and can store data as needed.

Other companies that use Kafka include Shopify, Spotify, Udemy, LaunchDarkly, Slack, Robinhood, and CRED.

Kafka is also used for messaging, metrics collection and monitoring, logging, event sourcing, commit logs, and real-time analytics.

Kafka works like a pub-sub message queue, allowing users to publish and subscribe to message streams. It also handles stream processing, dynamically computing derived datasets and streams, rather than just delivering batches of messages.

Kafka is also one of the most popular foundations for building event-driven architectures (EDAs) that deliver reliable, real-time services such as banking. An EDA lets data flow asynchronously between loosely coupled event producers and event consumers.

To size capacity for Kafka Streams, monitor the operating system's performance counters to determine whether CPU or network is becoming the bottleneck. If the CPU is the bottleneck, you can use more of it by adding more threads to the application. If the network is the bottleneck, you can add another machine and run a clone of the Kafka Streams application there; Kafka Streams automatically balances the load across all tasks running on all machines.
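
For the CPU-bound case, the change is a single Kafka Streams setting. Below is a minimal sketch (the application id, topic names, and broker address are placeholder assumptions) that raises num.stream.threads so one instance runs more stream tasks in parallel:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ScaledStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-activity-app");   // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");     // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // If the CPU is the bottleneck, give this instance more stream threads.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("user-activity").to("user-activity-copy"); // trivial pass-through topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

For the network-bound case, you would instead start a second copy of this same application on another machine with the same application.id; Kafka Streams then rebalances the stream tasks across both instances.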

Here are some examples of real-world use cases for Kafka:

Social Media Analytics

  • A social media platform uses Kafka Streams to process tweets in real time, run sentiment analysis, and power dashboards showing user sentiment, trends, and engagement metrics.

Manufacturing

  • Kafka is used to monitor production lines, equipment status, and inventory levels in real time. This helps optimize production efficiency, reduce downtime, and improve supply chain management.

Website Activity Tracking

  • When you visit a website and perform actions such as logging in, searching, or clicking on a product, Kafka captures these events. It then routes each message to a specific topic based on the type of event, as in the sketch below.
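
To make that flow concrete, here is a minimal producer sketch (the broker address, topic names, and event fields are illustrative assumptions, not part of any particular site's design). Each action is routed to a topic named after its event type, and records are keyed by user id so one user's events stay ordered within a partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ActivityTracker {
    private final KafkaProducer<String, String> producer;

    public ActivityTracker(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    /** Route each event to a topic named after its type, e.g. "page-views" or "searches". */
    public void track(String eventType, String userId, String payloadJson) {
        // Keying by userId keeps one user's events ordered within a single partition.
        producer.send(new ProducerRecord<>(eventType, userId, payloadJson));
    }

    public static void main(String[] args) {
        ActivityTracker tracker = new ActivityTracker("broker1:9092"); // placeholder broker
        tracker.track("page-views", "user-42", "{\"page\":\"/product/123\"}");
        tracker.track("searches", "user-42", "{\"query\":\"running shoes\"}");
        tracker.producer.close(); // flush buffered records and release resources
    }
}
```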

The role of Apache Kafka in high-volume applications

Apache Kafka is designed to handle real-time data streams at scale. Its primary role in high-volume applications is as the backbone of data stream processing, enabling efficient processing of millions of messages per second. Kafka's architecture allows it to seamlessly distribute data across multiple servers (called brokers), ensuring high availability and fault tolerance.

Key features of Apache Kafka:

  1. High throughput: Kafka can process millions of messages per second and support high-volume data streams without significant performance degradation.
  2. Scalability: Kafka clusters can be scaled with no downtime to accommodate more producers, consumers, and data as the user base grows (see the topic-creation sketch after this list).
  3. Durability and reliability: Kafka ensures that data is not lost by storing messages on disk and replicating them across the cluster for fault tolerance.
  4. Low latency: Kafka is optimized for low-latency messaging, which is critical for real-time applications and services.
  5. Data flow decoupling: Producers and consumers are decoupled, allowing processing applications to scale and evolve independently.
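
The scalability and durability points above are largely decided when a topic is created: the partition count spreads load across brokers and consumer instances, and a replication factor greater than one keeps copies of every partition on several brokers. A minimal sketch using the Kafka AdminClient (the topic name, partition count, and broker address are assumptions):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 48 partitions spread the write load across brokers and consumer instances;
            // replication factor 3 keeps every partition on three brokers for fault tolerance.
            NewTopic topic = new NewTopic("user-activity", 48, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```

The partition count can be raised later as traffic grows, but records already written are not redistributed, so it pays to start with some headroom.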

An overview of the architecture that handles 100 million users

To handle 100 million users, an application must have a well-thought-out architecture that leverages Kafka's features. Here's a simplified overview of such an architecture:

1. Data Ingestion Layer:

The data ingestion layer is where data enters the system from various sources. This can include user interactions, application logs, system metrics, and more. Kafka's role at this layer is to efficiently collect and buffer this incoming data, ensuring that it is ready for processing by downstream systems.
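
At this layer, ingest throughput mostly comes from producer-side batching and compression. A hedged configuration sketch follows; the values are illustrative starting points rather than tuned recommendations, and the broker address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IngestionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");    // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");              // wait for all in-sync replicas
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);            // give batches a little time to fill
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);    // larger batches, fewer requests
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // shrink network and disk usage

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("ingest-events", "user-42", "{\"action\":\"login\"}"));
        }
    }
}
```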

2. Stream Processing:

Once the data is in Kafka, it can be processed as a stream. This involves real-time analysis and transformation of data streams for purposes such as real-time analytics, monitoring, and triggering automated actions. Kafka Streams, a client library for building applications and microservices whose input and output data are stored in Kafka topics, is typically used at this layer.
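
As a sketch of what such a stream-processing job looks like (the topic names and the per-user counting logic are assumptions chosen for illustration), the application below reads raw activity events from one topic, counts them per user, and writes the running counts to another topic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class ActivityCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "activity-counts");   // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Input records are keyed by user id; count how many events each user has produced.
        KStream<String, String> events = builder.stream("user-activity");
        events.groupByKey()
              .count()
              .toStream()
              .to("activity-counts-by-user", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```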

3. Data Integration:

Kafka also facilitates data integration and serves as a central hub for the flow of data between different systems. This is critical in a microservices architecture, where different application components may need to communicate and share data effectively.

4. Scalable Storage:

Kafka provides persistent storage for data streams, allowing applications to retain and reprocess large amounts of data as needed. This is especially important for applications that need to retain historical data for analysis or regulatory compliance.
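
Retention is controlled per topic through configuration. The sketch below (the topic name, retention period, and broker address are assumptions) uses the AdminClient to keep 30 days of history on disk so downstream jobs can replay it when needed:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "user-activity");
            // Keep 30 days of history on disk so downstream jobs can replay it when needed.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Map.of(topic, Collections.singleton(setRetention));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```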

5. Event-Driven Architecture:

Kafka enables event-driven architecture, where producers publish events (messages) without needing to know the details of the consumer. This decoupling allows for greater scalability, flexibility, and elasticity of the system.
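
On the consuming side, the decoupling shows in how little a new service needs to know: just the topic name and its own consumer group id. A minimal sketch (all names are placeholders) of a downstream service subscribing to the activity topic without the producer ever being aware of it:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RecommendationService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder broker
        // A new downstream service is just a new group id on the same topic;
        // the producer never learns that this consumer exists.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "recommendation-service");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity"));
            while (true) { // sketch: runs until the process is stopped
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("user=%s event=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```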

How many messages per second can Apache Kafka process at Honeycomb?

At Honeycomb, the ingestion pipeline easily surpasses a million messages per second.

Liz Fong-Jones (Principal Developer Advocate at Honeycomb) has explained how Honeycomb manages its Kafka-based telemetry ingestion pipeline and scales its Kafka cluster.

So what is Honeycomb? Honeycomb is an observability platform that helps you visualize, analyze, and improve the quality and performance of your cloud applications. Their data volume grew 10-fold during the pandemic, while the total cost of ownership increased by only 20%.

To help understand the benchmark, let me give you a quick recap of what Kafka is and how it works. Kafka is a distributed messaging system originally built at LinkedIn; it is now part of the Apache Software Foundation and is used by many companies.

The general setup is very simple. The producer sends the records to the cluster, which keeps them and distributes them to consumers:

[Figure: producers send records to the Kafka cluster, which stores them and serves them to consumers]

The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers subscribe to one or more topics. A Kafka topic is just a sharded write-ahead log. The producer appends records to these logs, and the consumer subscribes to the changes. Each record is a key/value pair. The key is used to assign the record to a log partition (unless the publisher specifies the partition directly).
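
That key-to-partition rule is visible directly in the producer API: one ProducerRecord constructor takes only a key and lets the partitioner hash it to a partition, while another pins the record to an explicit partition. A small sketch (topic, key, and values are placeholders):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordExamples {
    public static void main(String[] args) {
        // The key "user-42" is hashed at send time to choose the partition for this record.
        ProducerRecord<String, String> keyed =
                new ProducerRecord<>("page-views", "user-42", "{\"page\":\"/home\"}");

        // Here the publisher pins the record to partition 3 directly, bypassing the key-based choice.
        ProducerRecord<String, String> pinned =
                new ProducerRecord<>("page-views", 3, "user-42", "{\"page\":\"/home\"}");

        System.out.println(keyed.partition());  // null: the partitioner decides at send time
        System.out.println(pinned.partition()); // 3
    }
}
```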

Here is a simple example of a single producer and a single consumer reading and writing from a topic with two partitions.

[Figure: a producer appending records to two partition logs while a consumer reads from them]

This figure shows a producer process appending records to the logs for the two partitions, and a consumer reading from the same logs. Each record in a log has an associated entry number, which we call an offset. The consumer uses this offset to describe its position in each log.

These partitions are distributed across a group of machines, allowing a single topic to store more data than any single machine can handle.

Note that, unlike most messaging systems, the logs are always persistent. Messages are written to the file system as soon as they are received. Messages are not deleted after they are read, but are retained according to some configurable service-level agreement, such as a few days or a week. This allows for use in situations where the data consumer may need to reload the data.

It also makes it possible to support space-efficient publish-subscribe, because there is a single shared log no matter how many consumers there are; in traditional messaging systems, there is usually a queue per consumer, so adding a consumer doubles the size of your data. This makes Kafka well suited for uses outside of traditional messaging, such as acting as a conduit for offline data systems like Hadoop. These offline systems may only load data at the interval of a periodic ETL cycle, or they may be shut down for a few hours for maintenance, during which Kafka can buffer even terabytes of unconsumed data if needed.

Kafka also replicates its logs across multiple servers for fault tolerance. An important architectural aspect of our replication implementation compared to other messaging systems is that replication is not an add-on that requires complex configuration and is only used in very specific cases. Instead, replication is assumed to be the default: we treat unreplicated data as a special case with a replication factor of exactly one.

When a producer publishes a message, it receives an acknowledgment that contains the record's offset. The first record published to a partition is given an offset of 0, the second record 1, and so on in an increasing sequence. The consumer reads data from a specified offset and saves its position in the log by periodically committing the offset, so that if the consumer instance crashes, another instance can recover from the saved position.
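
A hedged sketch of that commit cycle (the topic, group id, and broker address are placeholders): auto-commit is disabled, and the consumer commits its position only after it has finished processing a batch, so a replacement instance resumes from the last committed offset after a crash.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommittingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");         // placeholder group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit only after processing
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) { // sketch: runs until the process is stopped
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : batch) {
                    process(record); // application logic
                }
                consumer.commitSync(); // save our position; a restarted instance resumes from here
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}
```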

Test setup

For these tests, I have six machines, each with the following specifications:

  • Intel Xeon 2.5 GHz processor, six cores
  • Six 7200 RPM SATA hard drives
  • 32GB RAM
  • 1Gb Ethernet

The Kafka cluster is set up on three of these machines, with the six drives mounted directly and no RAID (JBOD style). The remaining three machines are used for ZooKeeper and for generating load.

A cluster of three machines isn't huge, but since we'll only test to a replication factor of three, that's what we need. Obviously, we can always add more partitions and spread the data across more machines to scale our cluster horizontally.

These machines aren't actually LinkedIn's usual Kafka hardware. Our production Kafka machines are better tuned for running Kafka, but they don't fit the "off-the-shelf" spirit of this test. Instead, I borrowed these from our Hadoop cluster, which is probably the cheapest of all our persistent systems. Hadoop's usage patterns are very similar to Kafka's, so this makes sense.

Okay, without further ado, here are the results!

Challenges and considerations

While Kafka provides the tools and capabilities needed to handle 100 million users, there are several challenges and considerations that must be addressed:

  • Data partitioning and sharding: Proper data partitioning is critical for balancing load and achieving high throughput in a Kafka cluster.
  • Monitoring and management: A Kafka cluster of this size requires sophisticated monitoring and management to ensure optimal performance and to resolve any issues that arise quickly.
  • Security: Handling the sensitive data of millions of users requires strong security measures, including encryption, access controls, and auditing (a client-side configuration sketch follows this list).
  • Compliance and data governance: Large-scale systems often have to comply with a variety of regulatory requirements, which calls for careful data governance practices.
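
For the security point above, much of the work on the client side is configuration: encrypting the connection and authenticating to the brokers. A hedged sketch of client properties (the SASL mechanism, truststore path, and credentials are placeholders; the right choices depend on how the cluster is secured):

```java
import java.util.Properties;

public class SecureClientConfig {
    /** Builds client properties for a SASL_SSL-secured cluster (all values are placeholders). */
    public static Properties secureProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("security.protocol", "SASL_SSL");      // TLS encryption plus SASL authentication
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"app-user\" password=\"app-secret\";");
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        return props;
    }
}
```

These properties can then be merged into the producer, consumer, or admin client configuration shown in the earlier sketches.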

Conclusion

Apache Kafka's architecture and capabilities make it an excellent choice for applications that need to handle 100 million or more users. By effectively managing high-volume data streams, ensuring reliability and scalability, and supporting a decoupled, event-driven architecture, Kafka enables applications to scale to meet the needs of a large user base. However, leveraging Kafka at this scale also requires careful planning, monitoring, and management to address the associated challenges and ensure the resiliency and performance of the system.
