The operation of apps, web services and server applications, to name but a few, presents a variety of challenges for those running them. One of the most common is ensuring that data streams are transmitted without interruption and processed as quickly and efficiently as possible. The messaging and streaming application Apache Kafka is a piece of software that greatly simplifies this task. Originally developed as a message queuing service for LinkedIn, this open source solution now provides a comprehensive platform for storing, transferring and processing data.

What is Kafka?

Apache Kafka is a platform-independent open source application belonging to the Apache Software Foundation which focuses on data stream processing. The project was originally launched in 2011 by LinkedIn, the company behind the social network for professionals bearing the same name, with the aim of developing a message queue. Since its open source release under the Apache 2.0 license, the software’s capabilities have been greatly extended, transforming what was a simple queuing application into a powerful streaming platform with a wide range of functions. It is used by well-known companies such as Netflix, Microsoft and Airbnb.

Founded by the original developers of Apache Kafka in 2014, Confluent delivers the most complete version of Apache Kafka with Confluent Platform. It extends the program with additional functions, some of which are also open source, while others are commercial.

What are Apache Kafka’s core functions?

Apache Kafka is primarily designed to optimise the transmission and processing of data streams transferred via a direct connection between the data receiver and data source. Kafka acts as a messaging instance between the sender and the receiver, providing solutions to the common challenges encountered with this type of connection.

For example, the Apache platform provides a solution to the inability to cache data or messages when the receiver is not available (e.g. due to network problems). In addition, a properly configured Kafka queue prevents the sender from overloading the receiver, which happens whenever information is sent over a direct connection faster than it can be received and processed. Lastly, the Kafka software is also ideal for situations in which the target system receives a message but crashes while processing it: whereas the sender would normally assume that processing had succeeded, Apache Kafka reports the failure to the sender.
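The decoupling described above can be illustrated with a short, hypothetical Python sketch (not the real Kafka API — the class and method names here are invented for illustration): the broker keeps an append-only log and tracks a read offset per consumer, so a producer can keep sending while a consumer is offline, and the consumer can later catch up at its own pace.

```python
from collections import deque

class Broker:
    """Toy broker that buffers messages for consumers that are offline or slow."""
    def __init__(self):
        self.log = deque()   # append-only buffer of messages
        self.offsets = {}    # consumer name -> next position to read

    def send(self, message):
        # The producer appends at its own pace; the broker never blocks it.
        self.log.append(message)

    def poll(self, consumer, max_records=10):
        # Each consumer reads from its own offset, so it can catch up after downtime.
        pos = self.offsets.get(consumer, 0)
        batch = list(self.log)[pos:pos + max_records]
        self.offsets[consumer] = pos + len(batch)
        return batch

broker = Broker()
for i in range(5):                  # producer keeps sending while the consumer is offline
    broker.send(f"event-{i}")
print(broker.poll("analytics", 3))  # consumer comes back: ['event-0', 'event-1', 'event-2']
print(broker.poll("analytics", 3))  # and catches up: ['event-3', 'event-4']
```

Because reading only advances a per-consumer offset, a slow consumer can never be overloaded: it simply requests the next batch whenever it is ready.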

Unlike pure message queuing services, Apache Kafka is fault tolerant, meaning it can continue processing messages and data even if individual components fail. Combined with its high scalability and its ability to distribute transmitted information across any number of systems (a distributed transaction log), this makes Apache Kafka an excellent solution for all services which need to store and process data quickly while maintaining high availability.

An overview of the Apache Kafka architecture

Apache Kafka runs as a cluster on one or more servers that can span multiple data centres. The individual servers in the cluster, known as brokers, store incoming data streams and categorise them into topics. The data is divided into partitions, replicated and distributed within the cluster, and assigned a timestamp. As a result, the streaming platform ensures high availability and fast read access. Apache Kafka differentiates between normal topics and compacted topics. In normal topics, Kafka may delete messages as soon as the storage period or storage limit has been exceeded. Compacted topics, by contrast, are subject to neither limit: instead, Kafka retains only the most recent message for each key.
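A simplified pure-Python model can make these ideas concrete — partitioning by key, per-record timestamps, and log compaction. The class and method names are invented for illustration and do not correspond to Kafka’s actual API.

```python
from time import time

class Topic:
    """Toy topic split into partitions; each record gets a key, value and timestamp."""
    def __init__(self, partitions=3, compacted=False):
        self.partitions = [[] for _ in range(partitions)]
        self.compacted = compacted

    def append(self, key, value):
        # Records with the same key land in the same partition (hash partitioning),
        # which preserves per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append({"key": key, "value": value, "timestamp": time()})

    def compact(self):
        # In a compacted topic, only the latest record per key is retained.
        if not self.compacted:
            return
        for i, part in enumerate(self.partitions):
            latest = {r["key"]: r for r in part}  # later records overwrite earlier ones
            self.partitions[i] = list(latest.values())

t = Topic(partitions=2, compacted=True)
t.append("user-1", "logged in")
t.append("user-1", "logged out")   # same key -> same partition, newer record
t.append("user-2", "logged in")
t.compact()                        # only the latest record per key survives
```

After compaction, the topic holds exactly one record per key — here, two records in total, with "logged out" as the surviving value for "user-1".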

Applications which write data to a Kafka cluster are called producers, while applications which read data from a Kafka cluster are called consumers. The central component accessed by producers and consumers when processing data streams is a Java library called Kafka Streams. Because it supports transactional writes, it ensures that each message is delivered exactly once, with no duplicates. This is called exactly-once delivery.

Note

The Kafka Streams Java library is the recommended standard solution for processing data in Kafka clusters. However, you can use Apache Kafka with other data stream processing systems as well.
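The core idea behind exactly-once delivery is that the broker can recognise and discard retried writes. The toy Python sketch below is loosely modelled on Kafka’s idempotent producer, which tags each write with a producer ID and a sequence number; it is a conceptual illustration, not Kafka’s actual implementation.

```python
class IdempotentLog:
    """Toy log that deduplicates retried writes by (producer id, sequence number)."""
    def __init__(self):
        self.records = []
        self.last_seq = {}   # producer id -> highest sequence number accepted

    def write(self, producer_id, seq, value):
        # A retried write reuses its original sequence number, so a network
        # retry after a lost acknowledgement never creates a duplicate record.
        if seq <= self.last_seq.get(producer_id, -1):
            return False     # duplicate: silently dropped
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return True

log = IdempotentLog()
log.write("p1", 0, "order-created")
log.write("p1", 1, "order-paid")
log.write("p1", 1, "order-paid")   # retry after a timeout: deduplicated
```

After the retry, the log still contains each message exactly once.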

The technical basics: Kafka’s interfaces

The software offers five basic interfaces which give applications access to Apache Kafka:

  • Kafka Producer: The Kafka Producer API allows applications to send data streams to the broker(s) in an Apache Kafka cluster, where they are categorised and stored in the previously mentioned topics.
  • Kafka Consumer: The Kafka Consumer API gives Apache Kafka consumers read access to data stored in the cluster’s topics.
  • Kafka Streams: The Kafka Streams API allows an application to act as a stream processor, converting incoming data streams into outgoing data streams.
  • Kafka Connect: The Kafka Connect API makes it possible to build reusable producers and consumers which connect Kafka topics with existing applications or database systems.
  • Kafka AdminClient: The Kafka AdminClient API makes it easy to manage and inspect Kafka clusters.
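Clients using these APIs are driven by per-client configuration. As a rough illustration, a Java producer might be configured with properties along these lines (the hostnames and values here are placeholders, not a recommended production setup):

```properties
# Brokers used for the initial connection to the cluster (placeholder hosts)
bootstrap.servers=broker1:9092,broker2:9092
# How record keys and values are turned into bytes on the wire
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
# Wait until all in-sync replicas have acknowledged each write
acks=all
# Let the broker deduplicate retried sends (no duplicates per partition)
enable.idempotence=true
```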

Communication between client applications and individual servers in Apache Kafka clusters occurs via a simple, powerful and language-independent protocol based on TCP. The developers provide a Java client for Apache Kafka by default, but clients are also available in a variety of other languages such as PHP, Python, C/C++, Ruby, Perl and Go.

Use case scenarios for Apache Kafka

From the outset, Apache Kafka was designed for high read and write throughput. When combined with the previously mentioned APIs and its high flexibility, scalability and fault tolerance, this makes the open source software appealing for a variety of use cases. Apache Kafka is particularly well suited for the following applications:

  • Publishing and subscribing to data streams: Apache Kafka started out as a messaging system. Although the software’s functions have since been extended, it remains well suited both for direct message transmission via the queuing system and for broadcast message transmission.
  • Processing data streams: Apache Kafka is a powerful tool for applications which need to react in real time to specific events and must therefore process data streams as quickly and effectively as possible.
  • Storing data streams: Apache Kafka can also be used as a fault-tolerant, distributed storage system, whether you need to store 50 kilobytes or 50 terabytes of consistent data on the server(s).
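The second point — reacting to events as they arrive — can be made concrete with a minimal pure-Python sketch of a stateful stream processor (again, not the Kafka Streams API): it consumes an input stream of events and emits an updated running count per event type after every record.

```python
from collections import Counter

def count_by_key(stream):
    """Toy stateful stream processor: for each incoming (key, value) event,
    update a running count for that key and emit the new count immediately."""
    counts = Counter()
    for key, _value in stream:
        counts[key] += 1
        yield key, counts[key]   # downstream systems see each update in real time

events = [("page-view", "/home"), ("click", "buy"), ("page-view", "/cart")]
updates = list(count_by_key(events))
# updates: [('page-view', 1), ('click', 1), ('page-view', 2)]
```

Because the processor emits a result per input record rather than waiting for the stream to end, a downstream consumer always sees the current counts with minimal latency.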

Naturally, all these elements can be combined as desired. As a full-fledged streaming platform, Apache Kafka can therefore not only store data and make it available at any time, but also process it in real time and link it to any desired applications and systems.

An overview of common use cases for Apache Kafka:

  • Messaging system
  • Web analytics
  • Storage system
  • Data stream processor
  • Event sourcing
  • Log file analysis and management
  • Monitoring solutions
  • Transaction log