Apache Kafka is open source software and one of the leading solutions for storing and processing data streams. This messaging and streaming platform, licensed under Apache 2.0, offers fault tolerance, excellent scalability, and high read and write throughput. These qualities, which make it highly appealing for big data applications, rest on a cluster architecture that allows distributed data storage and replication. Four different interfaces are available for communicating with the cluster, with a simple TCP protocol serving as the basis for communication.

This Kafka tutorial explains how to get started with the Scala-based application, beginning with installing Kafka and the Apache ZooKeeper software required to use it.

Requirements for using Apache Kafka

To run a powerful Kafka cluster, you will need the right hardware. The development team recommends quad-core Intel Xeon machines with 24 gigabytes of memory. It is essential to have enough memory to cache the read and write accesses of all applications that actively use the cluster. Since high data throughput is one of Apache Kafka's main draws, choosing suitable hard drives is crucial: the Apache Software Foundation recommends SATA drives in an 8 x 7200 RPM configuration (i.e. eight drives spinning at 7,200 revolutions per minute). When it comes to avoiding performance bottlenecks, the principle is simple: the more hard drives, the better.

In terms of software, some requirements must also be met in order to use Apache Kafka for managing incoming and outgoing data streams. For the operating system, opt for a Unix system such as Solaris or a Linux distribution, because Windows platforms receive only limited support. Apache Kafka is written in Scala, which compiles to Java bytecode, so you will need a current version of the Java SE Development Kit (JDK) installed on your system; this also includes the Java Runtime Environment (JRE), which is required for running Java applications. You will also need the Apache ZooKeeper service, which synchronises distributed processes.


Apache Kafka tutorial: how to install Kafka, ZooKeeper, and Java

The previous part of this Kafka tutorial explains which software is required. If it is not already installed on your system, start by installing the Java Runtime Environment. Many newer Linux distributions, such as Ubuntu (version 17.10), which serves as the example operating system in this Apache Kafka tutorial, already include OpenJDK, a free implementation of the JDK, in their official package repository. This means that you can easily install the Java Development Kit from this repository by typing the following command into the terminal:

sudo apt-get install openjdk-8-jdk
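Once the installation finishes, it can be worth confirming that both the runtime and the compiler are on your path. A quick check (the exact version strings depend on the OpenJDK build your distribution ships) might look like this:

```shell
# Verify the Java installation; both commands print version information
java -version     # the runtime (JRE) used to run Kafka and ZooKeeper
javac -version    # the compiler, confirming the full JDK is present
```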

Immediately after installing Java, install the Apache ZooKeeper process synchronisation service. The Ubuntu package repository also provides a ready-to-use package for this service, which can be installed using the following command:

sudo apt-get install zookeeperd

You can then use an additional command to check whether the ZooKeeper service is active:

sudo systemctl status zookeeper

If Apache ZooKeeper is running, the output will show the service as active (running).

If the synchronisation service is not running, you can start it at any time using this command:

sudo systemctl start zookeeper

To ensure that ZooKeeper always launches automatically at system startup, add an autostart entry:

sudo systemctl enable zookeeper

Finally, create a user profile for Kafka, which you will need later to operate the server. To do so, open the terminal again and enter the following command:

sudo useradd kafka -m

Using the passwd utility, you can then assign a password to this user by typing the following command and then entering the desired password:

sudo passwd kafka

Next, grant the user “kafka” sudo rights:

sudo adduser kafka sudo

You can now log in at any time with the newly-created user profile:

su - kafka

We have arrived at the point in this tutorial where it is time to download and install Kafka. There are a number of trusted sources for both older and current versions of the data stream processing software. For example, you can obtain the installation files directly from the Apache Software Foundation's download directory. It is highly recommended to work with a current version of Kafka, so you may need to adjust the version numbers in the following download command before entering it into the terminal:

wget http://www.apache.org/dist/kafka/2.1.0/kafka_2.12-2.1.0.tgz

Since the downloaded file is compressed, you will then need to unpack it:

sudo tar xvzf kafka_2.12-2.1.0.tgz --strip 1

The --strip 1 flag ensures that the extracted files are saved directly to the ~/kafka directory. Otherwise, with the version used in this Kafka tutorial, Ubuntu would store all files in the ~/kafka/kafka_2.12-2.1.0 directory. For this to work, you must have previously created a directory named "kafka" using mkdir and switched to it (via "cd kafka") before running the wget and tar commands.
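If you are unsure what the --strip 1 flag does, the following self-contained sketch reproduces its effect with a throwaway tarball in /tmp; the directory and file names are made up purely for illustration:

```shell
# Build a small tarball whose contents sit inside a versioned top-level directory
mkdir -p /tmp/striptest/kafka_2.12-2.1.0/bin
touch /tmp/striptest/kafka_2.12-2.1.0/bin/kafka-server-start.sh
tar -C /tmp/striptest -czf /tmp/striptest.tgz kafka_2.12-2.1.0

# Extract it with --strip 1: the top-level directory is dropped,
# so the files land directly in the current directory
mkdir -p /tmp/striptest/out
cd /tmp/striptest/out
tar xzf /tmp/striptest.tgz --strip 1
ls bin/kafka-server-start.sh   # sits under ./bin, not ./kafka_2.12-2.1.0/bin
```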

Kafka: how to set up the streaming and messaging system

Now that you have installed Apache Kafka, the Java Runtime Environment, and ZooKeeper, you can run the Kafka service at any time. Before you do, however, you should make a few small adjustments to its configuration so that the software is optimally set up for its upcoming tasks.

Enable deleting topics

By default, Kafka does not allow you to delete topics (i.e. the storage and categorisation components in a Kafka cluster). However, you can easily change this via the server.properties Kafka configuration file. To open this file, which is located in the config directory, in the default nano text editor, use the following terminal command:

sudo nano ~/kafka/config/server.properties

At the end of this configuration file, add a new entry that enables you to delete topics:

delete.topic.enable=true
Tip

Remember to save the new entry in the Kafka configuration file before closing the nano text editor.
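To confirm that the entry was actually saved, you can print it back from the configuration file (assuming the ~/kafka path used throughout this tutorial):

```shell
# Should print the line you just added: delete.topic.enable=true
grep '^delete.topic.enable' ~/kafka/config/server.properties
```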

Creating .service files for ZooKeeper and Kafka

The next step in the Kafka tutorial is to create unit files for ZooKeeper and Kafka that allow you to perform common actions such as starting, stopping, and restarting the two services in a manner consistent with other Linux services. To do so, you need to create and set up .service files for the systemd session manager for both applications.

How to create the appropriate ZooKeeper file for the Ubuntu systemd session manager

First, create the file for the ZooKeeper syn­chron­isa­tion service by entering the following command in the terminal:

sudo nano /etc/systemd/system/zookeeper.service

This will not only create the file but also open it in the nano text editor. Now, enter the following lines and then save the file.

[Unit]
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
User=kafka
ExecStart=/home/kafka/kafka/bin/zookeeper-server-start.sh /home/kafka/kafka/config/zookeeper.properties
ExecStop=/home/kafka/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target

As a result, systemd understands that ZooKeeper requires the network and the file system to be ready before it can start; this is defined in the [Unit] section. The [Service] section specifies that the session manager should use the zookeeper-server-start.sh and zookeeper-server-stop.sh scripts to start and stop ZooKeeper, and that ZooKeeper should be restarted automatically if it stops unexpectedly. The [Install] section controls when the unit is started, with "multi-user.target" as the default value for a multi-user system (e.g. a server).

How to create the Kafka file for the Ubuntu systemd session manager

To create the .service file for Apache Kafka, use the following terminal command:

sudo nano /etc/systemd/system/kafka.service

Then, copy the following content into the new file that has already been opened in the nano text editor:

[Unit]
Requires=zookeeper.service
After=zookeeper.service
[Service]
Type=simple
User=kafka
ExecStart=/bin/sh -c '/home/kafka/kafka/bin/kafka-server-start.sh /home/kafka/kafka/config/server.properties > /home/kafka/kafka/kafka.log 2>&1'
ExecStop=/home/kafka/kafka/bin/kafka-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target

The [Unit] section in this file specifies that the Kafka service depends on ZooKeeper, which ensures that the synchronisation service is automatically started whenever kafka.service is run. The [Service] section specifies that the kafka-server-start.sh and kafka-server-stop.sh shell scripts should be used for starting and stopping the Kafka server. It also contains the specification for an automatic restart after an unexpected termination as well as the multi-user entry.
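Because both unit files were created by hand, systemd may need to re-read its configuration before it will manage them reliably. A quick way to do this, and to confirm that both units are registered, is:

```shell
# Tell systemd to pick up the newly created unit files
sudo systemctl daemon-reload

# Both services should now appear in the unit file listing
systemctl list-unit-files | grep -E 'kafka|zookeeper'
```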

Kafka: launching for the first time and creating an autostart entry

Once you have suc­cess­fully created the session manager entries for Kafka and ZooKeeper, you can start Kafka with the following command:

sudo systemctl start kafka

By default, systemd keeps a central log, the journal, in which all log messages are automatically collected. As a result, you can easily check whether the Kafka server started as desired:

sudo journalctl -u kafka

The journal output should show that the Kafka server started successfully.

If you have successfully started Apache Kafka manually, finish by enabling automatic start during system boot:

sudo systemctl enable kafka
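As a quick sanity check, you can ask systemd whether both services are running and registered for autostart:

```shell
# Both commands should report the services as active / enabled
systemctl is-active kafka zookeeper
systemctl is-enabled kafka zookeeper
```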

Apache Kafka tutorial: getting started with Apache Kafka

This part of the Kafka tutorial involves testing Apache Kafka by processing a first message with the messaging platform. To do so, you need a producer and a consumer (i.e. an instance that writes and publishes data to topics and an instance that reads data from a topic). First of all, you need to create a topic, which in this case will be called TutorialTopic. Since this is a simple test topic, it should contain only a single partition and a single replica:

~/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic TutorialTopic

Next, you need to create a producer that publishes a first example message, "Hello, World!", to the newly created topic. To do so, use the kafka-console-producer.sh shell script, which expects the Kafka server's host name and port (in this example, Kafka's defaults) as well as the topic name as arguments:

echo "Hello, World!" | ~/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TutorialTopic > /dev/null

Next, use the kafka-console-consumer.sh script to create a Kafka consumer that processes and displays messages from TutorialTopic. Again, the Kafka server's host name and port as well as the topic name are required as arguments. In addition, the --from-beginning flag is passed so that the consumer can actually process the "Hello, World!" message, which was published before the consumer was created:

~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TutorialTopic --from-beginning

As a result, the terminal displays the "Hello, World!" message, with the script continuing to run and waiting for more messages to be published to the test topic. So if you feed the producer additional input in another terminal window, you should also see it appear in the window in which the consumer script is running.
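To double-check what was created, the kafka-topics.sh script can also list and describe topics; in Kafka 2.1.0 these commands go through ZooKeeper, just like the topic creation above:

```shell
# List all topics known to the cluster
~/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181

# Show partition and replica details for the test topic
~/kafka/bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic TutorialTopic
```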
