Apache Kafka Part II – Ubuntu Installation

Introduction

In the first part of this section, I provided a general overview of Kafka. In this post, we get a bit more hands-on, and install Kafka on an Ubuntu virtual machine.

With Kafka, we can create multiple types of clusters, such as the following:

  • A single node cluster with a single broker
  • A single node cluster with multiple brokers
  • A multiple-node cluster with multiple brokers

We will start out implementing the first configuration (single node, single broker), and we will then migrate our setup to a single node, multiple brokers configuration.

Operating System

The configuration I used for this post is very simple:

  • My host machine is a Dell laptop with 32 GB of memory.
  • I am using Oracle VirtualBox. The guest OS is Ubuntu 14.04 LTS.
  • I allocated 12 GB of memory to the guest OS, which should be plenty for running Kafka in both single- and multi-broker setups.

This post picks up right after the OS installation; I had no additional software installed at this point. As a first step, we will need to install Java, after which we can proceed to the Kafka install.

Installing Java

In this case we will install Java SE, since the enterprise edition (Java EE) is not required for a Kafka installation. We will use apt-get for our installation.

First, we need to make sure that the package index is up to date:
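    sudo apt-get update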

Let’s make sure that java is not already installed by running java -version:
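    java -version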

If this returns something like “The program ‘java’ can be found in the following packages…”, Java has not been installed yet, in which case we can install it as follows:
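On Ubuntu 14.04, the simplest option is the default OpenJDK runtime:

    sudo apt-get install default-jre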

This will install the Java Runtime Environment (JRE). If you also need the Java Development Kit (JDK), which is usually needed to compile Java applications (for example with Apache Ant, Apache Maven, Eclipse, IntelliJ, etc.), then you can issue the following command:
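    sudo apt-get install default-jdk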

At this point, Java should be fully installed and ready to go. We need to ensure that we have our JAVA_HOME environment variable set correctly, which is covered in the next section.

Adding the JAVA_HOME environment variable

We want to make sure that we set the JAVA_HOME variable in a persistent way, so we don’t have to reset it after a login or reboot. Follow these steps to set your JAVA_HOME environment variable:

  • Edit the /etc/environment file with your favorite editor
  • Add the JAVA_HOME statement and point it to the correct location. In this case we want to use /usr/bin, since this is the default installation location for apt-get:

[Screenshot: the JAVA_HOME entry in /etc/environment]
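Given the /usr/bin location mentioned above, the line to add looks like this:

    JAVA_HOME="/usr/bin"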

  • Use source to re-load the variables, as is shown in the figure below:

[Screenshot: sourcing /etc/environment and verifying JAVA_HOME]
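To apply the change in the current shell session (rather than logging out and back in):

    source /etc/environment
    echo $JAVA_HOME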

At this point our Java installation is complete, and we are ready to download Kafka.

Downloading Kafka

To download Kafka, go to http://kafka.apache.org/downloads.html. I downloaded version 0.8.2.1, with Scala version 2.10, as is shown in the figure below:

[Screenshot: Kafka 0.8.2.1 (Scala 2.10) on the Apache download page]

On the next page, use the default mirror site. Once the download starts, the file will be saved to your ~/Downloads directory.

We will install Kafka under the /opt directory. Follow these steps (the corresponding commands are shown after the list):

  • Go to the /opt directory.

  • Copy the Kafka .tgz file from your Downloads directory.

  • Extract the archive.
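Assuming the archive you downloaded is kafka_2.10-0.8.2.1.tgz (adjust the file name for the version you picked), the commands look roughly like this:

    cd /opt
    sudo cp ~/Downloads/kafka_2.10-0.8.2.1.tgz .
    sudo tar -xvzf kafka_2.10-0.8.2.1.tgz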

After this, the kafka directory will look as follows:

[Screenshot: contents of the extracted Kafka directory]

Finally, we need to update the /etc/environment file to set the KAFKA_HOME path, and to add Kafka to the PATH environment variable itself:

[Screenshot: /etc/environment with KAFKA_HOME and the updated PATH]
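Assuming Kafka was extracted to /opt/kafka_2.10-0.8.2.1 (again, adjust for your version) and the stock Ubuntu 14.04 PATH, the entries would look something like this:

    JAVA_HOME="/usr/bin"
    KAFKA_HOME="/opt/kafka_2.10-0.8.2.1"
    PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/kafka_2.10-0.8.2.1/bin"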

Once we source the /etc/environment file, we should be ready to go:

[Screenshot: sourcing /etc/environment and verifying KAFKA_HOME]
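To pick up the new variables in the current shell:

    source /etc/environment
    echo $KAFKA_HOME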

Setting up a Single Node, Single Broker Cluster

Introduction

In the previous section, we installed Kafka on a single machine. In this section we will set up a single node – single broker Kafka cluster, as is shown in the figure below:

[Figure: a single node, single broker Kafka cluster]

Start ZooKeeper Service

Like I mentioned in the previous post, Kafka uses ZooKeeper, so we first need to start a ZooKeeper server. The settings for ZooKeeper are maintained in the config/zookeeper.properties file. The most relevant parts of this file are shown below:
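The stock config/zookeeper.properties that ships with Kafka contains, among other things, entries along these lines:

    # the directory where the ZooKeeper snapshot is stored
    dataDir=/tmp/zookeeper
    # the port at which the clients will connect
    clientPort=2181
    # disable the per-ip limit on the number of connections, since this is a non-production config
    maxClientCnxns=0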

By default, the ZooKeeper server will listen on TCP port 2181. For detailed information on setting up ZooKeeper servers, please visit http://zookeeper.apache.org. Start the ZooKeeper server as follows:
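From the Kafka installation directory (KAFKA_HOME):

    cd $KAFKA_HOME
    bin/zookeeper-server-start.sh config/zookeeper.properties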

The ZooKeeper server will start up, as is shown in the figure below:

[Screenshot: the ZooKeeper server starting up]

I recommend using a separate terminal window for each process, so you can keep track of what is happening across the different servers you will be starting as part of this post.

Start the Kafka Broker

Now we are ready to start the Kafka broker. The configuration properties of the broker are defined in the config/server.properties configuration file. The most relevant portions of this file are shown below:
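The stock config/server.properties contains, among other things, entries along these lines:

    # The id of the broker; must be unique for each broker in the cluster
    broker.id=0
    # The port the socket server listens on
    port=9092
    # The directory under which to store the log files
    log.dirs=/tmp/kafka-logs
    # ZooKeeper connection string
    zookeeper.connect=localhost:2181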

Here, we see that the ID of the broker is set to zero. All brokers in a cluster will be assigned sequential IDs starting with zero. The broker itself will listen on port 9092, and it will store its logs in the /tmp/kafka-logs directory. We also see the connection string for ZooKeeper, which uses the same client port, 2181.

To start the broker, launch a new terminal window, and issue the following command:
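Again from the Kafka installation directory:

    cd $KAFKA_HOME
    bin/kafka-server-start.sh config/server.properties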

The broker will start, showing its output in the terminal window:

[Screenshot: the Kafka broker starting up]

Creating a Kafka Topic

Kafka provides a command line utility named kafka-topics.sh to create topics on the server. To create a topic named telemetry with a single partition and only one replica, start a new terminal instance and issue the following command:
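For this Kafka release, the topic utility takes the ZooKeeper connection string rather than the broker address:

    bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic telemetry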

The kafka-topics.sh utility will create the topic, and show a successful creation message:

[Screenshot: kafka-topics.sh reporting that the topic was created]

In the broker terminal window, we can watch the topic being created:

[Screenshot: broker log output showing the topic being created]

The log for the new topic has been created in /tmp/kafka-logs/ as was specified in the config/server.properties file:

[Screenshot: the topic log directory under /tmp/kafka-logs/]

To get a list of topics on any Kafka server, we can use the following command in our terminal window:
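Again pointing the utility at ZooKeeper:

    bin/kafka-topics.sh --list --zookeeper localhost:2181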

The utility will output the name of our telemetry topic as is shown below:

[Screenshot: output of the topic list command]

At this point, we have:

  1. Started the ZooKeeper server, which provides the coordination services for our Kafka Cluster.
  2. Started a single Kafka broker.
  3. Created a topic.

We are now ready to start sending and receiving messages. Kafka provides some out-of-the-box utilities that can act as message producers and consumers, so let’s move on to those next.

Starting a Producer to send messages

Kafka provides users with a command line producer client that accepts input from the command line and publishes it as messages to the Kafka cluster. By default, each new line is published as a new message. Start (yet another) new terminal window, and issue the following command to start the producer:
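From the Kafka installation directory:

    bin/kafka-console-producer.sh --broker-list localhost:9092 --topic telemetry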

The following parameters are required for the producer command line client:

  • The broker list. This is the list of brokers that we want to send the messages to. In this case we only have one broker. Each broker should be specified in the format <node_address:port>, which in our case is localhost:9092, since we know our broker is listening on port 9092, as specified in config/server.properties.
  • The second parameter is the name of the topic, which in our case is telemetry.

The producer will wait for input on stdin. Go ahead and type a few lines of messages, as is shown below:

[Screenshot: typing messages into the console producer]

The default properties for the producer are defined in the config/producer.properties file. The most important properties are listed below:
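The stock config/producer.properties that ships with this release contains, among other things, entries along these lines:

    # list of brokers used for bootstrapping knowledge about the rest of the cluster
    # format: host1:port1,host2:port2 ...
    metadata.broker.list=localhost:9092
    # specify the compression codec for all data generated: none, gzip, snappy, lz4
    compression.codec=none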

The metadata.broker.list property allows us to specify the list of brokers. In our case we only have one broker at port 9092. At the moment, we are not using any compression, so we specify none as the compression codec.

In later posts, we will cover the details of writing producers for Kafka in detail.

Starting a consumer to consume messages

Kafka also provides a command line consumer client for message consumption. The following command starts a console-based consumer that shows the output in the terminal window as soon as it subscribes to the telemetry topic created on the Kafka broker:
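For this release, the console consumer connects through ZooKeeper; the --from-beginning flag makes it replay any messages that were sent before it started:

    bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic telemetry --from-beginning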

When you type messages into the producer from the previous section, you should see the following output:

[Screenshot: the console consumer echoing the produced messages]

As one can see, the consumer will echo the text typed into the producer. The default properties for the consumer are defined in the config/consumer.properties file. The most important properties are shown below:
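The stock config/consumer.properties contains, among other things, entries along these lines:

    # ZooKeeper connection string
    zookeeper.connect=127.0.0.1:2181
    # timeout in ms for connecting to ZooKeeper
    zookeeper.connection.timeout.ms=6000
    # consumer group id
    group.id=test-consumer-group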

We will cover consumer groups and the details of writing consumers in later posts.

Conclusion

By running all four components (ZooKeeper, broker, producer, and consumer) in different terminals, we are able to enter messages in the producer’s terminal and see them appear in the consumer’s terminal. We now have a good understanding of the basics of Kafka.

In the next post, we will modify our configuration to use a single node – multiple broker setup.


Using the Hadoop HDFS Command Line

Introduction

The Hadoop Distributed File System (HDFS) includes an easy-to-use command line interface which you can use to manage the distributed file system. You will need access to a working Hadoop cluster if you want to follow along. If you don’t have direct access to a cluster, you can create a local Hortonworks or Cloudera single-node cluster, as described in my earlier post: Getting Started with Enterprise Hadoop.

Logging in to the cluster

Since we will be using the command line interface here, we need to be able to log into our cluster. In my case I am using the Hortonworks Sandbox VM, running in VirtualBox. On my local machine, I created the hostname hw_sandbox by adding the following entry to my C:\Windows\System32\drivers\etc\hosts file:
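The IP address below is just an example; use the address that VirtualBox assigned to your sandbox VM:

    192.168.56.101    hw_sandbox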

The generic syntax for using ssh is as follows:
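    ssh <username>@<hostname>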

This is what I use to log into my Hortonworks sandbox:
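Given the hosts entry above:

    ssh root@hw_sandbox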

Note that, unlike me, you should not be using the root user name if you are on a real cluster!

Generic Command Structure

When you look for the right command line syntax to use, you might get different answers. Three different command lines are supported. They appear to be very similar, but they have minor differences:

  1. hadoop fs <arguments>
  2. hadoop dfs <arguments>
  3. hdfs dfs <arguments>

Let’s take a look at the command structure for each one. First the hadoop fs command line:
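The general form, with a directory listing as an example:

    hadoop fs <command> <arguments>
    hadoop fs -ls /user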


FS relates to any generic file system, which could be the local file system, HDFS, FTP, S3, and so on. Most Hadoop distributions are configured to default to HDFS if you specify hadoop fs. You can explicitly specify HDFS in the hadoop command line, like so:
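For example:

    hadoop dfs -ls /user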


This explicitly specifies HDFS, and would work for any HDFS operation. However, this command has been deprecated. You should use the hdfs dfs syntax instead, as is shown below:
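For example:

    hdfs dfs -ls /user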


This syntax works for all operations against HDFS, and it is the recommended command to use instead of hadoop fs or hadoop dfs. Indeed, if you use the hadoop dfs syntax, it will locate hdfs and delegate the command to hdfs dfs.

Creating a directory in HDFS

The syntax for creating a directory is shown below:
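    hdfs dfs -mkdir <path>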

You can specify a path using two different syntax notations, as is shown below:

  • You can use a standard local path (for example /user/bennie)
  • You can use the hdfs:// URL syntax (for example hdfs://sandbox.hortonworks.com/user/bennie).

Note: depending on the configuration of your cluster, you might have to specify a port in addition to the host name.

Below is an example of these commands running in the Hortonworks sandbox bash shell:
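Using the two notations from the list above, the commands look roughly like this (both forms target the same directory, so the second one will report that it already exists if run after the first):

    hdfs dfs -mkdir /user/bennie
    hdfs dfs -mkdir hdfs://sandbox.hortonworks.com/user/bennie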

Uploading a File and Listing Contents

Let’s create a local file; we’ll use echo in this case:
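The file name and contents here are just placeholders:

    echo "Hello HDFS" > hello.txt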


You have a variety of commands available to upload a file to HDFS; I am using the put command here:
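Using the placeholder file and the directory created earlier:

    hdfs dfs -put hello.txt /user/bennie/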


Next, we can list the contents of our directory with the ls command. Again, we can use the standard file system notation, or we can use the hdfs:// notation, as is shown below:
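Both forms point at the same directory:

    hdfs dfs -ls /user/bennie
    hdfs dfs -ls hdfs://sandbox.hortonworks.com/user/bennie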


If you run these commands in the sandbox shell, the listing should include the file you just uploaded.

Downloading a File from HDFS

We can use the get command to download a file from HDFS to our local file system:
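The general form is:

    hdfs dfs -get <hdfs_path> <local_path>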


Below is an example of this command running in the shell:
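Using the placeholder file from the earlier steps, and copying it into the current local directory:

    hdfs dfs -get /user/bennie/hello.txt .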

Getting Started with Enterprise Hadoop

So, you have decided that you would like to learn Hadoop, but you have no idea where to get started? You could go to the Apache Hadoop download page, download the most recent distribution, and install it on your system. But be forewarned: doing a local install directly from the Apache distribution is not for the faint of heart, and can be extremely time-consuming.

An easy way to get started with Big Data and Enterprise Hadoop is to leverage the sandbox virtual machine images offered by both Cloudera and Hortonworks. In this post, I will perform a quick walkthrough of both options.

Using the Hortonworks Sandbox

An easy way to get started with Big Data and Enterprise Hadoop is to leverage the Hortonworks Sandbox. The sandbox is a personal, portable Hadoop environment, containing the latest HDP distribution which runs in a virtual machine.

You have two deployment options with the Hortonworks Sandbox:

  1. You can deploy it as a standalone VM on your local computer.
  2. You can run the Sandbox as a cloud-based virtual machine.

I recommend the first option if you have a more powerful local computer, ideally with 8 GB of RAM or more (I would recommend at least 16 GB). The cloud-based option is more appropriate if you have a smaller system, or if you need access to the sandbox from multiple devices.

If you decide to install the sandbox locally, you have a choice of three different virtualization environments:

  • Oracle VirtualBox
  • VMware
  • Hyper-V on Windows

I deployed the sandbox on VirtualBox, and it was a very quick and easy process, although you do need to allow for some time to download the image from the Hortonworks Web site.

Using the Cloudera QuickStart VMs

The Cloudera QuickStart is very similar to the Hortonworks Sandbox offering. The QuickStart VMs contain a single-node Apache Hadoop cluster, complete with example data, queries, scripts, and Cloudera Manager.

The QuickStart supports the following virtualization environments:

  • Oracle VirtualBox
  • VMware
  • KVM

I deployed the QuickStart on my local machine’s VirtualBox, and again the process was very straightforward.

Conclusion

Both the Hortonworks Sandbox and the Cloudera QuickStart are a great, painless way of getting started with Enterprise Hadoop. I find that I don’t just use them for learning, but also for experimenting with new projects and new features in a risk-free environment. I would definitely recommend giving either (or both) a spin. Which one you want to use depends on your platform preferences; I tend to like a more “pure” Apache distribution without any proprietary add-ons, so I tend to work more with the Hortonworks Sandbox.