Getting Started with Enterprise Hadoop

So, you decided that you would like to learn Hadoop, but you have no idea where to get started? You could go to the Apache Hadoop download page, download the most recent distribution and install it on your system. But be forewarned, doing a local install directly from the apache distribution is not for the faint of heart, and can be extremely time consuming.

As easy way to get started with Big Data and Enterprise Hadoop is to leverage the sandbox virtual machine images offered by both Cloudera and Hortonworks. In this post, I will perform a quick walk through of both options.

Using the Hortonworks Sandbox

An easy way to get started with Big Data and Enterprise Hadoop is to leverage the Hortonworks Sandbox. The sandbox is a personal, portable Hadoop environment, containing the latest HDP distribution which runs in a virtual machine.

You have two deployment options with the Hortonworks Sandbox:

  1. You can deploy it as a standalone VM on your local computer.
  2. You can run the Sandbox as a cloud-based virtual machine.

I recommend using the first option if you have a more powerful local computer, ideally with 8 GB of RAM (or more, I would recommend at least 16 GB). The cloud-based option is more appropriate if you have a smaller system, or if you need access to the sandbox from multiple devices.

If you decide to install the sandbox locally, you have a choice of three different virtualization environments:

  • Oracle’s VirtualBox.
  • VMWare
  • Hyper-V on Windows

I deployed the sandbox on VirtualBox, and it was a very quick and easy process, although you do need to allow for some time to download the image from the Hortonworks Web site.

Using the Cloudera QuickStart VMs

The Cloudera Quickstart is very similar to the Hortonworks Sandbox offering. The QuickStart VMs contain a single-node Apache Hadoop cluster, complete with example data, queries scripts and Cloudera manager.

The Quickstart supports the following virtualization environments:

  • Oracle’s VirtualBox.
  • VMWare
  • KVM

I deployed the QuickStart on my local machine’s VirtualBox, and again the process was very straight forward.

Conclusion

Both the Hortonworks Sandbox and the Cloudera QuickStart are a great, painless way of getting started with Enterprise Hadoop. I noticed that I don’t just use them for learning, but also for experimenting with new projects and new features in a risk-free environment. I would definitely recommend giving either (or both) a spin. Which one you would want to use depends on your platform preferences. I tend to like a more “pure” Apache distribution without any proprietary addon’s so I tend to work more with the Hortonworks Sandbox.