A bit of background on Ensemble from their website:
- Ensemble is a next generation service orchestration framework. It has been likened to APT for the cloud. With Ensemble, different authors are able to create service formulas independently, and make those services coordinate their communication through a simple protocol. Users can then take the product of different authors and very comfortably deploy those services in an environment. The result is multiple machines and components transparently collaborating towards providing the requested service.
I come from a DevOps background and know first-hand the troubles and tribulations of deploying production services, webapps, etc. One that's particularly "thorny" is hadoop.
To deploy a hadoop cluster, we would need to download the dependencies ( java, etc. ), download hadoop, configure it and deploy it. This process differs somewhat depending on the type of node that you're deploying ( e.g. namenode, jobtracker, etc. ). It is a multi-step process that requires too much human intervention, and it is difficult to automate and reproduce. Imagine building a 10, 20 or 50 node cluster using this method. It gets frustrating quickly and it is prone to mistakes.
With this experience in mind ( and a lot of reading ), I set out to deploy a hadoop cluster using an Ensemble formula.
First things first, let's install Ensemble. Follow the Getting Started documentation on the Ensemble site here.
According to the Ensemble documentation, we just need to follow some file naming conventions for what they call "hooks" ( executable scripts in your language of choice that perform certain actions ). These "hooks" control the installation, relationships, start, stop, etc. of your formula. We also need to summarize the description of the formula in a file called metadata.yaml. The metadata.yaml file describes the formula, its interfaces, and what it requires and provides, among other things. More on this file later when I show you the one for hadoop-master and hadoop-slave.
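To make the naming convention concrete, here is a minimal shell sketch that lays out an empty formula skeleton. The scaffold_formula helper and the placeholder hook bodies are my own illustration and not part of Ensemble; the hook names ( install, start, stop ) are the ones used for hadoop-master in this post.

```shell
#!/bin/bash
# Sketch: lay out an empty formula skeleton.
# scaffold_formula is a hypothetical helper, not an Ensemble command.
set -eu

scaffold_formula() {
    local name="$1"
    local dir="$2/$name"
    mkdir -p "$dir/hooks"
    # metadata.yaml describes the formula; this is only a placeholder.
    printf 'ensemble: formula\nname: %s\nrevision: 1\n' "$name" > "$dir/metadata.yaml"
    # Each hook is an executable script named after the event it handles.
    local hook
    for hook in install start stop; do
        printf '#!/bin/bash\n# TODO: %s logic goes here\n' "$hook" > "$dir/hooks/$hook"
        chmod +x "$dir/hooks/$hook"
    done
}

# Demo: build the skeleton in a scratch directory and list it.
workdir="$(mktemp -d)"
scaffold_formula hadoop-master "$workdir"
(cd "$workdir" && find hadoop-master | sort)
```

The skeleton is written to a temporary directory so that trying it out does not touch a real formula repository.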
Armed with a bit of knowledge and a desire for simplicity, I decided to split the hadoop cluster in two:
- hadoop-master (namenode and jobtracker )
- hadoop-slave ( datanode and tasktracker )
One of my colleagues, Brian Thomason, did a lot of packaging for these formulas, so my job is now easier. The configuration for the packages has been distilled down to three questions:
- namenode ( leave blank if you are the namenode )
- jobtracker ( leave blank if you are the jobtracker )
- hdfs data directory ( leave blank to use the default: /var/lib/hadoop-0.20/dfs/data )
These answers can be preseeded with debconf-set-selections before the packages are installed:
- echo debconf hadoop/namenode string ${NAMENODE}| /usr/bin/debconf-set-selections
- echo debconf hadoop/jobtracker string ${JOBTRACKER}| /usr/bin/debconf-set-selections
- echo debconf hadoop/hdfsdatadir string ${HDFSDATADIR}| /usr/bin/debconf-set-selections
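The three preseed commands differ only in the question name and the value, so they can be generated in one place. The following is a rough sketch; hadoop_preseed is a hypothetical helper name and master.example.com is a placeholder host, neither of which comes from the packages themselves.

```shell
#!/bin/bash
# Sketch: build the three debconf answers in one place.
# hadoop_preseed is a hypothetical helper; the question names come from the post.
set -eu

hadoop_preseed() {
    local namenode="$1" jobtracker="$2" hdfsdatadir="$3"
    # One line per question: <owner> <question> <type> <value>
    printf 'debconf hadoop/namenode string %s\n' "$namenode"
    printf 'debconf hadoop/jobtracker string %s\n' "$jobtracker"
    printf 'debconf hadoop/hdfsdatadir string %s\n' "$hdfsdatadir"
}

# Demo: print the answers for a node that points at a master host.
# "master.example.com" is a placeholder, not a real host.
hadoop_preseed master.example.com master.example.com /var/lib/hadoop-0.20/dfs/data
```

The demo only prints the lines. On a real node, the output would be piped into /usr/bin/debconf-set-selections as root, just like the individual echo commands.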
Thanks to Brian's work, I now just have to install the packages ( hadoop-0.20-namenode and hadoop-0.20-jobtracker). Let's put all of this together into an ensemble formula.
- Create a directory for the hadoop-master formula ( mkdir hadoop-master )
- Make a directory for the hooks of this formula ( mkdir hadoop-master/hooks )
- Let's start with the always needed metadata.yaml file ( hadoop-master/metadata.yaml ):
name: hadoop-master
revision: 1
summary: Master Node for Hadoop
description: |
  The Hadoop Distributed Filesystem (HDFS) requires one unique server, the
  namenode, which manages the block locations of files on the
  filesystem. The jobtracker is a central service which is responsible
  for managing the tasktracker services running on all nodes in a
  Hadoop Cluster. The jobtracker allocates work to the tasktracker
  nearest to the data with an available work slot.
provides:
  hadoop-master:
    interface: hadoop-master
- Every Ensemble formula has an install script ( in our case: hadoop-master/hooks/install ). This is an executable file in your language of choice that ensemble will run when it's time to install your formula. Anything and everything that needs to happen for your formula to install, needs to be inside of that file. Let's take a look at the install script of hadoop-master:
#!/bin/bash
# Here do anything needed to install the service
# i.e. apt-get install -y foo or bzr branch http://myserver/mycode /srv/webroot
##################################################################################
# Set debugging
##################################################################################
set -ux
ensemble-log "install script"
##################################################################################
# Add the repositories
##################################################################################
export TERM=linux
# Add the Hadoop PPA
ensemble-log "Adding ppa"
apt-add-repository ppa:canonical-sig/thirdparty
ensemble-log "updating cache"
apt-get update
##################################################################################
# Determine our address ( hostname -f returns the fully qualified hostname, not an IP )
##################################################################################
ensemble-log "calculating address"
IP_ADDRESS="$(hostname -f)"
ensemble-log "Private address: ${IP_ADDRESS}"
##################################################################################
# Preseed our Namenode, Jobtracker and HDFS Data directory
##################################################################################
NAMENODE="${IP_ADDRESS}"
JOBTRACKER="${IP_ADDRESS}"
HDFSDATADIR="/var/lib/hadoop-0.20/dfs/data"
ensemble-log "Namenode: ${NAMENODE}"
ensemble-log "Jobtracker: ${JOBTRACKER}"
ensemble-log "HDFS Dir: ${HDFSDATADIR}"
echo debconf hadoop/namenode string ${NAMENODE}| /usr/bin/debconf-set-selections
echo debconf hadoop/jobtracker string ${JOBTRACKER}| /usr/bin/debconf-set-selections
echo debconf hadoop/hdfsdatadir string ${HDFSDATADIR}| /usr/bin/debconf-set-selections
##################################################################################
# Install the packages
##################################################################################
ensemble-log "installing packages"
apt-get install -y hadoop-0.20-namenode
apt-get install -y hadoop-0.20-jobtracker
##################################################################################
# Open the necessary ports
##################################################################################
if [ -x /usr/bin/open-port ]; then
    open-port 50010/TCP
    open-port 50020/TCP
    open-port 50030/TCP
    open-port 50105/TCP
    open-port 54310/TCP
    open-port 54311/TCP
    open-port 50060/TCP
    open-port 50070/TCP
    open-port 50075/TCP
    open-port 50090/TCP
fi
- There are a few other files that we need to create ( start and stop ) to get the hadoop-master formula installed. Let's see those files:
- start
#!/bin/bash
# Here put anything that is needed to start the service.
# Note that currently this is run directly after install
# i.e. 'service apache2 start'
set -x
# Restart each daemon if it is already running, otherwise start it
service hadoop-0.20-namenode status && service hadoop-0.20-namenode restart || service hadoop-0.20-namenode start
service hadoop-0.20-jobtracker status && service hadoop-0.20-jobtracker restart || service hadoop-0.20-jobtracker start
- stop
#!/bin/bash
# This will be run when the service is being torn down, allowing you to disable
# it in various ways..
# For example, if your web app uses a text file to signal to the load balancer
# that it is live... you could remove it and sleep for a bit to allow the load
# balancer to stop sending traffic.
# rm /srv/webroot/server-live.txt && sleep 30
set -x
ensemble-log "stop script"
service hadoop-0.20-namenode stop
service hadoop-0.20-jobtracker stop
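The "status && restart || start" one-liner in the start hook has one subtlety: if the restart itself fails, the start command runs as well. A more explicit variant could look like the sketch below. It is my own rewrite and not the code from the published formula; SERVICE_CMD and fake_service are hypothetical names that exist only so the logic can be tried without root and without hadoop installed.

```shell
#!/bin/bash
# Sketch: explicit restart-or-start logic for the two master daemons.
# SERVICE_CMD is a hypothetical override; in a real hook it would stay "service".
set -u

SERVICE_CMD="${SERVICE_CMD:-service}"

ensure_running() {
    local name="$1"
    if "$SERVICE_CMD" "$name" status >/dev/null 2>&1; then
        # Already running: restart so it picks up any new configuration.
        "$SERVICE_CMD" "$name" restart
    else
        "$SERVICE_CMD" "$name" start
    fi
}

start_master() {
    ensure_running hadoop-0.20-namenode
    ensure_running hadoop-0.20-jobtracker
}

# Demo stand-in: prints what it would do and reports every daemon as stopped.
fake_service() {
    echo "would run: service $*"
    [ "$2" != "status" ]
}
SERVICE_CMD=fake_service
start_master
```

In an actual start hook, the demo stand-in would be dropped so that start_master calls the real service command.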
Let's go back to the metadata.yaml file and examine it in more detail:
ensemble: formula
name: hadoop-master
revision: 1
summary: Master Node for Hadoop
description: |
  The Hadoop Distributed Filesystem (HDFS) requires one unique server, the
  namenode, which manages the block locations of files on the
  filesystem. The jobtracker is a central service which is responsible
  for managing the tasktracker services running on all nodes in a
  Hadoop Cluster. The jobtracker allocates work to the tasktracker
  nearest to the data with an available work slot.
provides:
  hadoop-master:
    interface: hadoop-master
The provides section at the end tells Ensemble that this formula provides an interface named hadoop-master that can be used in relationships with other formulas ( in our case we'll be using it to connect the hadoop-master with the hadoop-slave formula that we'll be writing a bit later ). For this relationship to work, we need to let Ensemble know what to do ( More detailed information about relationships in formulas can be found here ).
Per the Ensemble documentation, we need to name our relationship hook hadoop-master-relation-joined. Like the other hooks, it should be an executable script in your language of choice. Let's see what that file looks like:
#!/bin/sh
# This must be renamed to the name of the relation. The goal here is to
# affect any change needed by relationships being formed
# This script should be idempotent.
set -x
ensemble-log "joined script started"
# Determine our address ( hostname -f returns the fully qualified hostname, not an IP )
IP_ADDRESS="$(hostname -f)"
# Preseed our Namenode, Jobtracker and HDFS Data directory
NAMENODE="${IP_ADDRESS}"
JOBTRACKER="${IP_ADDRESS}"
HDFSDATADIR="/var/lib/hadoop-0.20/dfs/data"
relation-set namenode="${NAMENODE}" jobtracker="${JOBTRACKER}" hdfsdatadir="${HDFSDATADIR}"
echo $ENSEMBLE_REMOTE_UNIT joined
Your formula directory should now look something like this:
hadoop-master
hadoop-master/metadata.yaml
hadoop-master/hooks/install
hadoop-master/hooks/start
hadoop-master/hooks/stop
hadoop-master/hooks/hadoop-master-relation-joined
This formula should now be complete. It's not too exciting yet, as it doesn't have its hadoop-slave counterpart, but it is a complete formula.
The latest version of the hadoop-master formula can be found here if you want to get it.
The hadoop-slave formula is almost the same as the hadoop-master formula with some exceptions. Those I'll leave as an exercise for the reader.
The hadoop-slave formula can be found here if you want to get it.
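As a hint for that exercise, the slave side mainly has to read back the values that the master published with relation-set. The sketch below is my own guess at the shape of such a hook and not the published hadoop-slave formula. It assumes the relation-get hook command is called with a single key, and RELATION_GET and fake_relation_get are hypothetical names used only so the logic can run outside of a real hook.

```shell
#!/bin/bash
# Sketch of the slave side of the relation ( not the published hadoop-slave formula ).
# RELATION_GET is a hypothetical override so this can be tried outside a hook.
set -u

RELATION_GET="${RELATION_GET:-relation-get}"

slave_preseed_from_relation() {
    local namenode jobtracker hdfsdatadir
    # Read the values the master published with relation-set.
    namenode="$("$RELATION_GET" namenode)"
    jobtracker="$("$RELATION_GET" jobtracker)"
    hdfsdatadir="$("$RELATION_GET" hdfsdatadir)"
    # The master may not have published its settings yet; do nothing until it has.
    if [ -z "$namenode" ] || [ -z "$jobtracker" ]; then
        return 1
    fi
    printf 'debconf hadoop/namenode string %s\n' "$namenode"
    printf 'debconf hadoop/jobtracker string %s\n' "$jobtracker"
    printf 'debconf hadoop/hdfsdatadir string %s\n' "$hdfsdatadir"
}

# Demo stand-in for relation-get that returns made-up values.
fake_relation_get() {
    case "$1" in
        namenode|jobtracker) echo "master.example.com" ;;
        hdfsdatadir) echo "/var/lib/hadoop-0.20/dfs/data" ;;
    esac
}
RELATION_GET=fake_relation_get
slave_preseed_from_relation
```

In a real hook the printed lines would be piped into /usr/bin/debconf-set-selections before the slave packages are installed, mirroring what the master's install script does for itself.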
Once you have both formulas ( hadoop-master and hadoop-slave ) you can easily deploy your cluster by typing:
- ensemble bootstrap # ( creates/bootstraps the ensemble environment)
- ensemble deploy --repository . hadoop-master # ( deploys hadoop-master )
- ensemble deploy --repository . hadoop-slave # ( deploys hadoop-slave )
- ensemble add-relation hadoop-slave hadoop-master # ( connects the hadoop-slave to the hadoop-master )
To add another node to this existing hadoop cluster, we run:
- ensemble add-unit hadoop-slave # ( this adds one more slave )
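For larger clusters, the same commands can be wrapped in a small script. This is only a sketch: deploy_cluster, run and the DRY_RUN variable are my own names, and the script prints the commands instead of executing them unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Sketch: bring up one master plus N slaves with the commands from this post.
# Dry-run by default; deploy_cluster, run and DRY_RUN are hypothetical names.
set -eu

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

deploy_cluster() {
    local slaves="$1"
    run ensemble bootstrap
    run ensemble deploy --repository . hadoop-master
    run ensemble deploy --repository . hadoop-slave
    run ensemble add-relation hadoop-slave hadoop-master
    # The deploy above creates the first slave, so N slaves need N-1 extra units.
    local i=1
    while [ "$i" -lt "$slaves" ]; do
        run ensemble add-unit hadoop-slave
        i=$((i + 1))
    done
}

# Demo: show the commands for a cluster with three slaves.
deploy_cluster 3
```

Printing the commands first makes it easy to review what would be run before anything is actually deployed.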
Ensemble allows you to catalog the steps needed to get your service/application installed, configured and running properly. Once your knowledge has been captured in an Ensemble formula, it can be reused by you or others without much knowledge of what's needed to get the application/service running.
In the DevOps world, this code reusability can save time, effort and money by providing self-contained formulas that deliver a service or application.

[...] you’re interested to learn more about exactly how this “magic” works, check out this indepth guide dissecting how the hadoop Ensemble formulas exactly work by non-other than Juan Negron, the formula [...]