Genie Ansible Playbook for EMR

View code on GitHub

Genie is the NetflixOSS Hadoop Platform as a Service. It provides RESTful APIs to run Hadoop, Hive and Pig jobs, to manage multiple Hadoop resources, and to submit jobs across them.


You need Ansible and AWS set up and configured. This is a 10 minute process, and you can watch Episode 2 to see how to do it.

Launch an EMR cluster for Genie

  1. If you don't already have one, create a new Key Pair, and add it to your keychain or SSH agent so you don't need to specify it later:

    $ ssh-add mykey.pem
  2. Launch an Elastic MapReduce (EMR) JobFlow using the above Key Pair

    • Use the 2.4.2 AMI
    • Make sure the master node is at least an m1.medium so that Tomcat has enough RAM to run
    • Get EMR to install Hive 0.11
    • Get EMR to install Pig 0.11.1
  3. Either:
  4. Go to the EC2 page, and set the Name tag of the master node to Genie
  5. Confirm you can see the instance using the Ansible EC2 inventory:
    $ /etc/ansible/hosts | grep 'Genie'
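
The inventory check in the last step can be made a little stricter. This is a minimal sketch, assuming the EC2 inventory script is installed as /etc/ansible/hosts and prints JSON that names the tag_Name_Genie group (the same group the playbook is limited to later). The function name has_genie_group is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read inventory output on stdin and succeed only if it
# mentions the tag_Name_Genie group. It does not validate the JSON itself.
has_genie_group() {
  grep -q 'tag_Name_Genie'
}
```

To use it, pipe the inventory output into the function:

    $ /etc/ansible/hosts | has_genie_group && echo "Genie master node found"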

Run Ansible playbook

OK, you are now ready to install Genie on the master node of the EMR JobFlow.

$ ansible-playbook playbooks/genie-hadoop-emr.yml -l 'tag_Name_Genie'

This will configure the master node to run the latest snapshot build of Genie. If you prefer to build the WAR file yourself, just specify its path:

$ ansible-playbook playbooks/genie-hadoop-emr.yml -l 'tag_Name_Genie' -e "local_war=/path/to/genie.war"
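
If you switch between the two invocations often, a small wrapper can pick the right one. This is only a sketch: the function name run_genie_playbook and the DRY_RUN variable are hypothetical and not part of the playbook. It adds the local_war variable only when a WAR path is given and that file exists.

```shell
#!/bin/sh
# Hypothetical wrapper around the two ansible-playbook invocations above.
# Usage: run_genie_playbook [/path/to/genie.war]
# Set DRY_RUN to a non-empty value to print the command instead of running it.
run_genie_playbook() {
  war="${1:-}"
  # Base command: install the latest snapshot build on the tagged master node.
  set -- ansible-playbook playbooks/genie-hadoop-emr.yml -l tag_Name_Genie
  if [ -n "$war" ]; then
    # Refuse to continue if the given WAR file does not exist locally.
    if [ ! -f "$war" ]; then
      echo "WAR file not found: $war" >&2
      return 1
    fi
    set -- "$@" -e "local_war=$war"
  fi
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$@"
  else
    "$@"
  fi
}
```

For example, `DRY_RUN=1 run_genie_playbook /path/to/genie.war` prints the command that would be run, which is a cheap way to check the arguments before touching the cluster.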

Access Genie

Once the playbook is finished, you will have Genie running inside Tomcat on your EMR master node. You can access it via HTTP. Example:

Important Notes

  • At Netflix, Genie is run as a standalone service outside of an EMR cluster. Each cluster then registers itself with the main Genie service. This playbook just gives you a quick way to test Genie out on a single cluster.


If you have feedback, comments or suggestions, please feel free to contact Peter at Answers for AWS, create an Issue, or submit a pull request.
