Genie Ansible Playbook for EMR
Genie is the NetflixOSS Hadoop Platform as a Service. It provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
Launch an EMR cluster for Genie
If you don't already have one, create a new Key Pair, and add it to your keychain or SSH agent so you don't need to specify it later:
$ ssh-add mykey.pem
Launch an Elastic MapReduce (EMR) JobFlow using the above Key Pair
- Use the 2.4.2 AMI
- Make sure the master node is at least an m1.medium so that Tomcat has enough RAM to run
- Get EMR to install Hive 0.11
- Get EMR to install Pig 0.11.1
- Modify the ElasticMapReduce-master security group and allow port 7001 access from your IP address only
- OR, set up a proxy to the Elastic MapReduce master and access it that way
- Go to the EC2 page, and set the Name tag of the master node to Genie
- Confirm you can see the instance using the Ansible EC2 inventory
$ /etc/ansible/hosts | grep 'Genie'
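The security group, tagging and inventory steps above can also be scripted. This is only a sketch, assuming the AWS CLI is installed and configured and that /etc/ansible/hosts is the executable EC2 inventory script printing JSON. The function names, the IP address and the instance ID are hypothetical placeholders, and the tag value Genie is inferred from the tag_Name_Genie group used with the playbook. By default the AWS commands are only printed (DRY_RUN=1) so you can review them first.

```shell
#!/bin/sh
# Sketch only: assumes the AWS CLI is installed and configured.
# With DRY_RUN=1 (the default) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Allow TCP port 7001 on the ElasticMapReduce-master group from one IP only.
# $1 = your public IPv4 address (placeholder in the usage note).
# --group-name applies to EC2-Classic or a default VPC; use --group-id otherwise.
allow_genie_port() {
    run aws ec2 authorize-security-group-ingress \
        --group-name ElasticMapReduce-master \
        --protocol tcp --port 7001 --cidr "$1/32"
}

# Set the Name tag of the master node so the inventory exposes tag_Name_Genie.
# $1 = EC2 instance ID of the master node (placeholder in the usage note).
tag_master_node() {
    run aws ec2 create-tags --resources "$1" --tags Key=Name,Value=Genie
}

# Reads inventory output on stdin and succeeds only if it mentions
# the tag_Name_Genie group.
has_genie_group() {
    grep -q 'tag_Name_Genie'
}
```

For example, `allow_genie_port 203.0.113.10` and `tag_master_node i-0123456789abcdef0` print the commands that would run; set DRY_RUN=0 to execute them. Afterwards, `/etc/ansible/hosts | has_genie_group && echo "Genie master found"` confirms that the inventory sees the tagged node.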
Run the Ansible playbook
OK, you are now ready to install Genie on the master node of the EMR JobFlow.
$ ansible-playbook playbooks/genie-hadoop-emr.yml -l 'tag_Name_Genie'
This configures the master node to run the latest snapshot build of Genie. If you prefer to build the WAR file yourself, specify the path to it:
$ ansible-playbook playbooks/genie-hadoop-emr.yml -l 'tag_Name_Genie' -e "local_war=/path/to/genie.war"
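When passing your own WAR file, it can save a failed run to confirm the path first. This is a small sketch. The check_war function is a hypothetical helper, not part of the playbook.

```shell
#!/bin/sh
# Sketch: fail early if the WAR path is not a readable regular file.
# $1 = path to a locally built WAR file
check_war() {
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "WAR file not found or unreadable: $1" >&2
        return 1
    fi
}
```

It can be chained in front of the command above, for example `check_war /path/to/genie.war && ansible-playbook playbooks/genie-hadoop-emr.yml -l 'tag_Name_Genie' -e "local_war=/path/to/genie.war"`.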
Once the playbook has finished, Genie will be running inside Tomcat on your EMR master node. You can access it over HTTP on port 7001, the port opened in the security group earlier.
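One way to check that Tomcat is answering is a quick request from your own machine. This is a sketch, assuming curl is installed and that Genie listens on port 7001, the port opened in the security group earlier. The function names and the host name are hypothetical placeholders, and no particular URL path is assumed.

```shell
#!/bin/sh
# Build the base URL for the master node. Port 7001 comes from the
# security group step; $1 is the master node's public DNS name.
genie_url() {
    echo "http://$1:7001/"
}

# Succeeds if anything answers HTTP at that address within 10 seconds.
# Only connectivity is checked, not the HTTP status code.
genie_reachable() {
    curl --silent --output /dev/null --max-time 10 "$(genie_url "$1")"
}
```

For example, `genie_reachable ec2-host.example.com && echo "Genie is up"`, replacing the host with your master node's public DNS name.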
- At Netflix, Genie is run as a standalone service outside of an EMR cluster. Each cluster then registers itself with the main Genie service. This playbook just gives you a quick way to try Genie out on a single cluster.
If you have feedback, comments or suggestions, please feel free to contact Peter at Answers for AWS, create an Issue, or submit a pull request.