[Consistent] Big Data: An Application Blueprint for Hadoop

I’m lucky enough to still spend a large portion of my time actually speaking with customers. Conversations with customers are invaluable and always leave me with new perspectives. Of course, we talk about cloud computing, but occasionally the conversation will switch to the topic of big data. More often than not, customers’ big data strategies include Hadoop, a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework [1].
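
To make the Map/Reduce model concrete, here is a minimal word-count job written for Hadoop Streaming, which lets the mapper and reducer be plain executables (Python in this sketch); the file names and cluster paths are purely illustrative.

    #!/usr/bin/env python
    # mapper.py -- reads raw text from stdin and emits "word<TAB>1" per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- sums the counts per word. Hadoop sorts mapper output by
    # key before it reaches the reducer, so identical words arrive adjacent.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The job would be submitted with something like hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /input -output /output (paths illustrative); because each map task depends only on its input split, the framework can re-run any failed fragment on another node.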

Companies like Google, Yahoo, Hulu, Adobe, Spotify, and Facebook are, to some extent, powered by Hadoop [2]. Customers understand the success these companies have experienced due to their effective handling of big data and want to know how they can use Hadoop to replicate that success. Of course, this conversation about big data and Hadoop usually happens after we’ve already discussed Red Hat’s Open Hybrid Cloud vision and philosophy. It’s only natural, then, that they ask how Red Hat’s Open Hybrid Cloud can help them gain the benefits of Hadoop within the constraints of their existing IT infrastructure. Since I’ve had the conversation several times and it usually ends at the whiteboard, I thought it might be useful to sit down and actually show the unique way in which Red Hat’s Open Hybrid Cloud philosophy can be applied to adopting Hadoop. While I’ll be writing about Hadoop, this problem is not unique to Hadoop; it arises any time developers reach beyond what enterprise IT can provide.
Developers: “I want it yesterday”
Good developers want to develop; great developers want to solve problems. For this reason, great developers often adopt technology early for the ways in which it allows them to solve problems that they couldn’t solve with existing technology. Take the scenario of developers adopting Hadoop. They might go to their public cloud of choice, request a few instances, and deploy Hadoop. Of course, deploying Hadoop is not always straightforward and requires some system administration skills. In this case, the developer might use a stack provisioning tool within the public cloud (such as AWS CloudFormation) to launch a pre-configured stack. Once Hadoop is running, the developer starts developing.
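
As a rough sketch of that workflow, the snippet below launches a pre-built CloudFormation stack using boto3, the AWS SDK for Python (a developer might just as easily use the AWS console; the template URL, stack name, key pair, and parameter names here are all placeholders):

    # Sketch: launching a pre-configured Hadoop stack in AWS with boto3.
    # The template URL, stack name, key pair, and parameters are placeholders.
    import boto3

    cfn = boto3.client("cloudformation", region_name="us-east-1")

    cfn.create_stack(
        StackName="hadoop-dev",
        TemplateURL="https://s3.amazonaws.com/example-bucket/hadoop-stack.template",
        Parameters=[
            {"ParameterKey": "KeyName", "ParameterValue": "dev-keypair"},
            {"ParameterKey": "ClusterSize", "ParameterValue": "3"},
        ],
    )

    # Block until the stack (and the Hadoop instances behind it) is up, then
    # print the stack outputs so the developer knows where to connect.
    cfn.get_waiter("stack_create_complete").wait(StackName="hadoop-dev")
    stack = cfn.describe_stacks(StackName="hadoop-dev")["Stacks"][0]
    for out in stack.get("Outputs", []):
        print(out["OutputKey"], "=", out["OutputValue"])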

Developers Deploying Hadoop in Public Cloud

System Administrators: “I want it stable”
Once development is complete, the developer sends a message to the system administrator letting them know that the new application is ready to go (we’ll skip change control, quality assurance, etc. for simplicity). The developer provides the application code in the form of a package to the system administrator. The system administrator launches the corporate standard build on the virtual infrastructure or IaaS cloud in order to deploy the package the developer provided.

Deploying Public Cloud Development within the Datacenter

This scenario presents several issues:

  1. There is no simple way to bring the pre-configured stacks or instances from the public cloud into the enterprise datacenter.
  2. The instances in the public cloud were launched based on the configurations of the developer or whoever developed the stack. This is most likely not compliant with the standard build required by the organization for security or compliance reasons (think PCI, HIPAA, DISA STIG).
  3. Even if the developer thoroughly documented how to set up Hadoop, it introduces another manual step into the process, increasing the likelihood of human error.
  4. The corporate standard build within the enterprise datacenter may vary greatly from the builds in the public cloud. This increases the chance of incompatibility between the application and the infrastructure hosting it.

CloudForms: Delivering managed flexibility
Red Hat’s Open Hybrid Cloud approach can help solve these problems through the use of CloudForms, which is based on the open source projects Aeolus and Katello. CloudForms allows organizations to deploy and manage applications across multiple locations and infrastructure types based on a single application template and policy, maximizing flexibility while simplifying management. The image below provides an overview of how CloudForms changes the way in which developers and system administrators work together to achieve speed while maintaining control. Note that CloudForms can also be used to apply quotas, priorities, and other controls to the consumption of resources by developers. We’ll leave those topics for another post. Let’s look at how CloudForms can solve this problem.

CloudForms provides managed self-service

  1. The system administrator defines an application blueprint for Hadoop. This application blueprint is based on the corporate standard build and is portable across multiple cloud resource providers, such as Red Hat Enterprise Virtualization (RHEV), VMware vSphere, and Amazon EC2. The system administrator places it in the appropriate catalog and grants the developers access to that catalog.
  2. The developer requests the application blueprint be launched into the Development cloud.
  3. Since the system administrator defined the development cloud and added a public cloud (Amazon EC2) as a cloud resource provider, CloudForms orchestrates the launch of the Hadoop application blueprint on the public cloud and returns information on how to connect to the instances comprising the running application.
  4. The developer begins development, the same way they did previously.

The key takeaway is that the developer’s experience is the same. They simply request and begin working. In fact, if pre-configured application stacks didn’t yet exist at the public cloud provider, the developer’s experience actually improves, because CloudForms, using the application blueprint’s services, handles all the complex steps needed to configure Hadoop.
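
To give a feel for what those services hide from the developer, here is a hypothetical sketch of one such configuration step, assuming a Hadoop 1.x layout; the master hostname, file paths, and ports are placeholders, and the real scripts are in the GitHub repository referenced below.

    # Hypothetical sketch of a blueprint service step that configures a
    # Hadoop 1.x node and starts its daemons. The hostname, paths, and ports
    # are placeholders; the real scripts live in the blueprint repository.
    import subprocess

    MASTER = "hadoop-master.example.com"  # placeholder hostname

    CORE_SITE = """<?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://%s:8020</value>
      </property>
    </configuration>
    """ % MASTER

    MAPRED_SITE = """<?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>%s:8021</value>
      </property>
    </configuration>
    """ % MASTER

    # Write the configuration (paths assume a packaged install; adjust as needed).
    with open("/etc/hadoop/conf/core-site.xml", "w") as f:
        f.write(CORE_SITE)
    with open("/etc/hadoop/conf/mapred-site.xml", "w") as f:
        f.write(MAPRED_SITE)

    # On the master node only: format HDFS (a fresh name directory does not
    # prompt), then start the HDFS and MapReduce daemons.
    subprocess.check_call(["hadoop", "namenode", "-format"])
    subprocess.check_call(["start-dfs.sh"])
    subprocess.check_call(["start-mapred.sh"])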

In the previous scenario, when the developer sends a message to the system administrator that the application is ready for production, the system administrator has to go through the cumbersome task of configuring Hadoop from scratch. This would likely include asking the developer questions such as “What version of the operating system are you running? Is SELinux running? What are the firewall rules?”. To which the developer would likely respond, “SELinux? Firewall? Ummm … I think it’s Linux”. Now, with CloudForms and an application blueprint defined, the system administrator can request that the same application blueprint be launched to the on-premise enterprise datacenter. With just a few clicks the same known quantity is running on-premise. The system administrator can now be certain that the development and production platforms are the same. Even better, the launched instances in both the public cloud and on-premise register with CloudForms for ongoing lifecycle management, ensuring the instances stay compliant.
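
Because both environments now come from the same blueprint, those questions become something you verify rather than ask; a trivial, purely illustrative check along these lines could be run on any launched instance (the expected values are placeholders):

    # Illustrative check that a launched instance matches the standard build.
    # The commands are standard RHEL tools; the expected values are placeholders.
    import subprocess

    def run(cmd):
        return subprocess.check_output(cmd, universal_newlines=True).strip()

    release  = run(["cat", "/etc/redhat-release"])  # e.g. "Red Hat Enterprise Linux ..."
    selinux  = run(["getenforce"])                  # expected: "Enforcing"
    firewall = run(["iptables", "-S"])              # dump the active firewall rules

    print("OS release:", release)
    print("SELinux   :", selinux)
    print("Firewall  :", len(firewall.splitlines()), "iptables rules loaded")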

Seamless transition from public cloud to on-premise deployment

Decoupling User Experience from Resource Provider
There is another benefit that might not be immediately apparent in this use case. By moving self-service to CloudForms, the developer’s user experience has been decoupled from that of the resource provider. This means that if enterprise IT wanted to shift development workloads to another public cloud in the future, or move them on-premise, it could do so without the developer’s experience changing.

Application Blueprint for Hadoop
Want to see it in action? I spent a bit of time creating an application blueprint for Hadoop. You can find the blueprint, services, and scripts on GitHub. Here we go. Let’s imagine a developer would like to begin development of a new application using Hadoop. They simply point their web browser to the CloudForms self-service portal and log in.

Once in the self-service portal they select the catalog that has been assigned to them by the CloudForms administrator (likely the system administrator), provide a unique name for their application, and launch.


Within a minute or two (with EC2, often less) they receive a view of the three running instances that comprise their Hadoop development environment.

And off the developer can go. They can download their key and log into their instance. CloudForms even started all the services for them; take a look at the Hadoop JobTracker and DFS health screens below.
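
If you want to check things yourself, those status pages are easy to poll; the snippet below assumes the default Hadoop 1.x web UI ports (50030 for the JobTracker, 50070 for the NameNode) and a placeholder hostname taken from the CloudForms instance details.

    # Poll the Hadoop web UIs to confirm the daemons are up. The hostname is
    # a placeholder; 50030 and 50070 are the default Hadoop 1.x ports for the
    # JobTracker and NameNode (DFS health) status pages.
    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib2 import urlopen          # Python 2

    MASTER = "hadoop-master.example.com"  # placeholder: use the hostname CloudForms reports

    for name, port in (("JobTracker", 50030), ("NameNode / DFS health", 50070)):
        url = "http://%s:%d/" % (MASTER, port)
        print("%-22s %s -> HTTP %s" % (name, url, urlopen(url, timeout=10).getcode()))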


Now, if you recall, when the developer logged in to CloudForms only the development cloud was available. This was because the CloudForms administrator limited the clouds and catalogs the developer could use. Once the application is ready to move to production, however, the system administrator can log in and launch the same application blueprint to the production cloud. In this case, the production cloud is running on Red Hat Enterprise Virtualization.


An important concept to grasp is that CloudForms is maintaining the images at the providers and keeping track of which component outline maps to which image. This means the system administrator could update the image for the underlying Hadoop virtual machines (a.k.a. instances) without having to throw away the application blueprint and start all over. This makes the lifecycle sustainable, something system administrators will really appreciate.

Also, if the system administrator wanted to change the instance sizes to offer smaller or larger instances, or increase the number of instances in the Hadoop environment, they could do so with a few simple clicks and keystrokes.

So there you have it: consistent Hadoop deployments via application blueprints in CloudForms. Now, if you’d like to take things one step further, check out Matt Farrellee’s project, which integrates Hadoop with Condor.

Non-Linked References
[1] http://wiki.apache.org/hadoop/
[2] http://wiki.apache.org/hadoop/PoweredBy
