
The Future of Big Data Platforms: Bringing order to chaos and proving there is a better way.


Guest blog written by Greg Battas, Big Data Chief Technologist, CS Appliances Systems and Solutions

 

You have built a rational systems architecture to support your enterprise. Years ago, you quit building a server for each new application. Then a Hadoop cluster appears in the corner of your data center. Then another. Then an HBase cluster. Each brings with it a different compute-to-storage ratio, different memory configurations and different storage types. On each system’s internal, direct-attached storage sit three copies of big data, much of which is also copied across multiple clusters. Does this sound familiar? Is this as good as it gets?

 

Big data, and the emerging technology that supports it, brings a great deal of promise, but it also brings complex challenges as it becomes more widespread. Hadoop, Cassandra, Spark and other big data projects are built on the idea that we “take the work to the data”, which means they mandate systems with local, direct-attached storage. Growing those systems requires data to be repartitioned and redistributed onto new disks. Sharing information between these systems is limited by the networks over which “big” data must be copied from one cluster to another. Many of the benefits of traditional converged systems, such as the ability to scale compute and storage separately and the ability of multiple systems to share data, are lost. But now that big data is growing from toddler to adolescent, we need to think about how we evolve these system architectures.

 

So what trends do we see today and over the next few years that will affect Big Data system architecture?

 

  • East/West network fabrics – Hadoop was developed when the standard interconnect was 100Mb or 1Gb Ethernet. Today, most clusters are 10GbE, moving to 40GbE and beyond. More importantly, HP believes that, over the next few years, we will see significant movement in east/west fabrics due to several emerging technologies.
  • Multi-temperature software-defined storage – Hadoop and other big data technologies have always been built around the idea of using software-defined storage rather than proprietary storage arrays, but we now see SDS evolving into a bigger strategy that supports tiering of data by temperature across different distributed file systems, NoSQL stores and object stores. Software-defined storage is moving from a niche use case to a strategy for many shops.
  • Container-based resource management – The big data world is moving toward a much more robust resource management model built on containers that are intelligently scheduled to execute on various nodes, rather than VMs that are manually placed by operators. Compared to classic cloud-style provisioning of VMs, these container models are more like an operating system for optimizing parallel workloads.
  • Workload-optimized servers and acceleration – We see a new class of servers emerging, driven by the rise of SoCs, that are optimized to accelerate workloads using on-board silicon such as FPGAs, GPUs, DSPs and other adjunct processors. These servers yield much greater performance, density and power efficiency than traditional servers for the workloads they accelerate, and big data is starting to find ways to use these technologies.

So how does all of this affect Big Data? My team within HP Converged Systems has been testing a different architecture for big data, using some of the workload-optimized servers that HP offers. What we have found is that it is possible to more than double the density of today’s traditional Hadoop cluster, with substantially better price/performance, while at the same time creating a single converged system that allows Hadoop, Vertica, Spark and other big data technologies to share a common pool of data. We are releasing this work as the HP Big Data Reference Architecture.

 

The HP Big Data Reference Architecture deploys a completely standard Hadoop distribution in an asymmetric fashion: storage-related components such as HDFS and HBase run on HP ProLiant SL4540 Scalable servers, while compute-related components run under YARN on HP Moonshot System servers. The interconnect is standard Ethernet, and the protocols between compute and storage are native Hadoop (HDFS, HBase, etc.). By using workload-optimized servers, we achieve significantly better efficiency and actually run faster than we do with compute and storage collocated. Modern Ethernet fabrics can deliver more bandwidth than a server’s storage subsystem, so our storage nodes actually perform better when they are dedicated to running the file system. Hadoop does not try to achieve node locality in our configuration, because data is never collocated with compute, but rack locality works exactly the same way as in a traditional cluster, so as long as we scale within a rack, overall scalability is not affected.
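To make the data-path point concrete, here is a minimal sketch of a compute-side HDFS read using the stock Hadoop Java API. It is not code from the reference architecture itself, and the NameNode host and file path are hypothetical; the point is that the client protocol is identical whether the block sits on a local disk or on a dedicated storage node across the Ethernet fabric, which is why the asymmetric layout needs no changes to Hadoop.

```java
// A minimal sketch, assuming a hypothetical NameNode host ("storage-nn-1")
// and file path; not code from the reference architecture itself.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode endpoint on the storage tier.
        conf.set("fs.defaultFS", "hdfs://storage-nn-1:8020");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/events/part-00000"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Records are processed on the compute node; blocks stream in
                // over the east/west Ethernet fabric from the storage nodes.
                System.out.println(line);
            }
        }
    }
}
```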

 

Now that we have decoupled compute and storage, many of the traditional converged system benefits become possible again. We can scale compute and storage independently by simply adding compute nodes or storage nodes. We find that most workloads respond linearly to additional compute, far beyond the “one spindle per core” rule that most shops use. To make this more flexible, we have worked with Hortonworks to create a new feature in Hadoop called YARN labels. Labels allow us to create pools of compute nodes where applications run, so it is possible to dynamically provision clusters without repartitioning data. Most interesting, with labels we can choose to deploy YARN containers onto compute nodes that are optimized and accelerated for each workload. In our initial configuration, we use the HP Moonshot System with the HP ProLiant m710 Server Cartridge for Hadoop because it is extremely dense and cost effective, but also because it has an RDMA-capable NIC that we use to accelerate shuffles and an Intel Iris GPU to which we might offload compression and other work. Finally, with Hadoop 2.4 and releases from the distribution vendors, such as the Hortonworks Data Platform v2.2 release and the upcoming Cloudera release, HDFS is adding storage tiering, so that we might have multiple pools of storage optimized for SSD, disk, archival and even in-memory use.
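As an illustration of how labels and storage tiering come together, the sketch below uses the standard YARN client and HDFS Java APIs. The “moonshot” label, the “accelerated” queue and the paths are hypothetical, and it assumes an operator has already tagged the compute pool (for example with yarn rmadmin -addToClusterNodeLabels) and mapped it to a queue; it is a sketch under those assumptions, not our exact workflow.

```java
// A minimal sketch, assuming a pre-existing node label ("moonshot"), a queue
// mapped to it ("accelerated"), and hypothetical HDFS paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LabeledSubmissionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();

        // Ask the ResourceManager for a new application and restrict its
        // containers to the (hypothetical) "moonshot" labeled compute pool.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("accelerated-analytics");
        ctx.setQueue("accelerated");                // queue mapped to the label
        ctx.setNodeLabelExpression("moonshot");     // run only on labeled nodes
        // The AM container launch context and resource request are set up here
        // as in any ordinary YARN application, then the context is submitted
        // with yarnClient.submitApplication(ctx).

        // Independently, pin a hot dataset to the SSD tier of the shared
        // storage pool using an HDFS storage policy (Hadoop 2.6+ API).
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            ((DistributedFileSystem) fs).setStoragePolicy(
                    new Path("/data/hot/clickstream"), "ALL_SSD");
        }

        yarnClient.stop();
    }
}
```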

 

Clearly, big data software is evolving away from the simple model of deploying each application on its own cluster made up of a collection of identical nodes. With the significant changes coming in fabrics, storage, container-based resource management and workload-optimized servers, we believe we are creating the next generation of data-center-friendly big data cluster with the HP Big Data Reference Architecture.
