Big Data, big choices

kolga ‎07-18-2013 01:02 AM - edited ‎07-18-2013 01:17 AM

Everyone is talking about ‘Big Data’ nowadays. We see more and more examples of non-software organizations that are beginning to understand the importance of Big Data analytics for their business success, and that are even defining data analytics as their core business. To quote, for example, General Electric’s 2012 Annual Report: “We are making a major investment in software and analytics. We know that industrial companies need to be in the software business.” This statement is pretty remarkable coming from an industrial company such as GE. There are many more examples – from ad firms to city management.

As technology advances, the approach to building Big Data applications changes with the prevailing bottleneck (network, storage or processing power), alternating between virtualization and bare metal, central and distributed deployments, on-premise and cloud. In this post we will look at some of the challenges and choices organizations face today when building Big Data applications.


Data centers optimized for Big Data workloads

Big Data has become synonymous with distributed massive-scale processing, real-time event streaming, and advanced analytics technologies. But Big Data applications are not only about software. Running Big Data workloads cost-effectively requires a new type of data center, optimized for a specific type of workload. These new data centers use a pick-and-choose approach to remove extraneous components that add to the purchase and running costs of the system. The goal is to design the system infrastructure to match the demands of specific workloads as a way of reducing costs.

For example, HP Moonshot servers, optimized for hyperscale web workloads, can run low-performance CPUs with shared power, cooling and networking infrastructure, and as a result take up less space and require less energy in the data center.


Centralized vs. distributed

Building geographically distributed deployments for Big Data implies that there must be a way to move data rapidly between data centers. Moving big data between data centers is extremely resource intensive, because network bandwidth is the most precious resource in a data center environment. This is why data locality is at the heart of the Hadoop implementation. In their engineering blog post, “Moving an Elephant: Large Scale Hadoop Data Migration at Facebook”, Facebook engineers describe the challenges they faced when moving their Hadoop data between two data centers.
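
To get a feel for why bandwidth dominates, here is a rough back-of-the-envelope sketch in Python. The data volume and link speed are illustrative assumptions, not figures from the Facebook post, and the function name `transfer_time_days` is made up for this example. It only divides data size by usable bandwidth and ignores protocol overhead, retries and competing traffic.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # decimal petabyte, in bytes


def transfer_time_days(data_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate the days needed to push data_bytes over a single link.

    link_gbps  -- nominal link speed in gigabits per second (assumed value)
    efficiency -- fraction of the link actually usable, between 0 and 1
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits_to_move = data_bytes * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_second / SECONDS_PER_DAY


# Hypothetical scenario: 1 PB over a fully dedicated 10 Gbit/s link.
days = transfer_time_days(1 * PETABYTE, link_gbps=10)
print(f"{days:.1f} days")  # prints "9.3 days"
```

Even under these optimistic assumptions, a single petabyte ties up a dedicated 10 Gbit/s link for more than a week, which is why moving the computation to the data is usually cheaper than moving the data.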


Moving big data *rapidly* between data centers is an even bigger challenge. Facebook and Google – operating on uniquely massive scales – have to deal with the problem of geographically distributed clusters. There have been publications about Facebook’s ‘Project Prism’ and Google’s ‘Spanner’, but neither project has been open sourced yet; both remain proprietary solutions. One day someone may implement a solution based on the Google Spanner papers.


In addition, Big Data often requires building complex data centers with massive storage and processing power. Allowing each geographically distributed business unit to manage such a complex data center individually drives up the cost of the solution.


To conclude, when it comes to analyzing big data, distributing the data is not a cost-effective option at this point if most of the distributed data must be constantly available for consumption by the applications.


Big Data in the cloud

Given the advantages of the centralized approach for Big Data applications, and the fact that new types of data centers need to be built for Big Data workloads, the cloud becomes a very attractive – and often the only viable – option for many organizations. With cloud solutions, organizations can analyze massive data sets without making a significant capital investment in hardware and management tools. In addition, Big Data cloud solutions often reduce some of the complexity, enabling more immediate deployment of big data technologies without the lengthy process of acquiring new skills and training. Cloud solutions also help developers, who can access preconfigured sandbox environments without having to set up the necessary configurations from scratch.


Finding meaningful context

At the end of the day, the ability to process Big Data is only a small part of the solution. It is all about making sense of the data and providing smart analytics layers that generate actionable insights. Big data is about finding answers to business questions through data. The challenge lies in understanding which data is significant and which is not, depending on the context of the question. And once you have found the significant parts of the data, you still need to understand their meaning. The fact that certain metrics correlate explains neither why they correlate nor the direction of causality (i.e., which metric is the cause and which is the symptom).
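
A small Python sketch illustrates the last point. The metric names and values below are invented for illustration only. The Pearson correlation coefficient is symmetric, so swapping the two metrics gives exactly the same number, and the coefficient alone cannot say which metric drives the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / sqrt(var_x * var_y)


# Hypothetical monitoring metrics, sampled at the same five points in time.
response_ms = [120, 150, 180, 240, 300]
error_rate = [0.5, 0.8, 1.1, 1.9, 2.4]

r_xy = pearson(response_ms, error_rate)
r_yx = pearson(error_rate, response_ms)

# Same value in both directions: strong correlation, no hint of cause.
print(r_xy == r_yx)
```

Slow responses might cause errors (timeouts), errors might cause slow responses (retries), or a third factor might drive both. Deciding between these takes domain context, not a larger data set.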


Kenneth Cukier and Viktor Mayer-Schönberger write in their book “Big Data: A Revolution That Will Transform How We Live, Work, and Think”:


“Big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more.”


In order “to do things at a large scale”, organizations need to choose wisely how to build their applications and which technology to use, and be ready to rethink those choices as the technology advances.

About the Author


Seasoned architect with over 12 years of experience in the enterprise software business, contributing to roadmap and vision setting, high-level architecture and technology review, innovation management and product integration within the HP Software portfolio.
