
Get your (BIG) data into shape!

IraCohen 10-06-2013 03:18 AM - edited 10-06-2013 03:19 AM

Post written by Efrat Egozi-Levi


“The shape of you the shape of me the shape of everything I see …

No shapes are ever quite alike …

Suppose you were shaped like these… or those … or shaped like a BLOGG!”



Did you know a BLOGG has a shape?!?!



This quote is from an innocent-looking book by the famous Dr. Seuss, and it got me thinking that this is indeed the case: everything I see does have a shape, which SURPRISINGLY is easily discernible.

In case you are wondering, a BLOGG is shaped like this:






The matter of shape takes us back a few years. It takes us back to our infancy, when we couldn’t tell left from right or black from white. Yet every day we saw billions of things—sometimes the same things, sometimes completely different ones. Yes! We have spent a lifetime (so far) exposed to endless amounts of data that we continuously measure, monitor, compare and catalog, all with the goal of continuously perfecting our ability to identify shapes.


One can say we had to go through a LOT of data.


In a way, we are now entering a similar phase for machines. We now have the ability to electronically monitor and accumulate all sorts of data for all sorts of things. Some may call this “Big Data”, others may call it the “Internet of Things”, but the essence of both is the same.

This stems naturally from our increasing computational capability to embed software and electronic monitoring everywhere, combined with our ever-growing need and ability to monitor and accumulate data.


Here are some fun facts[i]:

  • When the Sloan Digital Sky Survey started working in 2000, its telescope collected more data in its first few weeks than had been amassed in the entire history of astronomy!
    • Now, a decade later, its archive contains 140 terabytes of information.
    • The Large Synoptic Survey Telescope, due to come on stream in Chile in 2016, will acquire that quantity of data every five days!
  • The world’s effective capacity to exchange information through telecommunication networks was:
    • 281 petabytes in 1986,
    • 2.2 exabytes in 2000,
    • and 65 exabytes in 2007.
  • It is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013!



What does this overflowing amount of data have to do with our BLOGG?


Simply put, what we really want is to make sense of all of this data we keep collecting. We don’t need the raw data we are amassing in such abundance, but the shape of it—the patterns and information within.


What’s stopping us?

Most of the data is not indexed or organized into predefined models or any other structure (e.g. predefined fields or column names); most of it is unstructured. Computerworld analysts estimate that 70-80 percent of all data in organizations is unstructured information[ii].


One approach is harnessing domain experts and data scientists to analyze and mine the data on a need-to-know basis to find the “GOLD” and produce effective insights. This requires:

  • Going over the data
  • Understanding the content
  • Identifying usable patterns
  • Understanding what can be used and how to query it
  • Deciding what to calculate or index for the analysis
  • Designing a research plan
  • Choosing analytics tools and running the analysis
  • Analyzing the results and drawing actionable conclusions

In short, this requires a lot of manual labor. Moreover, the amount of (unstructured) data is increasing exponentially, whereas human genius is not.
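To make the manual labor concrete, here is a minimal sketch of that workflow in Python: a human has read the data, hand-crafted an extraction pattern, decided what to index, and run a simple analysis. The log lines and the pattern are illustrative assumptions, not taken from any real system.

```python
import re
from collections import Counter

# Hypothetical raw, unstructured log lines (illustrative sample data)
raw_lines = [
    "2013-09-01 ERROR disk /dev/sda full",
    "2013-09-01 INFO backup completed",
    "2013-09-02 ERROR disk /dev/sdb full",
]

# Steps 1-3: a human reads the data and hand-crafts a usable pattern
pattern = re.compile(r"(?P<date>\S+)\s+(?P<level>\w+)\s+(?P<message>.+)")

# Steps 4-5: decide what to extract and index for the analysis
records = [m.groupdict() for line in raw_lines if (m := pattern.match(line))]

# Final steps: run a simple analysis and draw a conclusion
level_counts = Counter(r["level"] for r in records)
print(level_counts)  # Counter({'ERROR': 2, 'INFO': 1})
```

Every step that mattered here—the regex, the choice of fields, the choice of what to count—was decided by a person, which is exactly what does not scale.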


So how can we solve this problem?

It seems all we need to do is “make big data easier to use[iii]”. This sounds easy, but in reality it is not. Making it “easier to use” could be achieved by creating a more automated approach for extracting knowledge from unstructured data.

One example is bridging the gap between structured and unstructured data by querying both worlds and then associating them at relevant points[iv]. This is useful when you can assume that structured data relevant to the unstructured data exists and that the connection can be made. But what about the information that appears only in the unstructured data? How can we find and use that?
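As a small illustration of that association step, the sketch below joins entities extracted from free text against a structured table at a shared key. The table contents, the note texts, and the ticket-id extraction rule are all hypothetical.

```python
import re

# Structured world: a small "tickets" table keyed by ticket id (made-up data)
tickets = {"T-100": {"customer": "Acme", "tier": "gold"},
           "T-101": {"customer": "Initech", "tier": "silver"}}

# Unstructured world: free-text support notes mentioning ticket ids
notes = [
    "Escalating T-100, outage reported at 09:12",
    "T-101 resolved after config rollback",
]

# Associate the two worlds at the relevant point (the ticket id)
joined = []
for note in notes:
    m = re.search(r"T-\d+", note)
    if m and m.group() in tickets:
        joined.append({**tickets[m.group()], "note": note})

print(joined[0]["customer"])  # Acme
```

The join only works because a key (the ticket id) happens to appear in both worlds—when no such anchor exists, this approach has nothing to hold on to.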

This brings us back to the heart of the problem. All of these automated approaches usually require the same initial step: data transformation, which in turn requires an ability to represent the data in a simpler way. With unstructured data, it is usually unknown or undefined which patterns are needed for the transformation. This raises the need to find patterns in unstructured data.
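A minimal sketch of what such automatic pattern finding could look like: instead of a hand-written schema, the template shared by a set of unstructured lines is inferred by comparing token positions, and the variable positions become structured fields. The sample lines and the `<*>` placeholder convention are assumptions for illustration.

```python
# Sketch: discover a shared template across unstructured lines by
# comparing token positions -- no hand-written schema needed.
lines = [
    "user alice logged in from 10.0.0.1",
    "user bob logged in from 10.0.0.2",
    "user carol logged in from 10.0.0.3",
]

tokens = [line.split() for line in lines]

# A position where every line agrees is a constant; any other position
# is a variable field worth extracting into structured form.
template = [
    cols[0] if len(set(cols)) == 1 else "<*>"
    for cols in zip(*tokens)
]
print(" ".join(template))  # user <*> logged in from <*>

# The variable positions become the columns of a structured record
var_positions = [i for i, t in enumerate(template) if t == "<*>"]
records = [[toks[i] for i in var_positions] for toks in tokens]
print(records[0])  # ['alice', '10.0.0.1']
```

This toy version assumes every line has the same token count; real log-template miners relax that assumption, but the principle—let the data reveal its own schema—is the same.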

Can such patterns be found without human labor?

In my opinion: yes, it is possible to find patterns without human labor. Over time we calculate more, monitor more, and store more and more. One may say that the human race loves to hoard data. We save copies of our data in abundance, and naturally we use only a small fraction of it. But we are also the ones generating this data, and there is nothing really random about it.

Keep in mind that all of this data has essentially been created (and is still being created) by technology, developers and so on. This gives us reason to assume that the data is not random, but rather that there are patterns and correlations within it. These patterns change gradually over time; they can and need to be found, and subsequently used to extract information and insights.


There are alliances being formed to create such solutions. Some of these alliances include:

  • The partnership between Autonomy and Vertica, part of the HP HAVEn vision to bring a platform for transforming big data into business solutions[v]
  • Microsoft and Oracle[vi]
  • Teradata and Attensity[vii], and so on.

The latter combines an active data warehouse system with text analytics applications that run the data through a series of Attensity extraction “engines” that parse and transform it. The structured data is also processed and resides in a fused relational data warehouse. The HP HAVEn solution is similar, yet offers a wide and diverse suite of Autonomy engines that can be applied to unstructured data, textual as well as non-textual, and is joining forces with the Vertica big database system and Hadoop to scale out its potential.


We need to consider a more basic approach, possibly along the lines of “Deep Learning”. Deep learning is an emerging concept in machine learning today. Its main forte is the utilization of cheap and vast computational resources to learn many simple features from an abundance of unlabeled data[viii]. These principles can be useful for automating the process of identifying simple patterns based on correlations or similarities.
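As a toy stand-in for this idea—not deep learning itself, but the same principle of learning a feature from unlabeled data—the sketch below recovers the dominant direction of synthetic 2-D data by power iteration on its covariance matrix. No labels are involved, and all data and names are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic unlabeled 2-D data stretched along the direction (2, 1):
# the hidden "shape" we want to discover without any labels.
direction = np.array([2.0, 1.0]) / np.sqrt(5.0)
X = rng.normal(size=(500, 1)) * direction + 0.1 * rng.normal(size=(500, 2))
X -= X.mean(axis=0)

# Power iteration on the covariance matrix: repeatedly apply it and
# renormalize; the vector converges to the dominant direction, i.e. a
# single "feature" learned purely from the unlabeled data.
cov = X.T @ X / len(X)
w = rng.normal(size=2)
for _ in range(100):
    w = cov @ w
    w /= np.linalg.norm(w)

# w now points (up to sign) along the hidden direction
print(np.round(np.abs(w), 2))
```

Deep learning stacks many layers of far richer feature learners, but the economics are the same as in this sketch: plentiful unlabeled data plus cheap computation, with no human in the loop.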


The step of automatically transforming (unstructured/unlabeled) data into simple, actionable patterns is a BIG opportunity: we can leverage tools we already have and create new platforms and infrastructures that offer a suite of tools for finding patterns and correlations in the data. The patterns and features identified this way can be used to transform the data into structured form, which can then feed the data analysis and solution-development process.


In a way, we want a gym for getting BIG UNSTRUCTURED data into shape, keeping in mind that “no shapes are ever quite alike”. When we get it in shape, we will be able to gain access to the “dark matter” that is our 70-80 percent unstructured data.


Get your data into shape … see you at the gym…




