
Cloudera CTO on bumping your data to first class


The latest issue of Discover Performance includes an interview with Amr Awadallah, CTO of prominent Hadoop vendor Cloudera, about big data challenges in 2013. In that article, Awadallah shares his thoughts on the subtle shifts ahead in 2013 for organizations struggling to contain big data.


Here, we have an outtake in which Awadallah discusses why a hierarchical approach to data is essential in a big data world, and why we have to wait a little bit longer to get help with big data governance. 


Q: How do organizations decide whether to keep their data in a traditional data warehouse or try a new approach?


Amr Awadallah: I like to use an airplane analogy. Not all of your data deserves to fly in first class. Obviously, your most critical data, your most imperative data, should be in first class, and deserves to bypass the long lines and arrive super quickly. The high cost to store and process that data is justified since the value of that data is extremely high. However, with big data, you are collecting and keeping all of the raw, granular events; hence there is a ton of bytes. Though the value of that data in aggregate is very large, the value on a per-byte basis is very small, due to the sheer volume; hence you can’t justify the cost of flying that data in first class. You move it to economy class, where it will arrive a bit behind but at a much lower cost.


The other key differentiating aspect is agility in handling unstructured data. Traditional data warehouses employ a schema-on-write model and have proprietary storage formats. That makes them very good at answering the “unknown known” questions. These are questions that we can model the schema for ahead of time. The new approach employs a schema-on-read model instead, which means it can accept files of any type in their native format. The data can arrive in any form, and we can infer the schema at a later point in time when we want to ask a new, truly unknown question for the very first time, which is excellent for data exploration.
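To make the contrast concrete, here is a minimal Python sketch of the two models. It is an illustration only, not how Hadoop or any particular data warehouse is implemented: the schema, the field names, and the helper functions (`write_with_schema`, `store_raw`, `read_with_schema`) are all hypothetical. The schema-on-write path rejects anything that does not match a schema fixed in advance, while the schema-on-read path stores every event as-is and applies a schema only when a question is asked.

```python
import json

# Schema-on-write: the schema is fixed before any data is loaded.
# (Hypothetical schema, chosen only for illustration.)
WAREHOUSE_SCHEMA = {"user_id": int, "amount": float}

def write_with_schema(table, record):
    """Accept a record only if it matches the predefined schema exactly."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"record fields {sorted(record)} do not match schema")
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    table.append(record)

# Schema-on-read: raw events are stored untouched, in their native format.
def store_raw(raw_store, line):
    raw_store.append(line)

def read_with_schema(raw_store, fields):
    """Apply a schema chosen at query time; skip events lacking those fields."""
    for line in raw_store:
        event = json.loads(line)
        if all(f in event for f in fields):
            yield {f: event[f] for f in fields}

warehouse = []
write_with_schema(warehouse, {"user_id": 1, "amount": 9.5})

raw = []
store_raw(raw, '{"user_id": 1, "amount": 9.5}')
store_raw(raw, '{"user_id": 2, "page": "/pricing", "referrer": "ad"}')

# A question nobody planned for at load time: which users viewed which page?
page_views = list(read_with_schema(raw, ["user_id", "page"]))
```

In this sketch the second event would be rejected by `write_with_schema` because its fields were never modeled, yet it remains available in the raw store for a query that was not anticipated when the data arrived.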


Q: Does this hierarchical assessment of data get into the issue of data governance? Finding out which data is highest risk, and then automating how it is stored, protected, and deleted based on that risk?


AA: Yes and no. Because our system, Hadoop, and other data platforms are kind of new, they don’t have rich governance tools around them yet. However, the industry is building better governance tools around these systems.


Traditional relational databases already have governance; they already have automation. We need to do the exact same things for big data systems, with just two new quirks added in: first, the ability to deal with unstructured data; and second, the ability to scale by spreading our workloads over hundreds upon thousands of nodes. Everything else is exactly the same. Everything that already happened in the traditional database world, we need to have happen again for this new big data model.


We are getting there, but we are not there yet. That’s why today governance is a factor, but over time this will not be a factor anymore. Many of the pieces are falling into place already.


For more from Awadallah, read the full article in the latest issue of Discover Performance. Subscribe to Discover Performance to get more insights on IT strategy and performance delivered to your inbox.

