Grounded in the Cloud
18 Hadoop Features You Can't Live Without


Guest Post: Sameer Nori, Sr. Product Marketing Manager, MapR


If you’ve been looking into an Apache Hadoop distribution for your big data, you’ve probably encountered plenty of conflicting opinions about which features are truly essential. This article details eighteen specific characteristics, in four categories, that are critical to your final Hadoop distribution selection.




1. Data Ingest - Streaming Writes

Your data needs to arrive in your Hadoop cluster as quickly as possible, which means streaming writes should handle the work of loading and unloading data. Many Hadoop distributions use batch or semi-streaming processes that become increasingly inefficient as data volumes grow.
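To make the streaming-versus-batch distinction concrete, here is a minimal sketch (hypothetical file path; a local temp file stands in for an NFS-mounted cluster directory) in which each record is written and flushed as soon as it arrives, so readers see data immediately instead of waiting for a batch to fill:

```python
import os
import tempfile

# Sketch: streaming ingest writes each record as it arrives, so the data is
# visible to readers almost immediately; batch ingest would hold records in
# memory until a batch fills, adding latency. The local file below stands in
# for a hypothetical cluster mount point.

def stream_ingest(records, path):
    """Append each record to the target file as soon as it arrives."""
    with open(path, "a") as f:
        for rec in records:
            f.write(rec + "\n")
            f.flush()  # make the record visible to readers right away

path = os.path.join(tempfile.mkdtemp(), "events.log")
stream_ingest(["event-1", "event-2", "event-3"], path)
with open(path) as f:
    lines = f.read().splitlines()
print(lines)  # ['event-1', 'event-2', 'event-3']
```

In a real deployment the target path would be a cluster mount rather than a temp directory, but the pattern is the same: no staging step between arrival and availability.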


2. Distributed Metadata Architecture

The original Hadoop architecture utilized a single NameNode to manage the cluster’s metadata. This was quickly recognized as a single point of failure (SPOF). Some distributions work around this with a standby NameNode that takes over if the active NameNode fails. While this is a step up from the default, it isn’t enough to qualify your system as highly available. Look for a distribution that instead uses a distributed metadata architecture. This removes the NameNode and completely eliminates the SPOF problem.


3. High Performance with Consistent Low Latency

Low latency is important, and its consistency is equally important. Refer to the graph below:

[Figure: query latency over time for two Hadoop distributions, one steady and one volatile]

In this comparison of two Hadoop distributions, you can see that some distributions are subject to unpredictable latency spikes. Be sure that your distribution delivers high performance and low latency on a consistent basis.
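One way to quantify "consistent" is to look at the spread of latency samples, not just the average. The sketch below uses illustrative numbers (not real benchmarks): both series have broadly similar averages, but their jitter differs sharply:

```python
import statistics

# Illustrative latency samples in milliseconds (hypothetical numbers, not
# measurements of any real distribution). Both series average roughly 10 ms.
steady   = [9, 10, 11, 10, 9, 11, 10, 10]
volatile = [2, 3, 25, 4, 30, 3, 28, 5]

def jitter(samples):
    """Standard deviation of latency samples: lower means more predictable."""
    return statistics.stdev(samples)

print(round(jitter(steady), 2), round(jitter(volatile), 2))
```

A mean-only comparison would hide the difference; tracking a spread metric (standard deviation, or high percentiles such as p99) is what surfaces the volatility the graph illustrates.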


4. Access to Public Cloud Platforms

To ensure your ability to scale your data with the ever-changing demands of technology, it is important that your distribution has the option of working with public cloud platforms such as Amazon Web Services or Google Compute Engine.




5. High Availability (HA)

Your chosen distribution should have High Availability features as a default in your Hadoop architecture. This includes a distributed metadata architecture. You should be confident that your system will be self-healing in the event of multiple failures.


6. MapReduce HA

MapReduce jobs should continue running through a system failure. Failover should be automated; your ability to keep using your job and task trackers should not depend on manual restarts.


7. Rolling Upgrades

One of the best characteristics of Hadoop is that it constantly evolves with the needs of the big data industry – which means you should anticipate many upgrades. Rolling upgrades allow you to take advantage of new system improvements without incurring any downtime. Do not settle for a distribution that doesn’t include this in its architecture.


8. Data and Metadata Replication

By default, Hadoop will replicate your data three times. Use a distribution that not only replicates Hadoop’s file chunks, but also table regions and metadata. For additional protection, you should store one of the three copies on a different rack.
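The capacity and placement implications of three-way replication can be sketched in a few lines. This is a simplified illustration (rack names are hypothetical, and real placement policies vary by distribution); it shows the raw-to-usable storage ratio and the article's recommendation of keeping one replica off-rack:

```python
# Sketch: with Hadoop's default replication factor of 3, usable capacity is
# one third of raw capacity, and one of the three replicas should land on a
# different rack. Rack names are hypothetical; real placement policies are
# more involved.

REPLICATION_FACTOR = 3

def usable_tb(raw_tb, factor=REPLICATION_FACTOR):
    """Effective capacity after accounting for replication overhead."""
    return raw_tb / factor

def place_replicas(local_rack, racks, factor=REPLICATION_FACTOR):
    """Keep two copies on the writer's rack, push one copy off-rack."""
    off_rack = next(r for r in racks if r != local_rack)
    return [local_rack] * (factor - 1) + [off_rack]

print(usable_tb(300))  # 100.0 -> 300 TB raw yields 100 TB usable
print(place_replicas("rack-a", ["rack-a", "rack-b", "rack-c"]))
```

When sizing a cluster, this overhead is why raw disk quotes can be misleading: replication (of data, table regions, and metadata alike) must come out of the same raw pool.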


9. Point-In-Time Snapshots

Not all Hadoop snapshots are alike. Use a distribution that provides true point-in-time snapshots. This means the snapshot captures a real representation of the data at the moment the snapshot was taken. Many other distributions use the default HDFS snapshot mechanism, which only captures closed files. Also be sure that your snapshot system is compatible with all Hadoop applications without needing to access the HDFS API. Finally, the snapshot system of an optimal distribution won’t require duplicating your data. By sharing the same storage as your live information, snapshots significantly reduce any impact on your performance and scalability.


10. Mirroring Disaster Recovery

Your system should anticipate catastrophic system failures to make recovery simple. Mirroring is the best preventative measure you can take to make disaster recovery not so disastrous. Your Hadoop distribution’s mirroring should be asynchronous and perform auto-compressed, block-level data transfer of differential changes. Your system should mirror both its data and metadata to ensure that applications are able to restart immediately upon site failure.
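The "compressed, block-level transfer of differential changes" idea can be illustrated with a toy diff: split the data into fixed blocks, hash each block, and ship (compressed) only the blocks whose hashes changed since the last sync. This is a sketch of the concept, not any vendor's actual mirroring protocol, and the block size is deliberately tiny:

```python
import hashlib
import zlib

BLOCK = 4  # bytes per block; real systems use megabyte-scale blocks

def block_hashes(data):
    """Hash each fixed-size block of the byte string."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def changed_blocks(old, new):
    """Return (index, compressed_bytes) for each block that differs."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    out = []
    for i, h in enumerate(new_h):
        if i >= len(old_h) or old_h[i] != h:
            out.append((i, zlib.compress(new[i * BLOCK:(i + 1) * BLOCK])))
    return out

old = b"aaaabbbbcccc"
new = b"aaaaBBBBcccc"
delta = changed_blocks(old, new)
print([i for i, _ in delta])  # [1] -> only the middle block changed
```

Because only changed blocks cross the wire, the mirror link carries a small delta rather than the whole dataset – which is what makes asynchronous mirroring practical over WAN links.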




11. Comprehensive Management Tools

All Hadoop distributions have their own set of management tools. Take the necessary time to evaluate the management tools that any given distribution has to offer. Look at the breadth of the management toolset and determine whether it covers all of your management needs.


12. Heat Maps, Alarms, Alerts

At any time you should easily be able to get a good grip on the condition of your Hadoop system. The monitoring capabilities of your Hadoop architecture should include heat maps, alerts and alarms that let you see the health, memory, CPU and other important metrics of your nodes at a glance.
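At its core, this kind of at-a-glance monitoring maps each node's metrics to a status that a heat map can color and an alarm can fire on. The sketch below uses hypothetical node names and thresholds:

```python
# Sketch: map per-node metrics to an ok/alert status, the raw material for a
# heat map. Thresholds and node names are hypothetical.

THRESHOLDS = {"cpu_pct": 90, "mem_pct": 85, "disk_pct": 95}

def node_status(metrics):
    """Return ('alert', breached_metrics) or ('ok', [])."""
    breaches = [k for k, limit in THRESHOLDS.items()
                if metrics.get(k, 0) > limit]
    return ("alert", breaches) if breaches else ("ok", [])

nodes = {
    "node-1": {"cpu_pct": 45, "mem_pct": 60, "disk_pct": 70},
    "node-2": {"cpu_pct": 97, "mem_pct": 88, "disk_pct": 40},
}
heat_map = {name: node_status(m) for name, m in nodes.items()}
print(heat_map)
```

A real management layer adds history, aggregation, and notification channels on top, but the health signal it renders starts from exactly this kind of threshold check.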


13. Integration with REST API

In order to keep your connectivity open to different open source and commercial Hadoop tools, your architecture should integrate via a REST API.
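The practical payoff of a REST API is that any HTTP-capable tool can drive the cluster. As a sketch (the host, port, endpoint path, and token below are all hypothetical, and the request is built but never sent), a management call is just an authenticated HTTP request:

```python
import urllib.request

# Hypothetical management endpoint; nothing here reflects a specific
# vendor's API. The request is constructed but not actually sent.
BASE = "https://cluster.example.com:8443/rest"

def node_list_request(token):
    """Build an authenticated GET request for a hypothetical node-list call."""
    return urllib.request.Request(
        f"{BASE}/node/list",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

req = node_list_request("demo-token")
print(req.full_url, req.get_method())
```

Because the surface is plain HTTP plus JSON, the same endpoint is reachable from scripts, dashboards, and third-party monitoring tools without any vendor SDK.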


Data Access


14. File System Access (POSIX)

A POSIX file system that supports random read/write operations, and that provides NFS access on Hadoop, opens your system up to far more capabilities than default HDFS offers.
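The key capability stock HDFS lacks is overwriting bytes in the middle of an existing file – HDFS files are append-only. With a POSIX (e.g. NFS-mounted) file system, a standard seek-and-write works. In this sketch a local temp file stands in for the cluster mount:

```python
import os
import tempfile

# Sketch: random (in-place) writes, which a POSIX file system permits and
# append-only HDFS does not. A local temp file stands in for a hypothetical
# NFS mount of the cluster.

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"hello world")

with open(path, "r+b") as f:   # open for in-place update
    f.seek(6)                  # jump into the middle of the file
    f.write(b"hdfs!")          # overwrite existing bytes, not append

with open(path, "rb") as f:
    contents = f.read()
print(contents)  # b'hello hdfs!'
```

Any existing application that uses ordinary file APIs like these can therefore run against the cluster unmodified, which is the real point of POSIX/NFS access.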


15. File I/O

Some distributions make their system’s file input/output append-only. Use a distribution that supports full read/write file I/O.


16. Developer Productivity Tools

Considering your developers will be working frequently within your Hadoop platform, use a distribution that makes it easy for them to do so. They shouldn’t have to go through administrators for simple tasks like creating tables. Your distribution should also provide developer tools, such as those that let them work directly with data on the cluster.


17. Security Features

Ensuring your data’s security is a high priority. Your distribution should include fine-grained permissions on files, directories, jobs, queues, and administrative operations. Access control lists should also be available by default for tables, columns, and column families.
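"Fine-grained" here means the access check keys on both the object (down to the column level) and the operation. The sketch below models that with a simple ACL table; all object names and users are hypothetical:

```python
# Sketch: fine-grained access control as (object, operation) -> allowed users.
# Objects, operations, and users are hypothetical illustrations.

acls = {
    ("table:sales", "read"):      {"alice", "bob"},
    ("table:sales", "write"):     {"alice"},
    ("column:sales.ssn", "read"): {"alice"},  # column-level restriction
}

def allowed(user, obj, op):
    """Deny by default: permit only users explicitly listed for (obj, op)."""
    return user in acls.get((obj, op), set())

print(allowed("bob", "table:sales", "read"))       # True
print(allowed("bob", "column:sales.ssn", "read"))  # False
```

Note the deny-by-default stance: bob can read the table as a whole but is still blocked from the sensitive column, which is exactly the granularity the section calls for.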


18. Wire-Level Authentication with Kerberos and Native

Your distribution should authenticate at the wire level, supporting both Kerberos and native authentication mechanisms, so that traffic between nodes and clients is protected.

Evaluate the priority of each of these essentials as you take a look at the Hadoop distributions you are considering. There are many other features beyond the eighteen we have listed, but if your distribution doesn’t include these as a minimum, you may be jeopardizing the productivity of your big data investment.

Senior Manager, Cloud Online Marketing
About the Author


I manage the HPE Helion social media and website teams promoting the enterprise cloud solutions at HPE for hybrid, public, and private clouds. I was previously at Dell promoting their Cloud solutions and was the open source community manager for OpenStack and at Rackspace and Citrix Systems. While at Citrix Systems, I founded the Citrix Developer Network, developed global alliance and licensing programs, and even once added audio to the DOS ICA client with assembler. Follow me at @SpectorID
