Behind the scenes @ Labs

HP Labs and HP Vertica enhance R to simplify Big Data processing

‎03-06-2014 06:18 PM - edited ‎09-30-2015 07:02 AM

Contributed by Indrajit Roy, HP Labs principal researcher and technical lead for the Distributed R project


Editor’s Note: Distributed R began at HP Labs as a summer internship project in 2011. During the last three years a dedicated team of HP Labs researchers and HP Vertica developers has continued to work on the project and developed the technology to the point where it has now been transferred to HP Vertica’s marketplace for commercial use.



From left to right: HP Labs researchers Vanish Talwar, Alvin AuYoung, Rob Schreiber, and Indrajit Roy. Not in the photo: interns Shivaram Venkataraman, Erik Bodzsar, and Kyungyong Lee.



Data scientists are key to unlocking actionable insights from data – a task that’s becoming increasingly complex as we tackle ever larger sets of both structured and unstructured information. At HP, we realize the need to empower data scientists in the ‘Big Data’ era. To that end, HP Vertica announced last month the debut of Distributed R, a platform developed in HP Labs to run complex machine learning, statistical analysis, and graph processing on a Big Data scale.


Every data scientist has his or her favorite analysis tool. For the last decade, the statistical programming language R has been a popular choice – it’s open source and used by millions. However, R has multiple limitations when applied to Big Data. The main issues: R does not scale to large datasets, and it offers almost no parallel algorithms.


With Distributed R, we have overcome many of R’s limitations. Using the new platform, data scientists can continue to use the familiar R environment while benefiting from parallel algorithms and a scalable, high-performance environment. For data scientists unfamiliar with distributed programming, Distributed R simplifies how a cluster of servers can be used to complete analyses in a matter of minutes.


Distributed R started as an HP Labs summer internship project in 2011. Its aim was to run machine learning and graph algorithms on very large datasets: billions of records and terabytes of data. We succeeded in doing that and more, and the technology is now being transferred to HP Vertica for commercial use.


HP customers can already use databases like HP Vertica to store and efficiently analyze data using SQL. With the addition of Distributed R, they can perform complex analyses on top of HP Vertica. For example, healthcare customers can use fast, ad-hoc queries in HP Vertica to perform patient analytics, discover business trends, and comply with regulations. To model patient health and predict complications, analysts may need to run clustering and classification algorithms that are not easily expressed in SQL. These algorithms can now be run using Distributed R.
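
To give a rough sense of why a clustering algorithm is awkward in SQL yet fits a partitioned, parallel model, here is a minimal sketch of k-means over data split into partitions. It is plain Python and is not the Distributed R API: the function names (`partial_sums`, `kmeans_step`, `kmeans`), the use of threads in place of cluster workers, and the one-dimensional toy data are all assumptions made for illustration.

```python
# Illustrative sketch only: plain Python, NOT the Distributed R API.
# Threads stand in for cluster workers; all names here are hypothetical.
from concurrent.futures import ThreadPoolExecutor


def partial_sums(partition, centers):
    """For one partition, return per-cluster sums and counts of 1-D points."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in partition:
        # Assign the point to its nearest center.
        k = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        sums[k] += x
        counts[k] += 1
    return sums, counts


def kmeans_step(partitions, centers):
    """One k-means iteration: process partitions in parallel, then combine."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: partial_sums(p, centers), partitions))
    new_centers = []
    for k, old in enumerate(centers):
        total = sum(s[k] for s, _ in results)
        count = sum(c[k] for _, c in results)
        # Keep the old center if no point was assigned to it.
        new_centers.append(total / count if count else old)
    return new_centers


def kmeans(partitions, centers, iterations=10):
    """Repeat the parallel step a fixed number of times."""
    for _ in range(iterations):
        centers = kmeans_step(partitions, centers)
    return centers


if __name__ == "__main__":
    data = [[1.0, 2.0, 10.0], [3.0, 11.0, 12.0]]  # two toy partitions
    print(kmeans(data, [0.0, 5.0]))
```

Each worker touches only its own partition and returns a small summary, so only the per-cluster sums and counts need to be combined between iterations. The loop repeats until the centers settle, which is the kind of iterative computation that is hard to express as a single SQL query.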


While Distributed R can be used as a standalone platform with any backend store, the combination of HP Vertica and Distributed R has multiple benefits. Users can perform SQL analysis and pre-processing in HP Vertica, do their complex modeling in Distributed R, and run predictions in-database. This integrated approach offers a convenient way to deploy and manage the full life-cycle of data analysis.


Distributed R is currently in beta and available for free on the HP Vertica marketplace. HP Vertica and HP Labs are working closely to improve the software and add more features.


Our vision is to continue to develop the system as an open platform for data mining. We look forward to community engagement and welcome your contributions as we develop Distributed R further.


Sign up for the webinar about HP Vertica Distributed R on March 11. 


Photography by Serge Vejvoda
