Behind the scenes at Labs

HP Labs and HP Vertica enhance R to simplify Big Data processing


Contributed by Indrajit Roy, HP Labs principal researcher and technical lead for the Distributed R project


Editor’s Note: Distributed R began at HP Labs as a summer internship project in 2011. During the last three years a dedicated team of HP Labs researchers and HP Vertica developers has continued to work on the project and developed the technology to the point where it has now been transferred to HP Vertica’s marketplace for commercial use.



From left to right: HP Labs researchers Vanish Talwar, Alvin AuYoung, Rob Schreiber, and Indrajit Roy. Not in the photo: interns Shivaram Venkataraman, Erik Bodzsar, and Kyungyong Lee



Data scientists are key to unlocking actionable insights from data – a task that’s becoming increasingly complex as we tackle ever larger sets of both structured and unstructured information. At HP, we realize the need to empower data scientists in the ‘Big Data’ era. To that end, HP Vertica announced last month the debut of Distributed R, a platform developed in HP Labs to run complex machine learning, statistical analysis, and graph processing on a Big Data scale.


Every data scientist has his or her favorite analysis tool. For the last decade, the statistical programming language R has been a popular choice – it’s open source and used by millions. However, R has multiple limitations when applied to Big Data. The main issue: R does not scale to large datasets, and it features almost no parallel algorithms.


With Distributed R, we have overcome many of R’s limitations. Using the new platform, data scientists can continue to use the familiar R environment while benefiting from parallel algorithms and a scalable, high-performance environment. For data scientists unfamiliar with distributed programming, Distributed R simplifies how a cluster of servers can be used to complete analyses in a matter of minutes.


Distributed R started as an HP Labs summer internship project in 2011. Its aim was to run machine learning and graph algorithms on very large datasets: billions of records and terabytes of data. We succeeded in doing that and more, with the technology now being transferred to HP Vertica for commercial use.


HP customers can already use databases like HP Vertica to store and efficiently analyze data using SQL. With the addition of Distributed R, they can perform complex analyses on top of HP Vertica. For example, healthcare customers can use fast, ad-hoc queries in HP Vertica to perform patient analytics, discover business trends, and comply with regulations. To model patient health and predict complications, analysts may need to run clustering and classification algorithms that are not easily expressed in SQL. These algorithms can now be run using Distributed R.
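To make the point about clustering concrete, here is a minimal sketch of a data-parallel k-means loop. It is written in Python purely for illustration and does not use Distributed R's actual API; the function names `partial_stats` and `kmeans`, the thread pool standing in for a cluster of servers, and the sample data are all assumptions made for this example. It shows why such algorithms fit a partitioned, iterative model better than a single SQL query: each pass assigns points to the nearest center within every partition independently, then combines the partial sums to update the centers, and the whole process repeats.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(partition, centers):
    """For one partition, return per-cluster [sum_x, sum_y, count]."""
    stats = [[0.0, 0.0, 0] for _ in centers]
    for x, y in partition:
        # Assign the point to its nearest center (squared Euclidean distance).
        k = min(
            range(len(centers)),
            key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
        )
        stats[k][0] += x
        stats[k][1] += y
        stats[k][2] += 1
    return stats


def kmeans(partitions, centers, iterations=10):
    """Run a fixed number of k-means passes over partitioned 2-D points."""
    with ThreadPoolExecutor() as pool:
        for _ in range(iterations):
            # Map step: every partition is processed independently.
            results = list(
                pool.map(lambda p: partial_stats(p, centers), partitions)
            )
            # Reduce step: combine partial sums and recompute each center.
            new_centers = []
            for k, old in enumerate(centers):
                sx = sum(r[k][0] for r in results)
                sy = sum(r[k][1] for r in results)
                n = sum(r[k][2] for r in results)
                # Keep the old center if no point was assigned to it.
                new_centers.append((sx / n, sy / n) if n else old)
            centers = new_centers
    return centers


if __name__ == "__main__":
    # Hypothetical data: three partitions, as if stored on three servers.
    partitions = [
        [(0, 0), (1, 0)],
        [(10, 10), (11, 10)],
        [(0, 1), (10, 11)],
    ]
    print(kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)]))
```

Only the small per-cluster sums travel between the partitions and the coordinating step, not the raw records, which is what lets this pattern scale across servers.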


While Distributed R can be used as a standalone platform with any backend store, the combination of HP Vertica and Distributed R has multiple benefits. Users can perform SQL analysis and pre-processing in HP Vertica, do their complex modeling in Distributed R, and run predictions in-database. This integrated approach offers a convenient way to deploy and manage the full life-cycle of data analysis.


Distributed R is currently in beta and available for free on the HP Vertica marketplace. HP Vertica and HP Labs are working closely to improve the software and add more features.


Our vision is to continue to develop the system as an open platform for data mining. We look forward to community engagement and welcome your contributions as we develop Distributed R further.


Sign up for the webinar about HP Vertica Distributed R on March 11. 


Photography by Serge Vejvoda
