Behind the scenes at Labs

Distributed Data Structures in R: How an idea became an artifact

By Curt Hopkins, Managing Editor, Hewlett Packard Labs

Even in a research lab like Hewlett Packard Labs, where ideas can take years to become products, an obsession can occasionally flower quickly and produce technology that immediately alters how an industry does business.

An example of that process is Distributed Data Structures in R (ddR), a new set of computational primitives for distributed and parallel computing in R, an open source “language and environment for statistical computing and graphics.”

We have talked about Labs’ work on ddR a number of times here on Behind the Scenes @ Labs.

But one thing we never discussed was how ddR went from the glimmer in a researcher’s eye to a tool for the R community. So we asked two of the people behind this transition – Indrajit Roy, Principal Researcher at Hewlett Packard Labs, and Edward Ma, Software Engineer for HPE Vertica and Primary Developer for the ddR project – to tell the story.

Indrajit Roy and the open source community

R is a community. It’s an open source language. But we realized that R was slower than it should be. Super users write custom R programs while high-level users employ Excel. But mid-level users need a way to use R efficiently without taking the time out to create proprietary tools. That’s the challenge we are trying to solve with ddR.

For any open source product not owned by a company, the only way of getting things done is by consensus. So we asked ourselves, how could we influence the R community? By bringing people together, of course.

So, after brainstorming with Michael Lawrence, an R-core member who works at Genentech, we decided to create an R package that would make it easy to interface with distributed systems. We proposed a development plan to the R community that included Labs funding the work, and told them that HPE Vertica would second two employees, Edward Ma and Vishrut Gupta, to work on it.

We worked on the development of ddR for almost five months, beginning in May 2015, and were able to release the first version of ddR in October 2015. We released all the code on GitHub, which is an open code exchange platform.

Because of the nature of Labs research, very few projects are openly developed. However, for ddR, we embraced open source development, which underscores our commitment to the community. ddR is one of the few Labs projects that has received feedback, and even contributions, from other companies.

To keep transparency at the forefront, we attended multiple R meetups, including one at Microsoft, and presented our work to the community.

Since it was launched, over 2,500 users have downloaded ddR.

Distributed Data Structures in R is not a Labs or HPE product. The way to think about it is as an improvement to R, a layer that can be put on top of other systems. We were able to create ddR by leveraging both Labs’ experience and competencies, and by reaching across borders to work with R community members from different parts of HPE, and even different companies.
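The idea of "a layer that can be put on top of other systems" can be illustrated with a small sketch. It is written in Python as a conceptual analogue only; it is not ddR's R API, and every name in it (`Backend`, `SerialBackend`, `ThreadBackend`, `DList`, `partition`, `dmap`, `collect`) is hypothetical. The point is only that user code is written once against a partitioned data structure, while the engine that processes the partitions can be swapped.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Protocol


class Backend(Protocol):
    """Hypothetical backend contract: apply fn to every partition."""

    def map_parts(self, fn: Callable[[list], list], parts: List[list]) -> List[list]:
        ...


class SerialBackend:
    """Processes partitions one after another in the current thread."""

    def map_parts(self, fn, parts):
        return [fn(p) for p in parts]


class ThreadBackend:
    """Processes partitions concurrently with a thread pool."""

    def __init__(self, workers: int = 2):
        self.workers = workers

    def map_parts(self, fn, parts):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map preserves the order of the input partitions
            return list(pool.map(fn, parts))


class DList:
    """A list split into partitions; all work is delegated to a backend."""

    def __init__(self, parts: List[list], backend: Backend):
        self.parts = parts
        self.backend = backend

    def dmap(self, fn: Callable[[Any], Any]) -> "DList":
        # Apply fn element-wise inside each partition, via the backend.
        new_parts = self.backend.map_parts(lambda part: [fn(x) for x in part], self.parts)
        return DList(new_parts, self.backend)

    def collect(self) -> list:
        # Gather all partitions back into one ordinary list.
        return [x for part in self.parts for x in part]


def partition(data: list, nparts: int, backend: Backend) -> DList:
    """Split data into at most nparts contiguous partitions."""
    size = max(1, -(-len(data) // nparts))  # ceiling division, at least 1
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    return DList(parts, backend)


if __name__ == "__main__":
    data = list(range(10))
    for backend in (SerialBackend(), ThreadBackend(workers=2)):
        squares = partition(data, 3, backend).dmap(lambda x: x * x).collect()
        print(type(backend).__name__, squares)
```

The calling code in the final loop is identical for both backends; only the object passed to `partition` changes. That is the property a common layer is meant to provide, under the assumptions stated above.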

The cross-industry R Consortium is supporting the ddR project by providing us funding to take the next steps: enhancing the package and integrating it with Spark.

Edward Ma and the developers

I joined HPE Vertica in late 2014, working on Distributed R, a high-performance platform for R, which was also spawned from the creative cauldron of Hewlett Packard Labs. (Distributed R is available as an open source download on Vertica’s GitHub page.)

We were a team of 10 to 15 people from Cambridge and the Bay Area who wanted to make R highly scalable for people doing advanced analytics in the easy-to-use R language.

We started to think about how we could make HPE a bigger contributor to the R community. We realized that, although Distributed R was a good platform, there were other competing solutions out there, like H2O and SparkR, which gave people similar performance and flexibility. This multitude of platform options created another problem. It made the APIs for programming parallel and distributed algorithms in R very fragmented. It was hard for people to learn all of these different APIs, especially if they wanted to try out different engines and backends. So they often just stuck to one platform.

Indrajit reached out to his connections in the R community, and together we started thinking about how we could standardize an API for parallel computing in R. We hosted the Workshop for Distributed Computing in R in January 2015, where we met with some heavyweights from the R community. That event planted the seed for this whole project.

Last summer, Indrajit approached me and asked if I could be the main developer for ddR. I thought it was a very cool project and was excited to start. Over the next few months, slowly but surely, with contributions from Indrajit, Michael Lawrence of Genentech and Vishrut Gupta, my colleague at HPE Vertica, we pushed out a package. Last November, ddR was accepted to the Comprehensive R Archive Network (CRAN), along with several algorithm packages.

I am very excited that ddR may one day be integrated with the R language itself.

Photo by Steve Snodgrass
