Behind the scenes at Labs
A software view of The Machine Project

ssinghal

Why does The Machine represent a new computing architecture?

The current mechanisms for scaling computing clusters treat memory as inherently joined at the hip to the processor. In order to scale processing, we add more and more processors in a horizontally scalable cluster, each with its own memory and I/O. The net result of this architecture is data fragmentation. To operate on large data sets, the architecture requires that the data be partitioned (or replicated) across the cluster. While this is not an issue when the cluster is used primarily for embarrassingly parallel computation, or for jobs where data can be cleanly partitioned, an increasing fraction of big-data workloads utilize multi-terabyte datasets where access patterns are more difficult to specify cleanly.

Unlike traditional clusters, The Machine’s Memory-Driven Computing architecture puts data at the center. The primary hypothesis behind The Machine is that next-generation memory technologies will enable large (as in multi-petabyte) pools of persistent memory to be created cost-effectively. If we withhold judgement for the moment and assume that creating such large memory pools is economically and technically viable, a redesign of the basic computing architecture becomes feasible.

Consider a large pool of memory that can be accessed in parallel at high (memory) speeds from each CPU in our distributed cluster. The memory pool allows all of the data to be held in one place, and any of thousands of computational cores to access any part of this data directly. Each processor node in The Machine runs an independent operating system and can see all of the memory in the pool as if it were its own. However, the underlying operating system instances collaborate across the cluster, so that shared regions of memory can be created within the pool, enabling applications that span multiple nodes to share memory across those nodes. This change is at the heart of Memory-Driven Computing.
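To make this concrete, here is a minimal sketch that emulates the idea on a conventional operating system: a memory-mapped file stands in for the fabric-attached memory pool, and two ordinary processes stand in for two nodes mapping the same shared region. The path, size, and layout are illustrative assumptions for the sketch, not The Machine's actual interfaces.

```python
# Emulating a shared, byte-addressable memory pool with a memory-mapped file.
# On The Machine the pool is fabric-attached memory visible to every node;
# here two ordinary OS processes stand in for two nodes.
import mmap
import os
import struct
from multiprocessing import Process

POOL_PATH = "/tmp/fam_pool.bin"  # illustrative stand-in for the memory pool
POOL_SIZE = 1 << 20              # 1 MiB for the sketch; real pools are far larger

def writer(offset, value):
    # "Node 1" stores a 64-bit integer directly into the shared pool.
    with open(POOL_PATH, "r+b") as f:
        mem = mmap.mmap(f.fileno(), POOL_SIZE)
        struct.pack_into("<Q", mem, offset, value)
        mem.flush()

def reader(offset):
    # "Node 2" sees the same bytes with an ordinary load: no message passing.
    with open(POOL_PATH, "r+b") as f:
        mem = mmap.mmap(f.fileno(), POOL_SIZE)
        (value,) = struct.unpack_from("<Q", mem, offset)
        print("node 2 read", value)

if __name__ == "__main__":
    with open(POOL_PATH, "wb") as f:
        f.truncate(POOL_SIZE)
    for proc in (Process(target=writer, args=(0, 42)),
                 Process(target=reader, args=(0,))):
        proc.start()
        proc.join()
    os.remove(POOL_PATH)
```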

What does The Machine mean for software?

The Machine fundamentally changes three assumptions that most software writers make: that memory resources are precious and need to be conserved; that memory is volatile, so persistent state needs to be preserved elsewhere; and finally, that in large-scale applications, processes have to communicate across I/O networks using message passing. Even though these assumptions are not always explicit, much of computer science incorporates them in the way algorithms are designed and programs are written. Within The Machine, however, memory is abundant, so space efficiency within programs is no longer an issue; it is persistent, so much of the software needed to ensure that state is retained across faults or power failures becomes redundant; and it is shared, implying that data partitioning, replication, and message passing may no longer be needed, except for archiving data and moving it into and out of the cluster.

If your reaction is to say “but that won’t work, because…”, welcome to what we at Hewlett Packard Labs first thought when we started down this journey three years ago.

One of the lessons we learned is that over the last fifty years, we have developed a set of principles and best practices that may no longer hold true with The Machine's architecture. It is hard to internalize that something which has always worked in the past may no longer be the best way of doing things.

Let us take the three assumptions one at a time and see what they imply.

Memory is abundant

One of the first examples we worked with was searching untagged images. Given a database of images and a query image, the problem is to locate the top-5 images that are closest to the query. In the absence of any other information, we have to compare each image in the database with the query image and locate the best matches. This problem is parallelizable by partitioning the data across our cluster and merging the top-5 answers obtained by searching each subset. The figure below shows the results of solving the problem three different ways. In the disk-based system (shown below in purple bars), we used a small Hadoop cluster where the image data was disk resident. The in-memory version did the same, except that all images to be searched were now in memory. Note the improvement in performance from the purple to the orange bars. The cost of this improvement, of course, was the additional memory needed to hold the data set in memory (up to 124 GB for a data set of 800 million images).

But if we loosen our mental restriction about memory, we can do better. Using locality-sensitive hash indexes, we can group similar images in our data set and reduce the number of comparisons necessary to a small subset of the total database. We obtained an additional 10-20x improvement in performance by indexing (shown in the green bars below). Again, the cost was the additional memory needed to hold the index, almost 6 TB for 800 million images in our example; we would have written this solution off as "impractical" earlier, when we treated memory as a scarce resource.

[Figure: Image search comparison]
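To illustrate the indexing idea, here is a toy sketch of random-hyperplane locality-sensitive hashing over synthetic feature vectors. The feature dimension, hash length, and data are illustrative assumptions; this is not the pipeline we ran, just the shape of the memory-for-speed trade.

```python
# Trading memory for speed: a random-hyperplane LSH index groups similar
# feature vectors into buckets, so a query is compared against one bucket
# instead of the whole database. Sizes here are illustrative.
import math
import random

DIM, N_BITS = 64, 12  # feature dimension and hash length (toy values)
random.seed(0)
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_BITS)]

def lsh_key(vec):
    # One bit per hyperplane: which side of the plane the vector lies on.
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) > 0 else 0
                 for plane in planes)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

database = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(10_000)]

# Build the in-memory index: extra memory up front, far fewer comparisons
# per query afterwards.
index = {}
for i, vec in enumerate(database):
    index.setdefault(lsh_key(vec), []).append(i)

def top5(query):
    # Compare only against the query's bucket (full scan as a fallback).
    candidates = index.get(lsh_key(query), range(len(database)))
    return sorted(candidates, key=lambda i: -cosine(query, database[i]))[:5]

print(top5(database[7]))  # the best match should be image 7 itself
```

A production index would use multiple hash tables or multi-probe lookups so that near neighbors landing in adjacent buckets are not missed; the memory-for-speed trade-off is the same.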

Memory is persistent

Databases require that the data held within them be maintained in a consistent state regardless of system failures. Thus a large amount of software within databases deals with techniques to maintain consistency between the working set in processor memory and the persistent state on disk. What would happen if memory were persistent? A large fraction of the software within the database would become redundant. The figure below compares a database kernel written for persistent memory (FOEDUS: Fast Optimistic Engine for Data Unification Services) with some other in-memory database kernels. Since FOEDUS assumes that data in memory survives power failures, it no longer requires the overhead of maintaining logs on disk (although it does require logs for transactional consistency). The experiment indicates that we can speed up transactions by two orders of magnitude or more if memory becomes persistent.

[Figure: A comparison of FOEDUS versus H-Store]
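FOEDUS itself is a C++ engine; purely as an illustration of the point, the toy sketch below commits updates directly into a memory-mapped region standing in for persistent memory, with no separate disk log. The flush call stands in for a cache-line writeback; a real engine would add ordering fences and a crash-atomic commit protocol.

```python
# Toy commit path when memory itself is persistent: the committed state IS
# the in-memory state, so no separate disk log is forced before a
# transaction is acknowledged. An mmap'd file stands in for NVM here.
import mmap
import struct

STORE_PATH = "/tmp/pm_store.bin"  # illustrative stand-in for persistent memory
SLOTS, SLOT_SIZE = 1024, 16       # each slot holds one (key, value) pair

with open(STORE_PATH, "wb") as f:
    f.truncate(SLOTS * SLOT_SIZE)

backing = open(STORE_PATH, "r+b")
mem = mmap.mmap(backing.fileno(), SLOTS * SLOT_SIZE)

def commit(key, value):
    # Update in place, then flush for durability; a real engine would add
    # fences and a validity bit to make the update crash-atomic.
    off = (key % SLOTS) * SLOT_SIZE
    struct.pack_into("<QQ", mem, off, key, value)
    mem.flush()  # stands in for a cache-line writeback to NVM

def lookup(key):
    off = (key % SLOTS) * SLOT_SIZE
    k, v = struct.unpack_from("<QQ", mem, off)
    return v if k == key else None

commit(7, 99)
print(lookup(7))  # 99: the durable state lives in the mapping, not a log
```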

Memory is shared

Our third example is iterative graph processing (LSGi: Large Scale Graph inferencing). Graphs represent variables and the relationships between them directly in data structures. Large graphs are hard to partition across clusters, and applications using them have high communication overhead because, for large graphs, most accesses to the data become "remote." The Machine's Memory-Driven Computing architecture, however, allows us to hold all of the data in one place in shared memory while concurrently allowing each processor in the cluster to access all of the data in memory. Since we expect the shared memory to be slower than DRAM, the application uses the local DRAM attached to each processor as a cache, and judiciously copies data back and forth from the fabric-attached memory that is common to all processors. By shifting all data movement to memory-to-memory copies rather than transfers over Ethernet, and removing the overhead of synchronization between processes, we can speed up the processing by a factor of over 100x.
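This access pattern can be sketched with ordinary shared memory on a single machine: all of the graph's state lives in one shared segment, and each worker copies its assigned vertices into local memory, updates them, and copies the results back, memory-to-memory rather than over a network. The segment name and the update rule are illustrative stand-ins, not LSGi's actual code.

```python
# Sketch of the LSGi-style pattern: graph state lives in one shared segment;
# workers cache their vertices in local memory and write results back with
# memory-to-memory copies instead of network messages.
import struct
from multiprocessing import Process, shared_memory

N_VERTICES = 8
SEG_NAME = "graph_state"  # illustrative segment name

def worker(vertices):
    shm = shared_memory.SharedMemory(name=SEG_NAME)
    # "Cache" assigned vertices locally (local DRAM on The Machine).
    local = {v: struct.unpack_from("<d", shm.buf, v * 8)[0] for v in vertices}
    for v in vertices:
        local[v] += 1.0  # stand-in for one inference update on vertex v
    for v, val in local.items():
        struct.pack_into("<d", shm.buf, v * 8, val)  # copy results back
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(name=SEG_NAME, create=True,
                                     size=N_VERTICES * 8)
    workers = [Process(target=worker, args=(range(r * 4, (r + 1) * 4),))
               for r in range(2)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    print([struct.unpack_from("<d", shm.buf, v * 8)[0]
           for v in range(N_VERTICES)])  # all vertices updated in place
    shm.close()
    shm.unlink()
```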
[Figure: Comparison of LSGi versus GraphLab]

In conclusion

The examples above, while by no means proving that all applications will run faster on The Machine, suggest that for performance-hungry applications, significant gains are achievable with the new architecture. However, in all the cases we explored, getting these performance gains required us to let go of programming limitations we had mentally imposed on ourselves. The new architecture enabled us to trade memory for speed in our programs. It allowed us to remove code (and the overhead that came with it) whose only purpose was to maintain data in the presence of failures. Finally, it allowed us to use shared-memory concurrent programming across multiple operating systems, something that is not possible today.

Interestingly, most of the tests above were done by emulating The Machine architecture on a Superdome X—the mind shift we needed to make does not need to wait for The Machine, and the lessons we learned can be applied to problems running on large memory systems available today. 

About the Author

ssinghal

Software and Applications for The Machine