Cray Graph Engine takes on a trillion triples
A paper submitted by a team of Cray engineers received the 2018 Best Paper Award for “Loading and Querying a Trillion RDF Triples with Cray Graph Engine on the Cray XC.” Learn why.
One trillion. That’s quite a large number. It’s 132 times the number of people on Earth; five times the number of tweets posted in a year on Twitter; the number of search engine queries Google serves in five months; and now equivalent to the number of triples we’ve loaded, inferenced, and queried using the Cray Graph Engine.
We know Cray isn’t your only option for graph database technology capable of loading a trillion-triple database. Take a look at the W3C wiki page on large triplestores, where Oracle and Franz Inc. have posted trillion-triple results, and at Cambridge Semantics, which announced a similar benchmark.
But I can say with pride that we are the best, and I’ll use the rest of this post to support that claim.
First, I’ll admit that absent any widely recognized reporting standard (like those published by the Transaction Processing Performance Council), comparing benchmark results in the semantic graph field can be a bit of an “apples to oranges” exercise; there is no single agreed-upon measure of relative value. And the infrastructures themselves vary widely. Oracle’s results are measured using an integrated database appliance, Cambridge Semantics’ results are on a Google Cloud infrastructure, and our results are achieved using a combination of a Cray XC series system and Lustre®-based ClusterStor storage.
Despite the differences in infrastructure, the results are impressive
Here is a comparison between our results and previously published results from Oracle and Cambridge Semantics:
The results reported in the paper were obtained using the Lehigh University Benchmark (LUBM) for semantic web repositories. LUBM has become a de facto standard for graph database benchmarks and describes the structure of a university with departments, courses, and students. By using the 5500K scale, our team generated an artificial dataset representing 5,500,000 universities.
I can’t help but note that the aforementioned results reported by Oracle and Cambridge Semantics used a smaller 4400K scale (meaning they approximated the data for 4,400,000 universities). Why the difference? Our inferencing engine is more efficient, so to get to the magic 1 trillion mark, we needed to start with a larger initial dataset.
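For readers new to the term, inferencing derives additional triples from the loaded ones by applying ontology rules, which is why the final triple count depends on both the starting dataset and the inference step. Below is a minimal sketch of one such rule, assuming a tiny hand-written dataset. The class and instance names are hypothetical, and this is not a description of how CGE implements inference.

```python
# Minimal sketch of rule-based inferencing over triples (not CGE's implementation).
# The class and instance names below are hypothetical.

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def infer_types(triples):
    """Apply the RDFS subclass rule until no new triples appear.

    Rule: (x rdf:type C) and (C rdfs:subClassOf D)  =>  (x rdf:type D)
    """
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot the current facts so we can add to `closed` while looping.
        subclass_pairs = [(s, o) for s, p, o in closed if p == SUBCLASS]
        typed = [(s, o) for s, p, o in closed if p == TYPE]
        for instance, cls in typed:
            for child, parent in subclass_pairs:
                if cls == child:
                    new = (instance, TYPE, parent)
                    if new not in closed:
                        closed.add(new)
                        changed = True
    return closed

explicit = {
    ("ex:FullProfessor", SUBCLASS, "ex:Professor"),
    ("ex:Professor", SUBCLASS, "ex:Faculty"),
    ("ex:alice", TYPE, "ex:FullProfessor"),
}
closed = infer_types(explicit)
inferred = closed - explicit
print(len(explicit), "explicit triples,", len(inferred), "inferred")
```

In this toy case, three explicit triples yield two inferred ones. The ratio of inferred to explicit triples depends on the rule set and how the engine materializes results, so two systems can start from different dataset scales and still arrive at a similar total.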
So, let’s look at some of the good stuff
- Data loading performance: 177 million quads loaded and indexed per second
- Inference performance: 501 million quads inferenced per second
- SPARQL query performance: 96.3 seconds to run all LUBM queries
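To get a feel for what these rates mean in wall-clock terms, here is a back-of-the-envelope sketch. The two rates are the ones listed above; the quad count is a hypothetical round number used only for scale, not the exact count loaded or inferred in the benchmark.

```python
# Back-of-the-envelope conversion of the reported rates into wall-clock time.
# QUAD_COUNT is a hypothetical round number, not a figure from the benchmark.

LOAD_RATE = 177e6    # quads loaded and indexed per second (reported)
INFER_RATE = 501e6   # quads inferenced per second (reported)
QUAD_COUNT = 1e12    # hypothetical: one trillion quads

def seconds_at(rate, count=QUAD_COUNT):
    """Time in seconds to process `count` quads at a sustained `rate`."""
    return count / rate

load_minutes = seconds_at(LOAD_RATE) / 60
infer_minutes = seconds_at(INFER_RATE) / 60
print(f"load:  ~{load_minutes:.0f} min")
print(f"infer: ~{infer_minutes:.0f} min")
```

At a sustained 177 million quads per second, a trillion quads would take roughly an hour and a half to load and index; at 501 million per second, about half an hour to run inference over.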
Astute readers may notice the slower load time we report versus the load time reported by Cambridge Semantics. This is a perfect example of an apples-to-oranges comparison related to the benchmark system setup. In our case, the initial data generated by the lubm-uba data generator was written to a Lustre file system, spread across 96 files of 1.4 TB each. In the case of the Cambridge Semantics benchmark, the LUBM 4400K data was generated locally and stored on SSDs in each of the nodes, with loading time measured as the time to load from the local SSDs to memory simultaneously. We believe that a more realistic load time for Cambridge Semantics would have included the time required to copy the data to the SSDs in the first place.
But seriously … 96 seconds for all 14 LUBM queries is outstanding. This is nearly an order of magnitude faster than the Cambridge Semantics results!
And to put the trillion-triple dataset in perspective, the team also loaded a graph database that linked public datasets commonly used in life sciences and systems biology, where a typical workflow for researchers might be: perform searches in one of the databases, construct queries for another database, and iterate.
The union of these databases pales in comparison to a 1T-triple dataset (0.05T triples combined), but it highlights the benefit of using large lab-local datasets. The fast CGE load times highlighted in the paper make it practical to refresh the working knowledge set many times a day. Furthermore, combining all those datasets (as named graphs) in one CGE instance makes complex cross-database queries feasible, reducing time to insight.
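The idea of named graphs can be illustrated with a small sketch: each statement becomes a quad whose fourth element names the source dataset, and a cross-database query is a join across graphs on a shared identifier. The graph names, predicates, and identifiers below are hypothetical, and the code is plain Python rather than SPARQL or CGE; in SPARQL the same join would typically be written with `GRAPH` patterns.

```python
# Minimal sketch of quads and a cross-graph join (not CGE or SPARQL itself).
# Graph names, predicates, and identifiers are hypothetical.

from collections import defaultdict

# Each quad is (subject, predicate, object, graph).
quads = [
    ("gene:G1", "encodes", "protein:P1", "graph:genes"),
    ("gene:G2", "encodes", "protein:P2", "graph:genes"),
    ("protein:P1", "targetedBy", "drug:D7", "graph:drugs"),
    ("protein:P2", "targetedBy", "drug:D9", "graph:drugs"),
    ("protein:P2", "targetedBy", "drug:D7", "graph:drugs"),
]

def match(predicate, graph):
    """Return (subject, object) pairs for one predicate within one named graph."""
    return [(s, o) for s, p, o, g in quads if p == predicate and g == graph]

def genes_by_drug():
    """Join the two graphs on the shared protein identifier."""
    targets = defaultdict(set)
    for protein, drug in match("targetedBy", "graph:drugs"):
        targets[protein].add(drug)
    result = defaultdict(set)
    for gene, protein in match("encodes", "graph:genes"):
        for drug in targets[protein]:
            result[drug].add(gene)
    return {drug: sorted(genes) for drug, genes in result.items()}

print(genes_by_drug())
```

Keeping both datasets in one store means the join happens in a single query, replacing the search-one-database, query-the-next loop described above.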
Behind the benchmark results is a focused effort by the Cray R&D team to improve the overall performance of the Cray Graph Engine. Our next release, CGE 3.2UP01, will include the enhancements that made these results possible.
This blog was originally published on cray.com and has been updated and published here on HPE’s Advantage EX blog.
Paul Hahn
Hewlett Packard Enterprise
twitter.com/hpe_hpc
linkedin.com/showcase/hpe-ai/
hpe.com/info/hpc