TY - GEN
T1 - GraphTrek
T2 - IEEE International Conference on Cluster Computing, CLUSTER 2015
AU - Dai, Dong
AU - Carns, Philip
AU - Ross, Robert B.
AU - Jenkins, John
AU - Blauer, Kyle
AU - Chen, Yong
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/10/26
Y1 - 2015/10/26
N2 - Property graphs are a promising data model for rich metadata management in high-performance computing (HPC) systems because of their ability to represent not only metadata attributes but also the relationships between them. A property graph can be used to record the relationships between users, jobs, and data, for example, with unique annotations for each entity. This high-volume, power-law distributed use case is a natural fit for an out-of-core distributed property graph database. Such a system must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenance data mining). Large-scale property graph traversals are particularly challenging for distributed graph databases, however. Most existing graph databases implement a 'level-synchronous' breadth-first search algorithm that relies on global synchronization in each traversal step. This traversal model performs well in many problem domains, but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers. We define stragglers as abnormally slow steps (or servers) in a graph traversal that lead to low overall throughput for synchronous traversal algorithms. The straggler problem can be mitigated by the use of asynchronous traversal algorithms. Asynchronous traversal has been successfully demonstrated in graph processing frameworks, but such systems require the graph to be loaded into a separate batch-processing framework. In this work, we propose GraphTrek, a general asynchronous graph traversal engine working with graph databases for processing rich metadata management in their native format. We also outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed.
AB - Property graphs are a promising data model for rich metadata management in high-performance computing (HPC) systems because of their ability to represent not only metadata attributes but also the relationships between them. A property graph can be used to record the relationships between users, jobs, and data, for example, with unique annotations for each entity. This high-volume, power-law distributed use case is a natural fit for an out-of-core distributed property graph database. Such a system must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenance data mining). Large-scale property graph traversals are particularly challenging for distributed graph databases, however. Most existing graph databases implement a 'level-synchronous' breadth-first search algorithm that relies on global synchronization in each traversal step. This traversal model performs well in many problem domains, but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers. We define stragglers as abnormally slow steps (or servers) in a graph traversal that lead to low overall throughput for synchronous traversal algorithms. The straggler problem can be mitigated by the use of asynchronous traversal algorithms. Asynchronous traversal has been successfully demonstrated in graph processing frameworks, but such systems require the graph to be loaded into a separate batch-processing framework. In this work, we propose GraphTrek, a general asynchronous graph traversal engine working with graph databases for processing rich metadata management in their native format. We also outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed.
KW - Graph Traversal
KW - Parallel File Systems
KW - Property Graph
KW - Property Graph Databases
KW - Rich Metadata Management
UR - http://www.scopus.com/inward/record.url?scp=84959302899&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2015.48
DO - 10.1109/CLUSTER.2015.48
M3 - Conference contribution
AN - SCOPUS:84959302899
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 284
EP - 293
BT - Proceedings - 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 8 September 2015 through 11 September 2015
ER -