Algorithmic Foundations of Distributed Large-Scale Graph Processing
DescriptionAchieving efficient computation on massive graph data sets is a fundamental challenge with numerous applications, ranging from social network analysis and machine learning to protein interaction graphs studied in computational biology. The scale of such massive graphs, which often exceeds the memory and bandwidth capabilities of a single machine has prompted practitioners to develop distributed graph processing frameworks such as Google Pregel, Apache Spark/GraphX and Giraph. The overarching goal of this project is to develop the theoretical foundations of algorithms for efficient distributed computation on massive graphs in computing models that capture the main features of real-world graph processing frameworks. This includes the design of new distributed algorithms that scale to massive graphs for problems such as computing shortest distances between vertices, computing maximal independent sets, and finding sparse subgraphs (i.e. with few edges) that exhibit certain properties. Given that real-world data sets are continuously evolving, we also plan to derive new algorithms that can handle dynamically-changing input graphs. The obtained algorithms will be experimentally evaluated by implementing them in open source graph processing frameworks, which will have a tangible impact on industrial practice.A crucial challenge for performing fast distributed computation on large graphs is to come up with algorithms that do not send too many messages, i.e., are communication-efficient. We will explore new algorithmic techniques to overcome this obstacle and we anticipate that this also provides insights on designing communication- and time-efficient solutions for classic distributed computing models, where each graph vertex represents a computational entity.A complementary goal of the project is to understand how much communication a distributed graph algorithm needs to perform, while still being able to quickly arrive at a correct solution. More specifically, this work will reveal the complexity lower bounds on the trade-offs between time and the required communication in distributed computing models, which is still an open question for many fundamental graph problems. To place our results in a broader context, we will investigate the complexity-theoretic foundations of problems in the large-scale graph setting by developing complexity classes and providing simulations between different computational models for massive data sets.
|Effective start/end date||1/01/21 → …|