Stochastic Resource Optimization for Coded Distributed Computing Systems

Student thesis: Doctoral Thesis

Abstract

Distributed computing has become fundamental in modern network architectures, where master nodes allocate tasks to worker nodes and aggregate the computed results. In practical systems, such as edge computing networks, challenges including heterogeneity, uncertainty, and resource constraints frequently arise. This thesis explores coded computing, a framework that leverages erasure coding techniques, to optimize distributed computing systems by reducing system delay and addressing the resulting energy-delay tradeoffs across a variety of applications.

For the primary goal of delay reduction, each worker is assumed to have a random task completion time to capture the system's time-varying properties. By using erasure coding, reliable computation is achieved by utilizing only the fastest subset of workers, treating delayed results as erasures. Recognizing that real-world networks impose diverse constraints, the thesis considers two distinct practical scenarios involving uncertainty and randomness, each naturally addressed by stochastic optimization methods. The primary distinction between these scenarios lies in the extent of the master's prior knowledge of the workers' delay distributions: specifically, whether the master knows the type of delay distribution, even if its parameters remain unknown.
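The straggler-mitigation principle above can be illustrated with a minimal sketch: under an (n, k) erasure code, the master needs results from only the k fastest of n workers, so the job finishes at the k-th order statistic of the worker delays rather than at the maximum. The exponential delay model and all parameter values below are illustrative assumptions, not taken from the thesis.

```python
import random

def coded_completion_time(delays, k):
    """Completion time of an (n, k)-coded job: the master decodes once
    the k fastest of n workers respond, treating stragglers as erasures."""
    return sorted(delays)[k - 1]

random.seed(0)
n, k = 10, 6
# hypothetical i.i.d. exponential worker delays (rate 1.0)
delays = [random.expovariate(1.0) for _ in range(n)]
uncoded = max(delays)                    # uncoded: wait for every worker
coded = coded_completion_time(delays, k)  # coded: wait for the fastest k
assert coded <= uncoded
```

With k < n the coded completion time never exceeds the uncoded one, which is the basic delay benefit the thesis builds on.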

In the first scenario, termed the model-based uncertain case, the master is aware of the distribution type for each worker's delay. With a limitation on the number of workers that can be selected due to offloading cost concerns, a two-step online decision-making process is introduced. First, Thompson Sampling (TS) is used to identify the group of fastest workers, and subsequently, a proportional-greedy algorithm allocates workloads among them based on the estimated delay parameters. In the second scenario, identified as the model-free uncertain case, the type of the delay distribution is unknown. With relaxed restrictions on the number of selected workers, a grey-box Bayesian Optimization (BO) approach is proposed. This method exploits the affine relationship between workers' response times and their assigned workloads, using dimension reduction techniques to efficiently find suboptimal solutions. The observed affine relationship further enables the extension of these methods to Federated Learning (FL), thereby underscoring their applicability in diverse distributed computing systems beyond coded computing.
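For the model-based case, a minimal sketch of the two-step idea follows: Thompson Sampling maintains a posterior over each worker's delay parameter, samples from it, and selects the workers whose sampled rates are highest (i.e. appear fastest). The Gamma-exponential conjugate model, the worker rates, and all names below are illustrative assumptions standing in for whatever delay family and algorithm details the thesis actually uses.

```python
import random

def thompson_select(alpha, beta, m):
    """Sample a delay rate for each worker from its Gamma(alpha, beta)
    posterior and pick the m workers with the largest sampled rate."""
    sampled = [random.gammavariate(a, 1.0 / b) for a, b in zip(alpha, beta)]
    return sorted(range(len(alpha)), key=lambda i: -sampled[i])[:m]

def update(alpha, beta, i, observed_delay):
    """Conjugate Gamma update after observing one exponential delay."""
    alpha[i] += 1.0
    beta[i] += observed_delay

random.seed(1)
n, m = 5, 2
true_rate = [0.5, 1.0, 2.0, 4.0, 8.0]  # hypothetical; higher = faster
alpha = [1.0] * n                       # Gamma shape (prior)
beta = [1.0] * n                        # Gamma rate (prior)
for _ in range(2000):
    for i in thompson_select(alpha, beta, m):
        update(alpha, beta, i, random.expovariate(true_rate[i]))
# after exploration, the posterior mean rate alpha[i]/beta[i] should
# rank the genuinely fast workers (here workers 3 and 4) highest
```

A proportional allocation in the spirit described above could then assign each selected worker a workload proportional to its estimated rate, so that faster workers receive more tasks.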

While coding effectively reduces delays, it can also lead to increased energy consumption and under-utilization of slower workers, raising the concern of an energy-delay tradeoff. To tackle this complex challenge, the thesis examines two additional scenarios under systems with known delay-related parameters. The first scenario focuses on a homogeneous stochastic system in which workers share identical delay distributions, task capacity constraints, and power coefficients; this homogeneity simplifies the analysis. The second scenario considers heterogeneous worker properties in a system with deterministic delays. For both cases, the energy-delay tradeoff is modeled according to the system-specific characteristics, and optimal solutions are derived. For large-scale systems with a high number of workers or tasks, approximation problems are formulated, yielding suboptimal solutions with reduced computational complexity.
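The tradeoff described above can be seen in a small Monte-Carlo sketch: lowering k (more redundancy) shrinks each coded job's delay but inflates the total work launched across all n workers, a simple proxy for energy. The delay model, the energy proxy, and every parameter below are illustrative assumptions, not the thesis's actual system model.

```python
import random

def simulate(n, k, total_tasks=600, rate=1.0, trials=2000, seed=0):
    """Monte-Carlo estimate of (mean delay, energy proxy) for an (n, k)
    scheme: each worker computes total_tasks/k tasks, the master waits
    for the k fastest, and energy scales with the work on all n workers."""
    rng = random.Random(seed)
    load = total_tasks / k
    delay = 0.0
    for _ in range(trials):
        times = sorted(load * rng.expovariate(rate) for _ in range(n))
        delay += times[k - 1]            # k-th order statistic
    return delay / trials, n * load      # (mean delay, energy proxy)

# more redundancy (smaller k) cuts delay but raises the energy proxy
d_low_red, e_low_red = simulate(10, 9)
d_high_red, e_high_red = simulate(10, 5)
assert d_high_red < d_low_red and e_high_red > e_low_red
```

Sweeping k between 1 and n traces out the delay-energy frontier that the optimization problems in this part of the thesis navigate.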

In summary, this thesis studies resource optimization in distributed coded computing by developing practical schemes for delay reduction and energy-delay tradeoffs. The proposed approaches are thoroughly validated through simulations and numerical studies, demonstrating their effectiveness.
Date of Award: 29 Aug 2025
Original language: English
Awarding Institution:
  • City University of Hong Kong
Supervisor: Chi Wan SUNG (Supervisor)
