Reducing Impact of Failures in Data Centers and WANs


Student thesis: Doctoral Thesis

Award date: 9 Feb 2018


Network faults such as link or switch failures can severely degrade the performance of Internet services running on data center networks (DCNs), or cause heavy congestion and packet loss in data center WANs (DCWANs). It is therefore crucial to design methods that alleviate the impact of failures on both DCNs and inter-DC WANs. A study of related work yields the following two findings.

(i) While much effort has been devoted to failure detection, diagnosis, and mitigation/recovery, relatively little work has addressed inferring performance degradation before failures actually happen. (ii) Traffic engineering systems need considerable time to detect and react to faults, which results in slow recovery. Recent works either pre-install many backup paths at switches to ensure fast reroute, or proactively reserve bandwidth to achieve fault resiliency. Yet very few works focus on reacting to failures in the data plane without pre-installing backup paths.

This thesis addresses the above two problems, and the results are published in two papers, one for each. (i) Sibyl: a method to infer performance changes before failures actually happen in DCNs. Unlike previous works, Sibyl relies on network topology information to infer network performance under failure scenarios, without the overhead of active measurements. Specifically, we demonstrate that the most important performance metrics can be derived from two fundamental topological metrics: the shortest path and the maximum number of edge-disjoint paths. We develop efficient algorithms to obtain these two metrics by leveraging the graph automorphism properties of various DCN topologies. (ii) Kuijia: a robust traffic engineering system for inter-DC WANs that relies on a novel data-plane failover mechanism called rate rescaling. The victim flows on failed tunnels are rescaled onto the remaining tunnels and enter lower-priority queues, so that they do not impair the performance of the original flows on those tunnels.
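To make the two topological metrics concrete, the following is a minimal sketch of how they could be computed on a small topology before and after a hypothetical link failure. It is not Sibyl's algorithm: it uses plain breadth-first search and a unit-capacity augmenting-path max flow rather than the automorphism-based methods the thesis develops. The function names and the toy leaf-spine topology are assumptions made for illustration only.

```python
from collections import deque

# Hypothetical toy leaf-spine topology (an assumption, not from the thesis).
TOPOLOGY = [
    ("h1", "l1"), ("h2", "l2"),   # host-to-leaf links
    ("l1", "s1"), ("l1", "s2"),   # leaf l1 uplinks
    ("l2", "s1"), ("l2", "s2"),   # leaf l2 uplinks
]

def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) links."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def without_links(edges, failed):
    """Return the link list with the failed (undirected) links removed."""
    failed = {frozenset(e) for e in failed}
    return [e for e in edges if frozenset(e) not in failed]

def shortest_path_length(adj, src, dst):
    """Hop count of a shortest path via BFS, or None if dst is unreachable."""
    if src not in adj or dst not in adj:
        return None
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def max_edge_disjoint_paths(adj, src, dst):
    """Maximum number of edge-disjoint src-dst paths.

    Computed as a unit-capacity max flow with BFS augmenting paths; by
    Menger's theorem the flow value equals the number of disjoint paths.
    """
    if src == dst:
        raise ValueError("src and dst must differ")
    if src not in adj or dst not in adj:
        return 0
    # Each undirected link contributes residual capacity 1 in each direction.
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    flow = 0
    while True:
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return flow
        # Push one unit of flow back along the augmenting path.
        v = dst
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1

if __name__ == "__main__":
    for failed in ([], [("l1", "s1")], [("l1", "s1"), ("l1", "s2")]):
        adj = build_graph(without_links(TOPOLOGY, failed))
        print(failed,
              shortest_path_length(adj, "l1", "l2"),
              max_edge_disjoint_paths(adj, "l1", "l2"))
```

In this toy topology, failing one uplink of l1 leaves the hop count between the two leaves unchanged at 2 but reduces the number of edge-disjoint paths from 2 to 1, which illustrates why both metrics are needed to characterize the effect of a failure.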

Collectively, the main contribution of this thesis is to provide mechanisms that guarantee network reliability and, as a result, improve the quality of network services.