Diagnosing Application-network Anomalies for Millions of IPs in Production Clouds

Authors: 

Zhe Wang, Shanghai Jiao Tong University; Huanwu Hu, Alibaba Cloud; Linghe Kong, Shanghai Jiao Tong University; Xinlei Kang and Teng Ma, Alibaba Cloud; Qiao Xiang, Xiamen University; Jingxuan Li and Yang Lu, Alibaba Cloud; Zhuo Song, Shanghai Jiao Tong University and Alibaba Cloud; Peihao Yang, Alibaba Cloud; Jiejian Wu, Shanghai Jiao Tong University; Yong Yang and Tao Ma, Alibaba Cloud; Zheng Liu, Alibaba Cloud and Zhejiang University; Xianlong Zeng and Dennis Cai, Alibaba Cloud; Guihai Chen, Shanghai Jiao Tong University

Abstract: 

Timely detection and diagnosis of application-network anomalies is a key challenge of operating large-scale production clouds. We reveal three practical issues in the cloud-native era. First, impact assessment of anomalies at the (micro)service level is absent from currently deployed monitoring systems. Ping systems are oblivious to the "actual weights" of application traffic, e.g., traffic volume and the number of connections/instances. Failures of critical (micro)services with large weights can easily be overlooked by probing systems amid prevalent network jitter. Second, the efficiency of anomaly routing (to the responsible application/network team) remains low when multiple attribution teams are involved. Third, collecting fine-grained metrics at the (micro)service level incurs considerable computational/storage overheads, yet it is indispensable for accurate impact assessment and anomaly routing.

We introduce the application-network diagnosing (AND) system in Alibaba Cloud. AND exploits a single metric, TCP retransmissions (retxs), to capture anomalies at the (micro)service level and to correlate applications with networks end-to-end. To resolve deployment challenges, AND further proposes three core designs: (1) a collection tool that filters and aggregates massive retxs at the (micro)service level, (2) a real-time detection procedure that extracts anomalies from millions of ‘noisy’ retx time series, and (3) an anomaly routing model that delimits anomalies among multiple target teams/scenarios. AND has been deployed in Alibaba Cloud for over three years and enables minute-level anomaly detection/routing and fast failure recovery.
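
As a rough illustration of the idea of aggregating retxs per (micro)service and flagging unusual minutes, the following Python sketch may help. It is not AND's implementation: the event format `(timestamp_s, service, retx_count)`, the function names `aggregate_retx` and `detect_anomalies`, and the median/MAD threshold with parameters `k` and `min_history` are all assumptions made for this example, since the abstract does not specify the detection procedure.

```python
from collections import defaultdict
from statistics import median


def aggregate_retx(events, window_s=60):
    """Sum retx counts per service and per time window.

    `events` is an iterable of (timestamp_s, service, retx_count) tuples;
    this record layout is hypothetical, chosen only for the sketch.
    """
    series = defaultdict(lambda: defaultdict(int))
    for ts, service, count in events:
        series[service][int(ts // window_s)] += count
    return series


def detect_anomalies(series, k=5.0, min_history=5):
    """Flag windows whose retx count far exceeds the service's own history.

    Uses a median + k * MAD threshold (an assumed, generic robust rule,
    not the paper's method). Windows with no events are absent from the
    series and therefore do not contribute to the baseline.
    """
    anomalies = []
    for service, buckets in series.items():
        windows = sorted(buckets)
        for i, w in enumerate(windows):
            history = [buckets[x] for x in windows[:i]]
            if len(history) < min_history:
                continue  # not enough past windows to form a baseline
            med = median(history)
            mad = median([abs(v - med) for v in history])
            # Floor the MAD at 1 so a perfectly flat history does not
            # produce a zero-width threshold.
            threshold = med + k * max(mad, 1.0)
            if buckets[w] > threshold:
                anomalies.append((service, w, buckets[w]))
    return anomalies
```

Each service is compared only against its own past windows, which is one simple way to keep a busy service's normal retx volume from masking a spike in a quieter one.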