Modern datacenters are susceptible to various types of failures in the field, such as fail-stop, fail-partial, and Byzantine failures. Recent advances in both software and hardware have pushed the previously overlooked fail-slow failure to the fore. Fail-slow components are stuck in an intermediate state between full speed and non-functional, and tend to exhibit markedly degraded performance. Prior work and our analysis suggest that fail-slow failures are not only prevalent but also impactful in the field. In this article, we share our lessons and experiences in building a practical fail-slow detection framework for storage devices in the cloud.
Fail-Slow in the Wild
Severity
Fail-slow failures are a widespread and severe issue for datacenters. First, recent studies on storage system reliability have shown that the annual fail-slow rate, ranging from 1% to 2%, is already as high as the annual fail-stop rate [7, 9]. Second, fail-slow failures, especially transient ones, are often ignored or misinterpreted as normal performance variation. As a result, it is hard for on-site engineers to pinpoint fail-slow, much less reproduce it or reason about its root causes [3]. Most importantly, fail-slow failures can have a significant impact on I/O performance in the wild.
Hard to Detect
Accurate detection of fail-slow failures is challenging. Normal performance variations (e.g., triggered by workload bursts or SSD internal garbage collection) can generate symptoms similar to fail-slow failures. In contrast to fail-stop failures, where the criterion is well-defined (i.e., working or stopped working), there are a thousand ways to define fail-slow. On-site engineers, for example, are frequently perplexed about how slow a component must be before it is labeled fail-slow. Without a golden standard, the detection procedure becomes empirical and hence necessarily inaccurate.
Existing solutions to fail-slow detection in storage systems are impractical and inefficient for large-scale cloud deployment. First, these solutions require source code access or software modification (e.g., modifying software timeouts in IASO [9]), whereas cloud vendors like us do not touch tenants’ code. Even for in-house storage infrastructure, inserting such code segments is time-consuming, since the systems may operate dozens of internal services with varying software stacks. Second, existing solutions can only identify fail-slow failures at the node level (e.g., IASO), thus still requiring nontrivial manual effort to locate the culprit.
Design Goals of Fail-Slow Detection
From our perspective, a practical fail-slow detection framework should have the following properties.
- Non-intrusive. We, as a cloud service provider, cannot alter users’ software or require them to run modified versions of software stacks. Instead, we can only make use of external performance statistics, such as drive latency, to detect fail-slow.
- Fine-grained. Fail-slow root cause diagnosis can often be time-consuming (from days to even weeks [3, 9]). The framework is expected to directly pinpoint the culprit.
- Accurate. The framework should have satisfactory precision and recall, avoiding both unnecessary diagnosis of false positives and overlooked fail-slow drives.
- General. The framework can be quickly applied to different cloud services (like block storage and database) with minor adjustments.
Alibaba Storage System Architecture
Our data centers follow a classic distributed storage system setup. We have multiple Internet Data Centers (IDCs) across the world. Each IDC includes multiple storage clusters. Atop each cluster, a distributed file system (DFS) is deployed to support a dedicated service (e.g., block storage, NoSQL, or big data analysis). Each cluster consists of tens of racks (at most 200), and each rack contains dozens of nodes (at most 48). Each node contains dozens of storage devices (HDDs or SSDs) that are by default from the same drive model. A more detailed description of our architecture can be found in our paper [6] and open dataset [1].
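For illustration only, the storage hierarchy described above can be modeled roughly as the following sketch; the type and field names are hypothetical, not the production schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified view of the hierarchy:
# IDC -> cluster -> rack -> node -> drive (HDD or SSD).

@dataclass
class Drive:
    serial: str
    model: str            # drives within a node are by default the same model
    media: str            # "HDD" or "SSD"

@dataclass
class Node:
    node_id: str
    drives: List[Drive] = field(default_factory=list)      # dozens of drives

@dataclass
class Rack:
    rack_id: str
    nodes: List[Node] = field(default_factory=list)         # at most 48 nodes

@dataclass
class Cluster:
    service: str          # e.g., "block storage", "NoSQL", "big data analysis"
    racks: List[Rack] = field(default_factory=list)          # at most 200 racks

@dataclass
class IDC:
    region: str
    clusters: List[Cluster] = field(default_factory=list)
```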
Dataset Description
In our data centers, a monitoring daemon is placed in each node to collect operational statistics, mainly the latency and throughput of each drive. The daemons calculate the average statistics every 15 seconds and record them as time-series data entries. The daemons run three hours a day (from 9PM to 12AM). A drive generates 720 entries (= 180 min×4 entries/min) per day. In total, we have compiled around 100 billion entries as our dataset.
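A minimal sketch of one monitoring entry and the per-day arithmetic from the numbers above; the field names are hypothetical, not the released dataset schema.

```python
from dataclasses import dataclass

@dataclass
class DriveEntry:
    """One time-series entry, averaged over a 15-second window (hypothetical fields)."""
    timestamp: int          # epoch seconds at the end of the window
    drive_id: str
    latency_us: float       # average drive latency within the window
    throughput_mbps: float  # average drive throughput within the window

# Sampling arithmetic: one entry every 15 seconds during a 3-hour daily window.
ENTRIES_PER_MINUTE = 60 // 15                                        # 4
MONITORING_MINUTES = 3 * 60                                          # 180
ENTRIES_PER_DRIVE_PER_DAY = MONITORING_MINUTES * ENTRIES_PER_MINUTE  # 720
```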
We have explored multiple solutions to fail-slow detection at Alibaba. In this section, we discuss three unsuccessful attempts over the years and distill these early efforts into a series of lessons learned that serve as design guidelines for PERSEUS.
Attempt 1: Threshold Filtering
Methodology. Intuitively, we can set up a hard threshold on drive latency to identify fail-slow drives based on the Service Level Objectives (SLOs). To avoid mislabeling due to one-off events such as SSD internal Garbage Collection (GC), we further specify a minimum slowdown span for a suspicious drive to be considered fail-slow.
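As a minimal sketch of this attempt, assuming latency samples at 15-second granularity: the 45µs threshold (taken from Figure 1) and the 5-minute minimum span are placeholders, not production settings.

```python
import numpy as np

def threshold_filter(latency_us: np.ndarray,
                     threshold_us: float = 45.0,
                     min_span_entries: int = 20) -> bool:
    """Flag a drive if its latency stays above a hard SLO-derived threshold
    for at least `min_span_entries` consecutive 15-second samples.

    Placeholder values: 45 us threshold and a 5-minute span (20 entries).
    """
    over = latency_us > threshold_us
    run = 0
    for is_slow in over:
        run = run + 1 if is_slow else 0
        if run >= min_span_entries:
            return True
    return False
```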
Limitation. The detection accuracy of threshold filtering is low, as latency is highly influenced by the workload. Here, we use the latency and throughput traces of an NVMe SSD from the block storage service as an example. The left of Figure 1 illustrates the latency variation of the drive, where the horizontal dashed line indicates the threshold (45µs). The right of Figure 1 shows the corresponding throughput. Three latency spikes occur at around 21:29, 21:34, and 21:40, as the latency increases to 65µs. Comparing latency with throughput makes it clear that workload pressure causes these spikes.
Hence, the dilemma is as follows. Setting a strict (low) threshold easily mislabels normal performance variations as fail-slow events, while a relaxed (high) one could leave many fail-slow cases undiscovered. Further, fine-tuning a set of thresholds for different scenarios can be fairly time-consuming, as our experiments show that latency variation depends on both drive model and workload. In practice, we use threshold-based detection only as a fail-safe measure, similar to timeouts.
Attempt 2: Peer Evaluation
Methodology. The problem with the first attempt is the lack of an adaptive detection threshold. To address this problem, we explored the idea of peer evaluation [4, 7]. The rationale behind this approach is that, with load balancing across the distributed storage system, drives in the same node should receive similar workload pressure. Since fail-slow failures are relatively rare [3] and the majority of drives in a node should be normal, we can identify a fail-slow drive by comparing the performance of drives within the same node.
Specifically, we first calculate the node-level moving median latency every 15 seconds. We then evaluate whether any drive constantly (more than half of the time) delivers abnormal performance (in our case, latency more than twice the node median) during the monitoring window (e.g., 5 minutes). If so, the detection framework reports a fail-slow event, and the monitoring window moves forward to start the next round of evaluation.
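A sketch of this peer-evaluation rule, with the moving median simplified to a per-entry median across the node's drives; the slowdown factor of 2, the 5-minute window, and the half-of-the-window criterion follow the description above, while the remaining details are assumptions.

```python
import numpy as np

def peer_evaluate(node_latency: np.ndarray,
                  slowdown_factor: float = 2.0,
                  window_entries: int = 20,
                  min_fraction: float = 0.5) -> list:
    """Peer evaluation within one node (sketch of Attempt 2).

    node_latency: shape (num_drives, num_entries), per-drive latency at
    15-second granularity. A drive is reported for a 20-entry (5-minute)
    window if its latency exceeds `slowdown_factor` times the node median
    for more than `min_fraction` of that window.
    Returns a list of (drive_index, window_start_entry) events.
    """
    medians = np.median(node_latency, axis=0)           # node median per entry
    slow = node_latency > slowdown_factor * medians     # broadcast comparison
    events = []
    num_drives, num_entries = node_latency.shape
    for start in range(0, num_entries - window_entries + 1, window_entries):
        window = slow[:, start:start + window_entries]
        for drive in range(num_drives):
            if window[drive].mean() > min_fraction:
                events.append((drive, start))
    return events
```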
Attempt 3: IASO-Based Model
Methodology. IASO is a fail-slow detection framework focused on identifying performance-degraded nodes [9]. The design principle of IASO is to leverage software timeouts and convert them into informative metrics for benchmarking fail-slow. However, directly using IASO is not suitable for us. First, IASO requires code changes (i.e., intrusive monitoring) to insert or modify certain code snippets in the running instances (e.g., Cassandra and ZooKeeper), thus leveraging software-level timeouts to identify fail-slow incidents. Second, IASO performs node-level detection, whereas our goal is device-level. Nevertheless, we refactor IASO to the best of our ability. To avoid modifying the software, we reuse the fail-slow events reported by the peer evaluation of Attempt 2 in place of software timeouts.
Limitation. The IASO-based model delivers rather unsatisfactory performance on our assembled benchmark, with a precision of only 0.48. We suspect the main reason is that the fail-slow events reported by peer evaluation are not an effective substitute for software timeouts. We have also explored other alternatives, such as replacing software timeouts with thresholds, but the results remain unsatisfactory. Therefore, we believe IASO may not achieve our goals even with refactoring.
Guidelines for PERSEUS
Requirement 2: How to automatically derive adaptive thresholds?
Although peer evaluation in Attempt 2 can provide adaptive thresholds, it requires time-consuming tuning for different service types and drive models. With workload pressure modeled by throughput, we can instead build the latency-vs-throughput (LvT) distribution. We can then apply regression models to this distribution to define a statistically normal drive and use its upper bound as an adaptive threshold across environments.
To build such regression models, we need to determine the scope of drives included in the LvT distribution. The tradeoff is that including more samples (e.g., all drives from the service) gives more statistical confidence but also yields a more diverse distribution, from which it is difficult to derive a clear upper bound. Therefore, we plot the distribution at three different scales in Figure 3 and weigh their pros and cons; as described next, PERSEUS ultimately builds its model at the node level.
With lessons from the previous attempts, we propose PERSEUS, a non-intrusive, fine-grained, and general fail-slow detection framework. As illustrated in Figure 4, the core idea of PERSEUS is to build a polynomial regression on the node-level LvT distribution and automatically derive an adaptive threshold for each node. PERSEUS uses this threshold to formulate fail-slow events and further applies a scoreboard mechanism to single out drives with severe fail-slow failures. In this section, we discuss the design of each step at length.
Step 1: Outlier Detection
Before applying regression models, a necessary pre-processing step is to root out noisy samples (i.e., outliers). While the LvT samples (i.e., <latency, throughput> pairs) within a node are usually clustered together, entries from fail-slow drives, or entries generated during normal performance variations (e.g., internal GC), can still deviate substantially. Therefore, before building the polynomial regression model, we first screen out the outliers.
Using DBSCAN. Density-based clustering algorithms, which measure spatial distance, are promising approaches for separating the potentially distinct groups (i.e., normal vs. slow). Initially, we employ DBSCAN [10] to label outliers. In a nutshell, DBSCAN groups points that are spatially close enough, i.e., whose pairwise distances fall below a given threshold. Note that <latency, throughput> pairs from long-term or permanent fail-slow drives can be clustered together yet far away from the main cluster. Hence, we keep only the group with the most points for further modeling.
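A sketch of this filtering step using scikit-learn; the eps and min_samples values are illustrative placeholders rather than tuned production settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def dbscan_inliers(samples: np.ndarray,
                   eps: float = 0.3, min_samples: int = 20) -> np.ndarray:
    """Keep only the largest DBSCAN cluster of node-level <throughput, latency> samples.

    samples: shape (n, 2). eps / min_samples are illustrative, untuned values.
    Returns a boolean mask marking members of the most populated cluster.
    """
    scaled = StandardScaler().fit_transform(samples)   # put both axes on one scale
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
    # Label -1 marks noise. Among the real clusters, keep only the largest,
    # since long-term fail-slow drives may form their own small cluster.
    cluster_ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    if cluster_ids.size == 0:
        return np.zeros(len(samples), dtype=bool)
    return labels == cluster_ids[np.argmax(counts)]
```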
Adding PCA. Using DBSCAN alone has limited effectiveness. Here, we choose a sample node with one confirmed fail-slow drive to illustrate the limitation. In Figure 5a, we apply a fine-tuned DBSCAN (outliers shown as red points) to the node’s daily raw dataset and fit the remaining data (grey points) with a polynomial regression (fitted curve in blue, 99.9% prediction upper bound as a green dashed line). In this example, DBSCAN identifies only 63.83% of the slow entries.
The root cause is that throughput and latency are positively correlated, so the samples (i.e., <latency, throughput> pairs) are skewed along a particular direction. As a result, outliers (i.e., samples from fail-slow drives) can be mislabeled as inliers (see the black circle in Figure 5a). Therefore, we leverage Principal Component Analysis (PCA [2]) to transform the coordinates and penalize deviations perpendicular to the skewed direction, which reduces mislabeling. With PCA, DBSCAN effectively detects 92.55% of the slow entries (see Figure 5b).
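One way to realize this combination, assuming PCA whitening as the penalty mechanism (our interpretation of the step, not the exact production code): whitening rescales each principal component to unit variance, so deviations perpendicular to the skewed direction become relatively larger and easier for DBSCAN to isolate.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

def dbscan_with_pca(samples: np.ndarray,
                    eps: float = 0.3, min_samples: int = 20) -> np.ndarray:
    """DBSCAN on PCA-whitened LvT samples (illustrative parameters).

    After whitening, a point that deviates perpendicular to the correlated
    latency-throughput direction sits proportionally farther from the main
    cluster, reducing the mislabeling shown in Figure 5a.
    Returns a boolean inlier mask (members of the largest cluster).
    """
    transformed = PCA(n_components=2, whiten=True).fit_transform(samples)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(transformed)
    cluster_ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    if cluster_ids.size == 0:
        return np.zeros(len(samples), dtype=bool)
    return labels == cluster_ids[np.argmax(counts)]
```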
Usage of outliers. In Requirement 3, we discussed that binary detection alone cannot reflect the extent of a slowdown. Therefore, we do not use the binary results of outlier detection directly, such as simply labeling outliers as slow entries (i.e., skipping Step 2) or as fail-slow events (i.e., skipping Steps 2 and 3). Rather, we exclude the outliers to build a better-fitted model for measuring the degree of slowdown of each entry.
Step 2: Regression Model
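Based on the details given elsewhere in this article (a polynomial fit on the filtered node-level LvT samples with a 99.9% prediction upper bound, as in Figure 5), the regression step might look roughly like the sketch below; the polynomial degree and the simplified form of the upper bound are our assumptions.

```python
import numpy as np
from scipy import stats

def fit_upper_bound(inliers: np.ndarray,
                    degree: int = 3, confidence: float = 0.999):
    """Fit latency as a polynomial function of throughput on the inlier LvT
    samples and return a callable for an approximate prediction upper bound.

    inliers: shape (n, 2), columns <throughput, latency>. Assumptions for
    illustration: a degree-3 polynomial and an upper bound of
    fit + z * residual_std instead of a full per-point prediction interval.
    """
    throughput, latency = inliers[:, 0], inliers[:, 1]
    coeffs = np.polyfit(throughput, latency, deg=degree)   # least-squares fit
    residuals = latency - np.polyval(coeffs, throughput)
    resid_std = np.std(residuals, ddof=degree + 1)
    z = stats.norm.ppf(confidence)                          # one-sided 99.9% quantile

    def upper_bound(tp: np.ndarray) -> np.ndarray:
        # Adaptive, workload-aware latency threshold for this node.
        return np.polyval(coeffs, tp) + z * resid_std

    return upper_bound
```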
Step 3: Identifying Fail-Slow Events
Formulating events. First, we calculate a time series of the Slowdown Ratio (SR), obtained by dividing the drive latency by the prediction upper bound entry by entry (every 15 seconds). We then formulate fail-slow events using a sliding window (similar to Attempt 2). The sliding window has a fixed length (i.e., a minimum span) and starts at the first entry. Within the span, if a sufficient proportion of the SR series has a median SR value exceeding a threshold, PERSEUS records that the drive has encountered a fail-slow event within the span and then checks whether the event should be extended to the next entry. In practice, we only formulate fail-slow events from a persistent series of slowdown entries, as one-off spikes are likely acceptable performance variations. In Figure 6b, while both (1) and (2) have high SR values, only (2) would be marked as a fail-slow event.
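A sketch of the event formulation under assumed parameters: a 20-entry (5-minute) minimum span, an SR threshold of 1.0 (latency above the upper bound), and a simple "fraction of slow entries" trigger standing in for the proportion-of-median rule above, whose exact form is not spelled out here.

```python
import numpy as np

def formulate_events(sr: np.ndarray,
                     sr_threshold: float = 1.0,
                     span_entries: int = 20,
                     min_fraction: float = 0.5) -> list:
    """Turn a per-drive slowdown-ratio (SR) series into fail-slow events.

    sr: drive latency divided by the prediction upper bound, one value per
    15-second entry. The span length, SR threshold, minimum fraction, and
    extension rule are placeholders for illustration.
    Returns a list of (start_entry, end_entry) half-open intervals.
    """
    events = []
    start, n = 0, len(sr)
    while start + span_entries <= n:
        window = sr[start:start + span_entries]
        if np.mean(window > sr_threshold) >= min_fraction:
            end = start + span_entries
            # Extend the event entry by entry while the slowdown persists.
            while end < n and sr[end] > sr_threshold:
                end += 1
            events.append((start, end))
            start = end
        else:
            start += 1
    return events
```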
Step 4: Risk Score
To quantify the severity of fail-slow, we adopt the idea of a risk score mechanism from performance regression testing [5]. First, slowdown duration and severity are classified into different risk levels, as illustrated in Table 1. For example, an extreme risk of fail-slow corresponds to drives with a long-term daily slowdown span (over 120 minutes) and severe slowness (SR ≥ 5). A per-drive risk score is then calculated by assigning different weights to the risk levels.
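Since Table 1's exact levels and weights are not reproduced here, the sketch below uses hypothetical level boundaries and weights purely to illustrate the weighted-scoring idea; only the "extreme" row (over 120 minutes of daily slowdown with SR ≥ 5) comes from the text.

```python
# Hypothetical risk levels: (name, min daily slowdown minutes, min SR, weight).
# Only the "extreme" boundaries are taken from the text; the rest, and the
# weights, are illustrative stand-ins for Table 1.
RISK_LEVELS = [
    ("extreme", 120, 5.0, 8.0),
    ("high",     60, 3.0, 4.0),
    ("medium",   30, 2.0, 2.0),
    ("low",      10, 1.5, 1.0),
]

def daily_risk_weight(slowdown_minutes: float, slowdown_ratio: float) -> float:
    """Map one day's slowdown duration and severity to a risk-level weight."""
    for _, min_minutes, min_sr, weight in RISK_LEVELS:
        if slowdown_minutes >= min_minutes and slowdown_ratio >= min_sr:
            return weight
    return 0.0

def risk_score(daily_stats: list) -> float:
    """Accumulate weighted daily scores over a few days (illustrative).

    daily_stats: list of (slowdown_minutes, slowdown_ratio) per day.
    """
    return sum(daily_risk_weight(minutes, sr) for minutes, sr in daily_stats)
```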
If a drive’s risk scores exceed a minimum value within a few days, the drive is recommended for immediate isolation and hardware inspection. Note that all drives in our fleet, both HDDs and SSDs, use the same scoring mechanism.
We have assembled the first large-scale, clearly labeled public dataset of real-world operational traces aimed at fail-slow detection. The dataset has been released to the public for fail-slow study [1]. Based on this dataset, we extensively evaluate the previous attempts and PERSEUS. We adopt three evaluation metrics: precision, recall, and the Matthews Correlation Coefficient (MCC [8]). Table 2 summarizes the results. Clearly, PERSEUS outperforms all previous attempts. The high precision and recall indicate that PERSEUS detects all fail-slow drives while rarely mislabeling normal drives as fail-slow. Therefore, we conclude that PERSEUS achieves our design goals as a fine-grained (per-drive), non-intrusive (no code changes), general (same set of parameters for different setups), and accurate (high precision and recall) fail-slow detection framework.
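For reference, the three metrics follow their standard confusion-matrix definitions, sketched below; this is generic bookkeeping rather than code from the evaluation itself.

```python
import math

def precision_recall_mcc(tp: int, fp: int, fn: int, tn: int):
    """Standard precision, recall, and Matthews Correlation Coefficient (MCC)
    computed from confusion-matrix counts of fail-slow detection."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc
```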
In this article, we first share our unsuccessful attempts at developing robust and non-intrusive fail-slow detection for large-scale storage systems. We then introduce the design of PERSEUS, which combines classic machine learning techniques with a scoring mechanism to achieve effective fail-slow detection. Since deployment, PERSEUS has covered around 250K drives and successfully identified 304 fail-slow drives. We have released our dataset as an open benchmark dedicated to fail-slow drive detection.