CauseInfer: Automated End-to-End Performance Diagnosis with Hierarchical Causality Graph in Cloud Environment



Abstract: Modern computing systems, especially cloud-based and cloud-centric systems, typically consist of a large number of components running in distributed environments with complicated interactions. They are vulnerable to performance problems caused by highly dynamic runtime changes (e.g., overload and resource contention) or software bugs (e.g., memory leaks). Unfortunately, it is notoriously difficult to diagnose the root causes of these performance problems at a fine granularity, due to the complicated interactions and the large cardinality of the potential cause set. In this paper, we build an automated, black-box, end-to-end cause inference system named CauseInfer to pinpoint the root causes, or at least provide useful hints. CauseInfer automatically maps a distributed system to a two-layer hierarchical causality graph and infers the root causes along the causal paths in that graph. CauseInfer models fault propagation paths explicitly and requires no instrumentation of the running production system, which makes it more effective and practical than previous approaches. Experimental evaluations on two benchmark systems show that CauseInfer identifies root causes with high accuracy. Compared to several state-of-the-art


approaches, CauseInfer achieves over 10% improvement. Moreover, CauseInfer is lightweight and flexible enough to readily scale out in large distributed systems. With CauseInfer, the mean time to recovery (MTTR) of cloud systems can be significantly reduced.

Existing system: In these causal relations, two variables are not allowed to impact each other; therefore, all the causal relations can be encoded by a Directed Acyclic Graph (DAG). In a DAG, a node represents a particular variable and an edge represents a causal relationship. A rigorous causality graph requires a large number of intervention experiments to determine which variable is a cause and which is an effect, namely the "do-calculus", which is intractable in practical systems. Therefore, we rely on observations and assumptions to form a theory of causal interactions: if a change in one variable is regarded as an intervention, the causal relation can be constructed. Note that we make no distinction between "dependency" and "causality" from the viewpoint of "cause-effect" in this paper.

Proposed system: Before describing the system, we give a short introduction to causality. Different from association and correlation, causality represents a direct "cause-effect" relation; the formal definition is described in Pearl's work. Advantages: Because a rigorous causality graph built by intervention experiments is intractable in practical systems, CauseInfer constructs the causal relations from passive observations and assumptions alone. Disadvantages: One problem stays unresolved: the transmission between client and server is bidirectional, meaning that we may obtain opposite service dependencies when observing from different hosts.
For instance, when observing on host 192.168.1.117, we get a service dependency in one direction.
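As a concrete illustration of the observation-based graph construction described above, a PC-style procedure starts from a complete undirected graph over the metrics and deletes an edge whenever the two metrics become (conditionally) independent. The sketch below is a simplified variant that conditions on at most one other variable and uses partial correlation against a fixed threshold; it is not CauseInfer's actual implementation, and the threshold value is an assumption.

```python
import numpy as np
from itertools import combinations

def partial_corr(data, i, j, k):
    """Correlation of columns i and j; if k is given, the partial
    correlation after regressing column k out of both."""
    if k is None:
        return np.corrcoef(data[:, i], data[:, j])[0, 1]

    def residual(y, x):
        # Remove the linear effect of x from y.
        beta = np.polyfit(x, y, 1)
        return y - np.polyval(beta, x)

    ri = residual(data[:, i], data[:, k])
    rj = residual(data[:, j], data[:, k])
    return np.corrcoef(ri, rj)[0, 1]

def skeleton(data, thresh=0.1):
    """Build the undirected skeleton of a causality graph: keep edge
    (i, j) unless some (possibly empty) conditioning set renders the
    two variables independent (|partial correlation| < thresh)."""
    n_vars = data.shape[1]
    edges = set(combinations(range(n_vars), 2))
    for i, j in list(edges):
        conds = [None] + [k for k in range(n_vars) if k not in (i, j)]
        if any(abs(partial_corr(data, i, j, k)) < thresh for k in conds):
            edges.discard((i, j))
    return edges
```

On data generated from a chain x0 -> x1 -> x2, conditioning on x1 removes the spurious x0-x2 edge while the two true edges survive, which is exactly the pruning behavior the causality graph relies on.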


To diagnose performance problems caused by co-located services, we integrate their system metrics into this algorithm. The length of the training data is set to 200 in this paper because 200 data points are enough to build a precise causality graph; more data would be better, but also means more computational overhead. Modules: Evaluation Methodology: Due to the lack of a real production system, CauseInfer is evaluated only in a controlled cloud system; however, we believe it works the same way in real-world systems. The controlled system contains five physical server machines hosting the benchmarks and four client machines generating the workload. Each physical server machine has an 8-core 2.1 GHz Xeon CPU, 16 GB of memory, and one Gigabit NIC, and is virtualized into five VM instances by KVM. Each VM has two vCPUs and 2 GB of memory and runs 64-bit CentOS 6.2. TPC-W is a transaction processing benchmark used to emulate online book shopping. A typical TPC-W deployment consists of three tiers: a web service, a servlet processing service, and a database service. In our controlled environment, we set up one Apache Httpd node, four Tomcat nodes, and one MySQL node to provide the web, application, and database services respectively; these services run in dedicated VM instances.
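The fixed-length training window mentioned above can be pictured as one bounded buffer per metric: once every series holds 200 samples, graph construction can start, and older samples are dropped as new ones arrive. This is an illustrative sketch only; the metric names and collection API are hypothetical, not CauseInfer's code.

```python
from collections import deque

class MetricWindow:
    """Fixed-length training window, one bounded series per metric."""

    def __init__(self, metrics, length=200):
        # deque(maxlen=...) silently evicts the oldest sample when full.
        self.series = {m: deque(maxlen=length) for m in metrics}

    def record(self, sample):
        # sample: mapping of metric name -> latest observed value,
        # e.g. {"CPU_UTIL": 0.42, "NET_RxKBT": 118.0}
        for name, value in sample.items():
            self.series[name].append(value)

    def ready(self):
        # Causality-graph construction starts once every series is full.
        return all(len(s) == s.maxlen for s in self.series.values())
```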

Effectiveness Evaluation: Our system strongly relies on the metric causality graph, so we first validate the correctness of the causality graph constructed by the "conservative" algorithm. Figure 10 shows part of the causality graph built for an Httpd service in the TPC-W benchmark. From this figure, we can see that all the relations are reasonable except one: NET_TxKBT -> NET_RxKBT. Intuitively, for a server, the send traffic (NET_TxKBT) should vary with the receive traffic (NET_RxKBT), but we get the opposite direction here. The most likely reason is that our system runs in a closed loop: the workload generator issues a new request only when it receives the response from the Httpd server, and this biases our result. In real-world systems this phenomenon is rare, as those systems are open.


Comparison: To evaluate the effectiveness of CauseInfer more comprehensively, we compare it with several previous approaches. 1) Tree-Augmented Bayesian Network (TAN): TAN has been adopted to diagnose performance problems; for comparison, we substitute the PC algorithm with TAN to construct the causality graph. 2) NetMedic: although the original approach is designed to infer faulty components based on a history-based gauge, a kind of "correlation" to some extent, it can also be used to build a metric dependency graph; we compare it with our system at both the component-level and metric-level diagnosis. 3) PAL: it pinpoints the faulty components in distributed applications by extracting anomaly propagation patterns with change-point correlation. 4) FChain: it shares the same idea as PAL but leverages a different anomaly detection method. To reduce bias in the diagnosis results due to implementation deviations, we guarantee that the injected faults cause significant SLO violations.

Scalability: Most of the computation, including causality graph construction and cause inference, is done with local information, so our system readily scales out in large distributed systems. To validate its scalability, we add more services (e.g., Tomcat) to the system. Moreover, we qualitatively compare the diagnosis results and execution time of CauseInfer and the other systems in more realistic environments. Although 200 data points are sufficient to build a causality graph in CauseInfer, that is not the case for TAN and NetMedic. To conduct a fair comparison, we use 1,800 data points (i.e., 5 hours of data) to construct the causality graphs with CauseInfer, TAN, and NetMedic respectively. Figure 19 shows the scalability test results of CauseInfer, TAN, and NetMedic for metric-level diagnosis in TPC-W when only one fault is injected at a time. From the results, we observe a negligible degradation in precision and recall as the number of services grows from 4 to 40.
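The precision and recall used in the evaluation above can be computed per fault-injection campaign as follows: precision is the fraction of reported causes that match the injected fault, and recall is the fraction of injected faults whose true cause was reported. This is a minimal sketch of that bookkeeping; the run and cause names are hypothetical.

```python
def precision_recall(diagnosed, actual):
    """Precision/recall over a set of fault-injection runs.

    diagnosed: run id -> root cause reported by the diagnosis system
    actual:    run id -> root cause that was actually injected
    """
    # A true positive is a run whose reported cause matches the injected one.
    tp = sum(1 for run, cause in diagnosed.items() if actual.get(run) == cause)
    precision = tp / len(diagnosed) if diagnosed else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall
```

For example, reporting the correct cause for one of two diagnosed runs out of three injected faults yields precision 0.5 and recall 1/3.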

