Technical Reports

HP Labs



Achieving Scalable Automated Diagnosis of Distributed Systems Performance Problems

Huang, Chengdu; Cohen, Ira; Symons, Julie; Abdelzaher, Tarek

HPL-2006-160R1

Keyword(s): system performance diagnosis; machine learning; transfer learning; scalability

Abstract: Distributed systems continue to grow in scale and complexity, resulting in increasingly involved interactions among components and increasingly intricate failure modes that are very hard to diagnose manually. This increased vulnerability of larger systems, together with the increased difficulty of failure diagnosis, has motivated machine learning approaches to automate the diagnosis task. While encouraging preliminary results have been achieved, scaling up the existing approaches to large applications remains challenging. As scale increases, current approaches suffer from the curse of dimensionality, exacerbated by the exploding set of system states and measured metrics. In this paper, we significantly improve the scalability of performance diagnosis methods. Our contributions lie in the use of (i) an intelligent partitioning of the metric space, coupled with a cooperative temporal segmentation algorithm, dividing system observations in time and in space to remove the multiplicative explosion of system states, and (ii) transfer learning techniques that improve accuracy by leveraging dependencies among the partitions. We validate our approaches on several months of production traces from a customer-facing, geographically distributed, 24x7, 3-tier internet service. Our results show a significant accuracy improvement (35% on average) over the naive partitioning of the state space (without the new temporal segmentation algorithm or transfer learning), and an order of magnitude reduction in computational cost over the "brute force" approach of learning with no partitioning, without loss of accuracy.

14 Pages


© 2009 Hewlett-Packard Development Company, L.P.
