We are working to develop closed-loop control techniques to automate resource allocation and service-level management in IT systems and applications. This fits into the line of work at HP Labs that aims at developing model-driven automation technologies for next-generation data centers.
This work is motivated by the need to enable IT applications to meet service-level objectives (SLOs) specified by operators in the presence of system and workload changes. Although long-term trends in workloads can be managed through capacity planning, systems need to react on short time scales to unanticipated demands, whether these result from system failures elsewhere in the data center or from short-term spikes in demand.
In addition, in many cases, additional efficiency can be obtained from existing data center resources by appropriately allocating resources to increase utilization. Prior work in this area has largely relied on ad-hoc techniques or statistical and/or optimization methods.
Feedback control and formal control-theory techniques have been developed over the last century to manage physical systems and industrial processes. A significant body of literature exists on modeling system behavior mathematically, and on using those models, along with online measurements, to control systems in real time. However, applications of control-theoretic models have been limited within IT systems, both because such systems are hard to characterize, and because appropriate "actuators" that can exercise dynamic control over the systems were either not present or not easy to use.
In this work, we conduct research aimed at understanding how resource allocation and performance tuning for IT applications can be automated using formal control theory-based approaches. In particular, we would like a better understanding of the following issues:
- What are the boundaries for applying control theory to IT systems?
- How can typical management problems in IT systems be translated into control problems?
- How can the sensors and actuators currently available in IT systems be exploited within a feedback control framework? How do they need to evolve to better manage IT systems?
- How do we model IT systems and applications in general, and specifically how do we deal with system nonlinearities and actuator delays and overheads?
- How do different control techniques compare in terms of applicability and effectiveness when used for systems problems?
As a case study, we focus on resource-management issues for virtualized systems that run business-critical IT applications in enterprise data centers. More specifically, we have studied dynamic sizing of resource containers to meet the service-level objectives (SLOs) of one or more applications running on a shared service platform, as application workloads or performance targets vary over time. We have experimented with resource containers that are created using HP Process Resource Manager (PRM) and Virtual Partitions (vPar) on HP-UX systems or Xen virtual machines on Linux systems.
Our key results so far include the following:
- Dynamic allocation of CPU resources to a single resource container to increase resource utilization while meeting application-level performance targets, using adaptive control (DSOM2005), predictive control (NOMS2006), and nested control loops (ACC2006)
- Dynamic allocation of CPU resources to multiple resource containers running a multi-tier application in order to meet application-level performance targets, based on synthesis of a feed-forward transaction-mix-based queueing model and feedback control loops (FeBID2007)
- Dynamic allocation of CPU resources to resource containers on a shared service platform, where multiple multi-tier applications share a common pool of virtualized servers, so that each application meets its QoS goals in spite of varying application demands (EuroSys2007)
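To make the idea of feedback-driven container sizing concrete, the following is a minimal sketch of a single control loop that adjusts the CPU allocation of one container to hold its utilization near a target. It is an illustration only, not the adaptive, predictive, or nested controllers from the papers cited above. The class name `IntegralController`, the function `simulate`, the 80% utilization target, the gain, and the allocation limits are all assumptions chosen for the example, and the workload is modeled simply as a fixed CPU demand.

```python
from dataclasses import dataclass


@dataclass
class IntegralController:
    """Hypothetical integral controller for one container's CPU allocation."""
    target_util: float = 0.8   # assumed utilization target (consumed / allocated)
    gain: float = 0.5          # assumed integral gain; 0 < gain <= 1 in this sketch
    min_alloc: float = 0.1     # assumed lower bound on allocation, in CPU cores
    max_alloc: float = 4.0     # assumed upper bound on allocation, in CPU cores

    def step(self, alloc: float, consumed: float) -> float:
        """Return the next allocation given the current one and measured usage."""
        util = consumed / alloc
        error = util - self.target_util
        # Grow the allocation when utilization is above target, shrink it when below.
        new_alloc = alloc + self.gain * alloc * error / self.target_util
        # Actuator limits: the container cannot go below or above these bounds.
        return max(self.min_alloc, min(self.max_alloc, new_alloc))


def simulate(controller: IntegralController, demand: float,
             alloc: float, steps: int) -> float:
    """Run the loop against a constant CPU demand and return the final allocation."""
    for _ in range(steps):
        # The "sensor": a container can consume at most what it has been allocated.
        consumed = min(demand, alloc)
        alloc = controller.step(alloc, consumed)
    return alloc


if __name__ == "__main__":
    final = simulate(IntegralController(), demand=1.2, alloc=1.0, steps=50)
    print(f"final allocation: {final:.3f} cores")
```

With a constant demand of 1.2 cores and a target utilization of 0.8, the loop settles near 1.2 / 0.8 = 1.5 cores. When the demand exceeds what the upper limit can serve, the allocation saturates at `max_alloc`, which is one simple instance of the actuator constraints raised in the questions above.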
As ongoing work, we are pursuing the following research directions:
- Dynamic control of multiple resource types to applications in virtualized environments
- The costs of dynamic control, in terms of capacity and performance overhead and impact on system stability, and how these costs can be taken into account in control algorithms
- Unified power management framework for data centers, in collaboration with the Smart Power project
- Integrated diagnosis and control of IT systems and applications, in collaboration with the Statistical Learning and Control (SLIC) project
- Xiaoyun Zhu
- Zhikui Wang
- Sharad Singhal
- Xue Liu (UIUC; Intern, 2004; Visiting Scholar, 2006-2007)
- Wei Xu (UC-Berkeley, Intern, 2005)
- Pradeep Padala (Univ. of Michigan, Intern, 2006-2007)
- Marcela Vizcay (Univ. of Chile, Intern, 2007)
- Mustafa Uysal (HP Labs)
- Arif Merchant (HP Labs)
- Partha Ranganathan (HP Labs)
- Vanish Talwar (HP Labs)
- Kang Shin (Univ. of Michigan)
- Ken Salem (Univ. of Waterloo)