Moises Goldszmidt, Ira Cohen
Hewlett-Packard Labs
Palo Alto, CA
Armando Fox, Steve Zhang
Computer Science Department
Stanford University
Recent research activity [2, 12, 27, 10, 1]
has shown encouraging results for performance debugging, failure detection,
and failure diagnosis in systems, using approaches that automatically induce
models and derive correlations from observed data. We believe that realizing
the full potential of this line of research will require surmounting
fundamental challenges that arise not from the modeling techniques themselves,
but from the application of those techniques to real-world systems.
We formulate three specific challenges. First, as new data is
collected from a system, previously induced models must be continuously
assessed and validated, with the ultimate aim of
achieving online adaptation to system changes. Second, human operators
must be able to effectively interact with the models,
including interpreting model findings to generate explanations, enabling
human feedback to improve the models, and identifying false positives
and missed detections. Third, it should be possible to formally
manipulate ``signatures'' of system state,
allowing us to query the system's past to identify
recurring problems and manually annotate them with
additional information.
We contend that the specifics of this problem domain not only raise
these challenges, but also provide the knowledge base from which to
derive well-engineered solutions to them.