Juggling the Jigsaw: Towards Automated Problem Inference from Network Trouble Tickets

Rahul Potharaju; Navendu Jain; Cristina Nita-Rotaru

Authors:

Rahul Potharaju, Purdue University; Navendu Jain, Microsoft Research; Cristina Nita-Rotaru, Purdue University

Abstract:

This paper presents NetSieve, a system that aims to do automated problem inference from network trouble tickets. Network trouble tickets are diaries comprising fixed fields and free-form text written by operators to document the steps while troubleshooting a problem. Unfortunately, while tickets carry valuable information for network management, analyzing them to do problem inference is extremely difficult—fixed fields are often inaccurate or incomplete, and the free-form text is mostly written in natural language.

This paper takes a practical step towards automatically analyzing natural language text in network tickets to infer the problem symptoms, troubleshooting activities and resolution actions. Our system, NetSieve, combines statistical natural language processing (NLP), knowledge representation, and ontology modeling to achieve these goals. To cope with ambiguity in free-form text, NetSieve leverages learning from human guidance to improve its inference accuracy. We evaluate NetSieve on 10K+ tickets from a large cloud provider, and compare its accuracy using (a) an expert review, (b) a study with operators, and (c) vendor data that tracks device replacement and repairs. Our results show that NetSieve achieves 89%-100% accuracy and its inference output is useful to learn global problem trends. We have used NetSieve in several key network operations: analyzing device failure trends, understanding why network redundancy fails, and identifying device problem symptoms.

BibTeX

@inproceedings {180300,
author = {Rahul Potharaju and Navendu Jain and Cristina Nita-Rotaru},
title = {Juggling the Jigsaw: Towards Automated Problem Inference from Network Trouble Tickets},
booktitle = {10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13)},
year = {2013},
isbn = {978-1-931971-00-3},
address = {Lombard, IL},
pages = {127--141},
url = {https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/potharaju},
publisher = {USENIX Association},
month = apr
}

Public Summary:

by Kobus Van der Merwe

In dealing with network problems, network operators often make use of so-called "trouble tickets." These trouble tickets get created when the network problem is first identified and then serve as a means to document the investigation into the problem, and, hopefully, the resolution thereof. As such, trouble tickets contain a wealth of information about the network and the problems it experiences over time. Trouble tickets are often structured documents, with a certain number of fixed fields identifying, for example, the time of the event and the manner in which the trouble was detected. Unfortunately, these fixed fields are typically of limited use, and the most useful information about the trouble and its resolution is captured in free text form by the human operators "working the trouble." Extracting useful information from trouble tickets is therefore an extremely challenging problem.

In "Juggling the Jigsaw," the authors present a system that does an automated analysis of network trouble tickets to infer what the underlying problem was. Specifically, the goal of the work is to allow operators to see and understand global trends in the network, rather than making decisions based on isolated incidents. The authors developed a system that makes use of natural language processing techniques to extract patterns and knowledge from a "learning set" of trouble tickets. During this learning phase they make use of domain experts to develop an ontology, which is stored in a knowledge base. The resulting knowledge base is then used to perform problem inferences against trouble tickets. During this operational phase, the system also allows for incremental learning based on feedback from domain experts.
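The two phases described above can be illustrated with a toy sketch. This is not NetSieve's implementation: the phrase list, the three class labels, and the function names `infer` and `add_feedback` are all assumptions made for illustration, and plain phrase matching stands in for the statistical NLP and ontology modeling the authors actually use.

```python
import re

# Hypothetical knowledge base built during a "learning phase":
# each phrase is mapped to an assumed ontology class.
CLASSES = ("problem", "activity", "action")
KNOWLEDGE_BASE = {
    "link down": "problem",
    "packet loss": "problem",
    "checked cable": "activity",
    "ran diagnostics": "activity",
    "replaced power supply": "action",
    "rebooted device": "action",
}


def infer(ticket_text, kb):
    """Tag free-form ticket text with the knowledge-base phrases it contains."""
    # Normalize case and whitespace so matching is insensitive to both.
    text = re.sub(r"\s+", " ", ticket_text.lower())
    result = {label: [] for label in CLASSES}
    # Try longer phrases first so the most specific ones are listed first.
    for phrase in sorted(kb, key=len, reverse=True):
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            result[kb[phrase]].append(phrase)
    return result


def add_feedback(kb, phrase, label):
    """Incremental learning stand-in: an expert adds or corrects one phrase."""
    if label not in CLASSES:
        raise ValueError(f"unknown class: {label}")
    kb[phrase.lower()] = label


kb = dict(KNOWLEDGE_BASE)
ticket = "Link down on switch. Ran diagnostics, then replaced power supply."
inference = infer(ticket, kb)

# Operational-phase feedback: a domain expert teaches the system a new phrase.
add_feedback(kb, "Fan failure", "problem")
updated = infer("Fan failure observed on the device", kb)
```

In this sketch the knowledge base is a flat dictionary, so expert feedback is a single insertion; a real ontology would also need to capture relationships between entities, problems, and actions.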

The authors do a very nice job of explaining their approach by showing intermediate results and specific examples from their data set along the way. The system is evaluated by comparing its problem inferences of two sets of tickets with "ground truth" obtained through manual labeling of the same tickets by domain experts. The system achieved high accuracy for both data sets. The system has also been deployed in a cloud-provider environment, and the authors present anecdotal evidence of the utility of their system in this environment.
