
   
9.1.1 LSAG in Day-to-day Operations

As mentioned earlier, the LSAG provides two data sources in real time: messages about topology changes and anomalous behavior, and network topology snapshots. Both sources provide valuable insight into the health of a network.

We have developed a web-site for viewing LSAG messages, working with network operators to make the site as simple and user-friendly as possible. The web-site allows operators to query the LSAG message logs, generate statistics about the messages, and browse past archives. It uses a configuration management tool to map IP addresses to names. Network support and operations staff now use the web-site regularly, and it has proved invaluable during network maintenance, both for validating maintenance steps and for monitoring the impact of maintenance on the network-wide behavior of OSPF.

Network operations groups also generate alarms from the LSAG messages by feeding them into higher-layer alerting systems, which in turn allows correlation and grouping with the output of other monitoring tools. To prevent a deluge of alerts when LSAG messages arrive at a high frequency, we have taken two steps. First, we prioritize messages to help operators in the event of ``too many flashing lights''. For example, the alerting system assigns a ``RTR DOWN'' message a higher priority than a ``RTR UP'' message. Second, we group multiple messages into a single alarm. For example, a fiber cut can bring down a number of adjacencies, prompting the LSAG to generate several ``ADJACENCY DOWN'' messages. We group all these messages into a single alert to prevent a flurry of alerts for a single underlying event.
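The two steps can be illustrated with a minimal Python sketch. It is not the alerting system's actual logic. The message tuple layout, the 60-second grouping window, the numeric priority values, and the rule of grouping by message type are all assumptions made for illustration. The only ordering taken from the text is that ``RTR DOWN'' outranks ``RTR UP''.

```python
# Hypothetical priorities (lower number = more urgent). The text only states
# that "RTR DOWN" outranks "RTR UP"; every other value here is an assumption.
PRIORITY = {"RTR DOWN": 0, "RTR UP": 1}
DEFAULT_PRIORITY = 1


def group_messages(messages, window=60.0):
    """Group (timestamp, msg_type, detail) tuples into prioritized alarms.

    A message joins the open alarm of its type if it arrives within
    `window` seconds of that alarm's latest message; otherwise it opens
    a new alarm. Alarms are returned most urgent first.
    """
    alarms = []
    open_alarm = {}  # msg_type -> most recent alarm of that type
    for ts, msg_type, detail in sorted(messages):
        alarm = open_alarm.get(msg_type)
        if alarm is None or ts - alarm["last"] > window:
            alarm = {"type": msg_type, "first": ts, "last": ts, "details": []}
            alarms.append(alarm)
            open_alarm[msg_type] = alarm
        alarm["last"] = ts
        alarm["details"].append(detail)
    # Most urgent first; ties broken by the time the alarm started.
    alarms.sort(key=lambda a: (PRIORITY.get(a["type"], DEFAULT_PRIORITY), a["first"]))
    return alarms


# Example: two adjacency losses one second apart collapse into one alarm.
example = [
    (0.0, "ADJACENCY DOWN", "r1-r2"),
    (1.0, "ADJACENCY DOWN", "r1-r3"),
    (2.0, "RTR UP", "r9"),
    (3.0, "RTR DOWN", "r5"),
    (500.0, "ADJACENCY DOWN", "r7-r8"),
]
example_alarms = group_messages(example)
```

Grouping by message type and arrival time is the simplest possible rule. A production system would more likely group by the underlying cause, for instance by the shared fiber or LAN, which requires topology information that this sketch does not model.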

Network operators may change OSPF link weights from their design values to carry out maintenance tasks. We have designed a ``link-audit'' web-site that allows operators to keep track of such link weight changes. The web-site uses the topology snapshots to display the set of links whose weights differ from the design weights, which allows operators to validate the steps carried out for maintenance. At the end of the maintenance interval, the web-site also allows operators to verify that the weights of the affected links have been restored to their original values.
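The comparison behind such an audit amounts to a difference between two weight tables. The sketch below is an illustration only. The function name `audit_link_weights`, the representation of a link as a pair of router names, and the dictionary layout are assumptions. The section does not describe how the web-site stores design weights or snapshots.

```python
def audit_link_weights(design, snapshot):
    """Return links whose current weight differs from the design weight.

    `design` and `snapshot` map a link, here a (router_a, router_b) tuple,
    to its OSPF weight. The result maps each deviating link to a
    (design_weight, current_weight) pair; current_weight is None when the
    link is absent from the snapshot.
    """
    deviations = {}
    for link, design_weight in design.items():
        current = snapshot.get(link)
        if current != design_weight:
            deviations[link] = (design_weight, current)
    return deviations


# Example: one link has been costed out for maintenance.
design_weights = {("r1", "r2"): 10, ("r2", "r3"): 20, ("r3", "r4"): 10}
during_maintenance = {("r1", "r2"): 10, ("r2", "r3"): 65535, ("r3", "r4"): 10}
changed = audit_link_weights(design_weights, during_maintenance)

# After maintenance, an empty result confirms the weights were restored.
restored = audit_link_weights(design_weights, dict(design_weights))
```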

Below we describe a few specific cases where the LSAG served to identify network problems.

1. Internal problem in a crucial router: The LSAG identified an intermittent hardware problem in a crucial router in area 0 of the enterprise network [7]. The problem caused episodes, each lasting only a few minutes and occurring only a few times a day, during which the problematic router would drop and re-establish adjacencies with other routers on the LAN. The data suggests that during these episodes the network was at risk of partitioning or was in fact partitioned, and a second router failure could have resulted in a catastrophic loss of connectivity. Fortunately, the flurry of ``ADJACENCY UP'' and ``ADJACENCY DOWN'' messages recorded by the LSAG during each episode helped operators identify the problem and perform preventative maintenance. It is worth noting that this problem did not show up in the other network management tools used by the enterprise network.

2. External link flaps: The LSAG helped identify a flapping external link in the enterprise network [7]. One of the enterprise network routers (call it A) maintains a link to a customer-premise router (call it B) over which it runs EIGRP. Router A imports EIGRP routes into OSPF as external LSAs. LSAG messages led to a closer inspection of network conditions, which revealed that the EIGRP session between A and B started flapping whenever the link between A and B became overloaded. As a result, router A repeatedly announced and withdrew EIGRP prefixes via external LSAs. The flapping recurred nearly every day for months, between 9 PM and 3 AM. The LSAG messages (``TYPE-5 ROUTE ANNOUNCED'' and ``TYPE-5 ROUTE WITHDRAWN'') helped network operators identify and mitigate the problem, though they could not eliminate it completely because they did not have access to the customer-premise router.

3. Router configuration problem: In another case, the LSAG helped operators of the enterprise network identify a configuration problem: the same router-id had been assigned to two routers. This error caused the two routers to repeatedly re-originate their router LSAs, which showed up as a series of ``ADJACENCY UP/DOWN'' LSAG messages.

4. Refresh LSA bug: The LSAG helped identify a bug in the refresh algorithm of the routers from a particular vendor in the ISP network. Under certain circumstances, the bug caused summary LSAs to be refreshed much more often than the RFC-mandated [1] interval of 30 minutes. The bug was identified through the ``LSA STORM'' messages generated by the LSAG. At the time of writing this paper, the vendor is investigating the bug. It is worth noting that it would be impossible to catch such a bug with any other class of available network management tools.
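To make the refresh check in the last case concrete, the following Python sketch flags LSAs whose observed refreshes arrive far more often than every 30 minutes. It is not the LSAG's ``LSA STORM'' logic, which this section does not describe. The function name `fast_refreshers`, the input layout, and both thresholds are assumptions. Only the 30-minute interval comes from the text.

```python
LS_REFRESH_TIME = 1800  # seconds; the 30-minute refresh interval cited above


def fast_refreshers(arrivals, min_fraction=0.5, min_short_gaps=3):
    """Return the ids of LSAs that are refreshed suspiciously often.

    `arrivals` maps an LSA identifier to the timestamps (in seconds) at
    which refreshes of that LSA were observed. An LSA is flagged when at
    least `min_short_gaps` gaps between consecutive refreshes are shorter
    than `min_fraction * LS_REFRESH_TIME`. Both thresholds are
    illustrative assumptions.
    """
    threshold = min_fraction * LS_REFRESH_TIME
    flagged = []
    for lsa_id, times in arrivals.items():
        ordered = sorted(times)
        short_gaps = sum(
            1 for earlier, later in zip(ordered, ordered[1:])
            if later - earlier < threshold
        )
        if short_gaps >= min_short_gaps:
            flagged.append(lsa_id)
    return sorted(flagged)


# Example: one summary LSA refreshed every minute, one on schedule.
observed = {
    "summary 10.0.0.0/8": [0, 60, 120, 180, 240],
    "summary 10.1.0.0/16": [0, 1800, 3600],
}
suspects = fast_refreshers(observed)
```

Requiring several short gaps, rather than one, keeps a single legitimate re-origination after a topology change from being mistaken for a refresh bug.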


aman shaikh
2004-02-07