SAGE - Sage feature

Enough SNMP to Be Dangerous, Part 3


by Elizabeth Zwicky
<zwicky@greatcircle.com>

Elizabeth was a founding member of SAGE and is currently on the USENIX Board of Directors.




This is a series of articles dedicated to teaching typical UNIX system-administrator types—people who can compile public-domain software and have some idea about TCP/IP—how to do UNIX-style hackery with SNMP. It is not an elegant and systematic approach to SNMP, but it should give you enough background to be dangerous.

In part 1 (;login:, December 1998), we discussed the basics of SNMP and SNMP tools; in part 2 (;login:, August 1999), we turned these into a little tool for resolving "my uptime is better than yours" arguments. In part 3, we move on to counters and less frivolous uses of SNMP.

We've wandered through the system group of MIB-II, but there are a bunch of other mandatory groups we can look at on any SNMP device. Of those, my favorites are interface status and statistics information. For each network interface on the device, you can find out handy facts like how many packets it has received, how many of them were bad, and how many it had to discard. These are the sorts of counters that fancy network-administration programs use in order to do elegant things. They can, of course, be used for much stupider purposes. Unfortunately, there is a catch.

Counters in SNMP are supposed to behave like the odometer in a car. They go up. When they get to the top, they start over at 0. They are not actually supposed to reset on reboot (or at any other time), except when otherwise specified. So what happens if you see numbers like the following pair?

Input unicast packets: 1,000
Input unicast packet errors: 500,000

Well, there are three obvious possibilities.

  1. This interface is on a remarkably broken network.

  2. This device has gotten terribly confused.

  3. The input unicast packet counter has rolled over, but the error counter hasn't; the right number is in fact 4,294,967,296 + 1,000 (or more, if the counter has rolled over more than once).

    Nothing within SNMP will disambiguate these three cases. Furthermore, if the problem is that the device is confused, and the device is actually SNMP-compliant, there's nothing you can do about it — there is no way to reset the counters. The official position on this is that the actual counter values aren't meaningful; it's the rate of change of the counter values that is meaningful. This is great if you're running network-management software that sits around monitoring things continuously, but less than useful if you're just poking at things once in a while.
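If you do want the rate of change, the arithmetic is short. The following is a minimal sketch in Python, not part of the tool shown later; the function names are made up for illustration, and it assumes a 32-bit counter that rolled over at most once between the two readings and was never reset.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit counter wraps from 4,294,967,295 to 0


def counter_delta(earlier, later, modulus=COUNTER32_MODULUS):
    """Return how far a wrapping counter advanced between two readings.

    Assumption: at most one rollover between the readings, and no reset.
    """
    return (later - earlier) % modulus


def rate_per_second(earlier, later, seconds, modulus=COUNTER32_MODULUS):
    """Average rate of change between two readings taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("readings must be taken at different times")
    return counter_delta(earlier, later, modulus) / seconds


# A counter read shortly before and shortly after it rolls over:
print(counter_delta(4294967000, 704))        # 1000
print(rate_per_second(4294967000, 704, 10))  # 100.0
```

Taking the difference modulo 2^32 is what makes the rollover harmless. It does nothing for a device that resets or pegs its counters, which is why two readings close together are more trustworthy than two readings a month apart.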

    Fortunately or unfortunately, compliance with this part of the SNMP spec is just as good as compliance with any other piece. The devices I've looked at fall into three classes. My favorite ones reset the counter on reboot, which completely confuses continuous network-management software, but works well for me. My second favorites are the ones that actually implement the spec; it's somewhat annoying, but I understand the reasoning, and at least I know what to expect. The ones that truly annoy me are our Octel voicemail machines, which have counters that not only fail to reset but also fail to roll over. This would be marginally less annoying if they counted up to the full value of an SNMP counter (2^32 - 1), but instead they only count up to 65,535 (2^16 - 1). As a result, most of the more interesting ones pegged their counters within a week or so.

    In practice, what I do about these counter issues is ignore them. I'm not writing professional SNMP management tools; I'm doing quick-and-dirty network debugging. I have a program to calculate error rates, and then I look at them. If they're insane, I apply obvious human-mediated debugging techniques to sort out the three possible reasons. (For instance, does the network work at all? If so, then there is not a 50,000% error rate. Does the device return equally bizarre information if queried otherwise? If so, then perhaps it's confused, and rebooting it will improve the world.)
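The "is this insane?" check can be written down, for whatever that is worth. This is a rough sketch in Python with hypothetical function names; the 100 percent cutoff is my assumption about what counts as implausible, not anything SNMP defines.

```python
def error_rate_percent(errors, packets):
    """Errors as a percentage of packets, or None if there are no packets."""
    if packets <= 0:
        return None
    return errors * 100 / packets


def looks_insane(errors, packets):
    """True when there are more errors than packets.

    Assumption: a rate above 100 percent is treated as a sign of a rolled-over
    counter or a confused device, and is left for a human to sort out.
    """
    rate = error_rate_percent(errors, packets)
    return rate is not None and rate > 100


# The pair of numbers from the example above:
print(error_rate_percent(500000, 1000))  # 50000.0
print(looks_insane(500000, 1000))        # True
print(looks_insane(12, 3224161))         # False
```

All this does is tell you when to stop trusting the arithmetic and start poking at the network by hand.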

    With that in mind, let's keep working through MIB-II. We've pretty thoroughly mined the system; the remaining parts of MIB-II are:

     interfaces
     at
     ip
     icmp
     tcp
     udp
     egp
     transmission
     snmp

    First, let's decide to ignore some of these forever. "at" is the address translation group, which is officially deprecated because its meaning is entirely dependent on the protocols your device happens to be speaking. If TCP/IP is the device's favorite protocol, the "at" group is basically an exceedingly annoying way of representing the ARP cache.

    The "transmission" group has data about the transmission media underlying the interfaces, if it has anything at all, which doesn't happen all that often. You probably don't care; I certainly don't.

    The "egp" and "icmp" groups have information about their respective protocols, if the device implements them. Once again, this is all very well, but they're just not very interesting protocols for most purposes. The "snmp" group is one of the best examples, outside of particle physics, of the Heisenberg effect, whereby observing something changes the value observed. Of course, if you send an SNMP get command to get the value of snmp.snmpInPkts, which is the number of SNMP packets received, you increment the counter. Aside from this minor and recondite pleasure, the snmp group doesn't have many applications. If you track how many requests you make to a machine, you can see whether anybody else is playing with SNMP, but tracking them down is a separate and thornier problem.
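For the curious, the bookkeeping for "is anybody else playing with SNMP?" looks roughly like this. It is a sketch in Python with a made-up function name, and it assumes that none of your own requests were lost on the way and that the agent counts a request before answering it.

```python
def foreign_snmp_packets(before, after, own_requests, modulus=2 ** 32):
    """Estimate how many SNMP packets came from somebody else.

    `before` and `after` are two readings of snmp.snmpInPkts. `own_requests`
    is the number of requests we sent after the one that fetched `before`,
    up to and including the one that fetched `after`.
    Assumptions: a 32-bit counter, at most one rollover, no lost requests.
    """
    received = (after - before) % modulus
    # A lost request would make this negative, so clamp at zero.
    return max(0, received - own_requests)


# The counter moved by 15 while we sent 10 requests of our own:
print(foreign_snmp_packets(1200, 1215, 10))  # 5
```

A nonzero answer tells you that somebody is out there, and nothing about who.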

    That leaves us with the apparently useful "interfaces," "ip," "tcp," and "udp" groups. Here's a walk through part of the interfaces group, to illustrate how SNMP tables work:

    interfaces.ifNumber.0 = 2
    interfaces.ifTable.ifEntry.ifIndex.1 = 1
    interfaces.ifTable.ifEntry.ifIndex.2 = 2
    interfaces.ifTable.ifEntry.ifDescr.1 = Silicon Graphics ec Ethernet controller
    interfaces.ifTable.ifEntry.ifDescr.2 = Silicon Graphics lo Loopback interface
    interfaces.ifTable.ifEntry.ifType.1 = ethernetCsmacd(6)
    interfaces.ifTable.ifEntry.ifType.2 = softwareLoopback(24)
    interfaces.ifTable.ifEntry.ifMtu.1 = 1500
    interfaces.ifTable.ifEntry.ifMtu.2 = 8304
    interfaces.ifTable.ifEntry.ifSpeed.1 = Gauge: 10000000
    interfaces.ifTable.ifEntry.ifSpeed.2 = Gauge: 200000000
    interfaces.ifTable.ifEntry.ifPhysAddress.1 = 8:0:69:2:f6:ff
    interfaces.ifTable.ifEntry.ifPhysAddress.2 =
    interfaces.ifTable.ifEntry.ifAdminStatus.1 = up(1)
    interfaces.ifTable.ifEntry.ifAdminStatus.2 = up(1)
    interfaces.ifTable.ifEntry.ifOperStatus.1 = up(1)
    interfaces.ifTable.ifEntry.ifOperStatus.2 = up(1)
    interfaces.ifTable.ifEntry.ifInOctets.1 = 2081101543
    interfaces.ifTable.ifEntry.ifInOctets.2 = 31835092
    interfaces.ifTable.ifEntry.ifInUcastPkts.1 = 3224161
    interfaces.ifTable.ifEntry.ifInUcastPkts.2 = 500898
    interfaces.ifTable.ifEntry.ifInNUcastPkts.1 = 926910
    interfaces.ifTable.ifEntry.ifInNUcastPkts.2 = 0

    interfaces.ifNumber is a familiar, single-instance variable that tells how many interfaces the machine has. You were probably assuming that we'd been using "0" because SNMP counts starting at 0. As you can see, this is false. Actually, if there's anything to count, it is not allowed to start below 1. (In most cases, it will start at 1, but trusting in anything is unwise with SNMP.) And it gets worse — check this out:

    interfaces.ifNumber.0 = 15
    interfaces.ifTable.ifEntry.ifDescr.1 = Serial0/0
    interfaces.ifTable.ifEntry.ifDescr.2 = Serial0/1
    interfaces.ifTable.ifEntry.ifDescr.3 = Serial0/2
    interfaces.ifTable.ifEntry.ifDescr.4 = Serial0/3
    interfaces.ifTable.ifEntry.ifDescr.5 = Serial0/4
    interfaces.ifTable.ifEntry.ifDescr.6 = Serial0/5
    interfaces.ifTable.ifEntry.ifDescr.7 = Serial0/6
    interfaces.ifTable.ifEntry.ifDescr.8 = Serial0/7
    interfaces.ifTable.ifEntry.ifDescr.9 = Ethernet1/0
    interfaces.ifTable.ifEntry.ifDescr.10 = Ethernet1/1
    interfaces.ifTable.ifEntry.ifDescr.11 = Ethernet1/2
    interfaces.ifTable.ifEntry.ifDescr.12 = Ethernet1/3
    interfaces.ifTable.ifEntry.ifDescr.13 = FastEthernet2/0
    interfaces.ifTable.ifEntry.ifDescr.14 = FastEthernet2/1
    interfaces.ifTable.ifEntry.ifDescr.23 = Serial0/7.110

    That's right, it has 15 interfaces, numbered 1 through 14, and 23. That's OK. It's allowed to do that. Of course, if you're trying to loop through all the interfaces, this makes life unpleasant. Fortunately, SNMP allows you to do a "get next." A get next on interfaces.ifTable.ifEntry.ifDescr.0 (which, you will note, doesn't exist, and is guaranteed not to) returns interfaces.ifTable.ifEntry.ifDescr.1 and its value. If you have a handy indicator like interfaces.ifNumber, you can "get next" the appropriate number of times. Otherwise, you may just have to keep going until the next object is either an error or something in another part of the tree.
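To make the "keep going until you fall off the end" approach concrete, here is a sketch in Python that walks one column of a table with nothing but get next. ToyAgent is an in-memory stand-in I made up for illustration, not a real SNMP library; the numeric OIDs are the standard MIB-II ones for ifDescr and ifType.

```python
import bisect


class ToyAgent:
    """In-memory stand-in for an SNMP agent (hypothetical, for illustration)."""

    def __init__(self, objects):
        # OIDs are tuples of integers, so sorting them gives tree order.
        self._oids = sorted(objects)
        self._objects = dict(objects)

    def get_next(self, oid):
        """Return (oid, value) for the first object after `oid`, or None."""
        position = bisect.bisect_right(self._oids, oid)
        if position == len(self._oids):
            return None
        found = self._oids[position]
        return found, self._objects[found]


def walk_column(agent, column):
    """Yield (index, value) for every row of one table column."""
    oid = column + (0,)  # instance 0 is guaranteed not to exist
    while True:
        result = agent.get_next(oid)
        if result is None:
            return  # fell off the end of the agent's tree
        oid, value = result
        if oid[:len(column)] != column:
            return  # wandered into another part of the tree
        yield oid[-1], value


IF_DESCR = (1, 3, 6, 1, 2, 1, 2, 2, 1, 2)  # interfaces.ifTable.ifEntry.ifDescr
IF_TYPE = (1, 3, 6, 1, 2, 1, 2, 2, 1, 3)   # interfaces.ifTable.ifEntry.ifType

agent = ToyAgent({
    IF_DESCR + (14,): "FastEthernet2/1",
    IF_DESCR + (23,): "Serial0/7.110",
    IF_DESCR + (1,): "Serial0/0",
    IF_TYPE + (1,): 22,
})
print(list(walk_column(agent, IF_DESCR)))
# [(1, 'Serial0/0'), (14, 'FastEthernet2/1'), (23, 'Serial0/7.110')]
```

The loop never guesses at an index. It always asks for whatever follows the last thing it saw, so the jump from 14 to 23 costs nothing, and the prefix check is what stops it when the column runs out.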

    So here's a version of a program I've actually used to debug network problems:


    #!/usr/bin/perl5
    #
    # tcpprobs
    # Elizabeth D. Zwicky
    # zwicky@sgi.com
    # July 1998

    use CGI qw(:all);

    use SNMP;

    # This turns on formatted printing of variables
    $SNMP::use_sprint_value = 1;

    $ORANGE_THRES = 5;
    $RED_THRES = 20;

    print header;
    print start_html(-title=>"TCP/IP error rates",
       -bgcolor=>"ffffff");
    print h1("TCP/IP error rates");

    # Up to you to figure out how to get this set as a parameter;
    # the elegant way is to write up a form, but you could always just
    # hand-type it as part of the URL, as in
    # http://yourhost/tcpprobs?hostname=hosttocheck
    $hostname = param('hostname');

     if ($sess = new SNMP::Session(DestHost=>"$hostname")){

     # First we pull a bunch of nice, straightforward single-instance
     # variables.

     $tcpout = $sess->get(["tcp.tcpOutSegs", "0"]);
     $tcpretrans = $sess->get(["tcp.tcpRetransSegs", "0"]);

     $ipin = $sess->get(["ip.ipInReceives", "0"]);

     $ipinhdr = $sess->get(["ip.ipInHdrErrors", "0"]);
     $ipinaddr = $sess->get(["ip.ipInAddrErrors", "0"]);
     $ipdiscard = $sess->get(["ip.ipInDiscards", "0"]);

     print h3("$hostname");
     print p("TCP: $tcpout packets out, $tcpretrans (".
       &ppercent($tcpretrans, $tcpout).
       " percent) TCP retransmission errors <br>"
       );
     print p("IP: $ipin packets received, $ipinhdr (".
       &ppercent($ipinhdr, $ipin).
       " percent) header errors, $ipinaddr (".
       &ppercent($ipinaddr, $ipin) .
       " percent) address errors"
       );

     # And then we wander off into manipulating tables and multiple
     # instances...

     # This is the number of interfaces on the machine

     $interfaces = $sess->get(["interfaces.ifNumber", "0"]);
     print "<table border = 2>\n";

     

     print TR (
       th('Interface'), th('Adm. Stat.'), th('Op. Stat.'),
     th('&nbsp'),
       th('Input Packets'), th('Input Errors'), th('Input Discards'),
       th('&nbsp'),
       th('Output Packets'), th('Output Errors'), th('Output Discards')
       );

    # And now we loop

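    # Interface indexes are not guaranteed to be consecutive, so each
    # getnext starts from the index the previous one returned.
    $interface = 0;
    foreach $count (1..$interfaces){
     $interface =
      $sess ->
       getnext(["interfaces.ifTable.ifEntry.ifIndex" ,
        $interface]);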
     $descr =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifDescr",
        "$interface"]);
     $admin =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifAdminStatus",
        "$interface"]);
     $oper =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifOperStatus",
        "$interface"]);
     $unknown =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifInUnknownProtos",
        "$interface"]);
     $input =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifInNUcastPkts",
        "$interface"]);
     $input +=
      $sess->
       get(["interfaces.ifTable.ifEntry.ifInUcastPkts",
        "$interface"]);
     $inerrs =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifInErrors",
        "$interface"]);
     $indisc =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifInDiscards",
        "$interface"]);
     $output =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifOutNUcastPkts",
        "$interface"]);
     $output +=
      $sess->
       get(["interfaces.ifTable.ifEntry.ifOutUcastPkts",
        "$interface"]);
     $outdisc =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifOutDiscards",
        "$interface"]);
     # Named $outerrs to match the variable printed in the table row below
     $outerrs =
      $sess->
       get(["interfaces.ifTable.ifEntry.ifOutErrors",
        "$interface"]);
     print TR (
      td($descr), td($admin), td($oper),
      td('&nbsp'),
      td($input),
      td("$inerrs (" . &ppercent($inerrs, $input) . ")"),
      td("$indisc (" . &ppercent($indisc, $input) . ")"),
      td('&nbsp'),
      td($output),
      td("$outerrs (" . &ppercent($outerrs, $output) . ")"),
       td("$outdisc (" . &ppercent($outdisc, $output) . ")"),
      );
     }

     print "</table>\n";

    }

    else {
     print p(b("Could not bind to $hostname: $!"));
    }

    print end_html;

     sub ppercent {
      my($num) = $_[0];
      my($denom) = $_[1];

      if ($denom <= 0 ){
       return 0;
      }
      else {
       my($percent) = ($num * 100)/ $denom;
       if ($percent > $RED_THRES){
        return sprintf("<font color=red>%3.2f%%</font>", $percent);
       }
      elsif ($percent > $ORANGE_THRES){
       return sprintf("<font color=orange>%3.2f%%</font>", $percent);
      }
      else {
       return sprintf("%3.2f%%", $percent);
      }
     }
    }

    There are a few tricks here that we haven't already discussed. The "interfaces" group gives me separate numbers for "UcastPkts" (unicast packets) and "NUcastPkts" (non-unicast packets, i.e., multicasts and broadcasts). For my purposes, this is irrelevant, so I add them together.

    There's also this unpleasant-looking result:

    TCP: Wrong Type (should be Counter): NULL packets out, Wrong Type (should be Counter): NULL (0 percent) TCP retransmission errors

    IP: Wrong Type (should be Counter): NULL packets received, (0 percent) header errors, Wrong Type (should be Counter): NULL (0 percent) address errors

    That's a UNIX machine that doesn't keep these statistics in its kernel and therefore doesn't feed them to its SNMP agent, running into a combination of beautiful error handling (the library's) and completely laissez-faire error nonhandling (mine). You could make the output more beautiful, but you can't get blood out of a stone, or TCP retransmission statistics out of a machine running IRIX 5.3's default SNMP agent.
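If you did want more beautiful output, the fix is to check whether you got a number before doing arithmetic with it. Here is a small sketch in Python; the function names and the "not available" wording are my own inventions, and the only thing it assumes about the library is that a good counter comes back as a string of digits.

```python
import re


def parse_counter(text):
    """Return the reply as an integer, or None if it is not a plain number."""
    match = re.fullmatch(r"\s*(\d+)\s*", str(text))
    return int(match.group(1)) if match else None


def describe(label, value):
    """Format one statistic, admitting it when the agent had nothing to say."""
    count = parse_counter(value)
    if count is None:
        return f"{label}: not available"
    return f"{label}: {count}"


print(describe("TCP segments out", "3224161"))
# TCP segments out: 3224161
print(describe("TCP segments out", "Wrong Type (should be Counter): NULL"))
# TCP segments out: not available
```

This doesn't get you the statistics, but it stops the page from pretending that an error message is a packet count.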

    You may wonder how I picked the precise variables I show here. It's clear even from the excerpts I've shown that these are not all the variables that are available to me. This is more or less pure empiricism; I started with a program that displayed pretty nearly everything and got rid of all the ones that I never actually needed, until the remaining information fit well on a page. Depending on your point of view, this is either science at its best or hackery at its worst.

    Next: Exploring MIBs on your own; we discover device-specific MIBs.


Last changed: 29 Nov. 1999 jr