Pp. 119-130 of the Proceedings
Thresh is a simple SNMP [1] monitor, written in Scotty [2] (Tcl [3] with the Tnm extensions), which uses the UNIX file system hierarchy for configuration and data storage. Thresh compares SNMP variables to per-device thresholds or values, and issues notifications if any current SNMP variable values are unacceptable (or unexpected). Thresh can be used by itself, or as a complement to other network management and monitoring tools. Thresh can be thought of as fitting in between tools that generate immediate emergency alerts (such as Big Brother [4]) and trending and history tools (such as Cricket [5]).
Virtually every computing system and network has some kind of monitoring mechanism in place these days, but in some cases the monitoring system consists only of users phoning and asking if there is something wrong with the network. For those interested in something a little more advanced or automatic, there are quite a few software packages available that do some form of monitoring. They range from the very simple (ping tests, etc.) to the very complex (large commercial packages that map networks, configure devices, and make your lunch), with various alternatives in between.
In the realm of ``simple'' monitoring software, packages can generally be divided into two types:
A couple of years ago it occurred to me that there was another class of monitoring that didn't seem to be very well addressed - low to medium priority tracking of certain parameters and their values on computing systems and network devices. For example, you might want to track configuration changes, system or device reboots, or network interface status or change time. These are things that you may want to know, but which don't necessarily indicate an immediate problem and which may not be worth waking anyone up to investigate.
Thresh was created to provide this kind of monitoring. It tracks SNMP variables, compares them to threshold, pre-set, or last-observed values, and reports (typically via email) unexpected changes or out of range values.
I've become convinced that, ugly as it may sometimes seem, the Simple Network Management Protocol (SNMP) [1] should be the basis for just about every monitoring system. Virtually every network connected device either comes with an SNMP agent or can run one with only a modest amount of effort. Most SNMP agents can provide a vast amount of (usually) useful information, and most agents for general purpose computers can be extended to provide just about any data that you might want to have.
Some monitoring systems (such as Big Brother and Spong) rely, to a greater or lesser extent, on separate client agents, running on each system that needs to be monitored, with a system-specific reporting protocol between the agent and the management station. This approach can be somewhat limiting (it's hard to use on things like networking equipment for example), can result in some duplication of services (if you need an SNMP agent for other purposes), and limits both what you can do (they're often not extensible), and where (as your firewall may not be able to pass the particular protocol implemented by the software). Thresh avoids these kinds of problems by using only SNMP for communication. [Note 1]
It's often useful to be able to track certain SNMP variables, but you don't always need to know about changes immediately - it's often good enough to hear about them the next time you read your mail. For example, the system.sysUpTime.0 SNMP variable gets reset every time a system or device gets rebooted - if you get notified every time that variable resets, you'll know if you've got a device reliability problem (or an extension cord that people keep tripping over). Similarly, you can generate disk capacity threshold warnings, network bandwidth warnings, network interface up/down notices, and so on.
With many monitoring systems, this kind of low-priority information can be hard to generate. Many of the most common freely available packages tend to have only a small number of notification or alert mechanisms, and are built with the idea that you're monitoring vital services and networks - sometimes everything is assumed to be an emergency.
Thresh was created to address this kind of medium-level monitoring need.
There are a number of commercial monitoring systems available. Some, such as Spectrum [9], HP OpenView [10], and NetCool [11], are widely deployed and very well respected, and most of them can do (or can be made to do) most or all of what thresh does. However, the commercial packages tend to require a much larger monitoring infrastructure, and a much larger commitment of time and money to implement, operate, and maintain. For small to medium sized sites, small, simple monitoring tools, like thresh, are often the best choice.
Thresh was implemented in Tcl [3], using the Tnm network management extensions provided by the Scotty/Tkined [2] software. The Tnm extensions are a toolkit of Tcl procedures that make it easy to perform SNMP operations.
Tcl was chosen because it is well suited to this kind of task, and the Tnm extensions provided just the right functionality. At the time thresh was first contemplated, the SNMP modules for Perl were relatively primitive when compared to Tcl and Tnm.
Thresh has benefitted from the use of a scripting language, and the string and array manipulation routines provided by Tcl. Tcl allows the use of an interactive ``shell'' for testing and development. Being a traditionalist, I tended to develop incrementally, using the well-known edit, run, repeat cycle, rather than a more ``modern'' approach to program development. Thresh is currently about 700 lines of Tcl.
In retrospect, I'm still glad to have chosen Tcl in preference to Perl, C, awk, or Visual Basic.
One of the design and implementation goals for thresh was simplicity. That goal has been addressed in the following ways:
Thresh's configuration is ``data-directed'' [Note 3] - it is configured using a hierarchy of directories and configuration files that is intended to reflect organizational structure and DNS naming conventions.
The default configuration assumption is that each directory in the
configuration hierarchy represents an element of the DNS name of the
devices being monitored. For example, the sub-directory named
com/whizbang/admin/printer1 would usually contain the thresh
configuration for the device with the DNS name
printer1.admin.whizbang.com. The name configuration directive
makes it easy to override this default behaviour, by setting the DNS
domain or node name associated with a particular directory.
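The default mapping from directory to DNS name can be sketched in a few lines of Python (thresh itself is written in Tcl). The function name dns_name_for and its base_name parameter are illustrative only and are not part of thresh; base_name stands in for a value set with the name directive at the starting directory.

```python
from pathlib import PurePosixPath


def dns_name_for(relative_dir: str, base_name: str = "") -> str:
    """Derive a DNS name from a directory path below the top of the hierarchy.

    Each path component becomes one DNS label, with deeper directories
    placed further to the left. base_name is the name in effect at the
    starting directory (empty by default).
    """
    name = base_name
    for label in PurePosixPath(relative_dir).parts:
        name = f"{label}.{name}" if name else label
    return name


print(dns_name_for("com/whizbang/admin/printer1"))
# The same device, reached from a directory whose name was set explicitly.
print(dns_name_for("admin/printer1", base_name="whizbang.com"))
```

Both calls yield printer1.admin.whizbang.com; the second shows how an overridden name is extended in the same way as the default one.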
Configuration Variables
Each directory may contain a DEFAULTS file, which sets the
various configuration variables (such as name,
notifier, delay, community, etc.) which
control thresh's behaviour. Thresh configuration variable names are
case sensitive, and all include only lower case letters. Settings in a
DEFAULTS file are in effect for that node and those lower in
the hierarchy, unless overridden by a lower DEFAULTS file. [Note 4]
Thresh configuration variables can also be set on the command line, in
which case they override any other settings for the given variables.
Figure 1 shows a sample DEFAULTS file.
verbose = true
name = mydomain.net
community = hello
mib = /usr/local/mibs/ascend.mib
# big network, long timeout
timeout = 20
notifier = threshmail jsellens
syslog = local1.info
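The precedence rules for configuration variables can be illustrated with a short Python sketch. This is not thresh's Tcl code: the helper effective_settings is hypothetical, and the DEFAULTS files are assumed to have been parsed into dictionaries already. The built-in values shown for community and timeout are the defaults listed in threshvars (5).

```python
from collections import ChainMap


def effective_settings(builtin, defaults_by_level, command_line):
    """Combine settings so that more specific sources win.

    defaults_by_level lists the parsed DEFAULTS settings from the top of
    the hierarchy down to the current directory.
    """
    # ChainMap searches its maps left to right, so the highest priority
    # source goes first: command line, then the deepest DEFAULTS file.
    layers = [command_line] + list(reversed(defaults_by_level)) + [builtin]
    return dict(ChainMap(*layers))


builtin = {"community": "public", "timeout": "10"}
top = {"community": "hello", "timeout": "20", "notifier": "threshmail jsellens"}
lower = {"timeout": "5"}  # hypothetical DEFAULTS file further down the tree
settings = effective_settings(builtin, [top, lower], {"community": "secret"})
print(settings["community"], settings["timeout"], settings["notifier"])
```

Here the command line supplies community, the lower DEFAULTS file supplies timeout, and notifier is inherited from the top of the hierarchy.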
Thresh's configuration variables include:
Setting ignore = *, for example, ends up ignoring the data hierarchy below a given point.
All other files in a directory are expected to contain a list of SNMP variables to monitor for that particular device, with comparison indicators and expected or threshold values. Thresh currently lacks a file inclusion mechanism, but the use of symbolic links makes it easier to manage the configurations for multiple, similar devices. A sample configuration file is shown in Figure 2.
# this is a comment
S system.sysDescr.0
S system.sysContact.0
I system.sysUpTime.0
C interfaces.ifTable.ifEntry.ifDescr.5
C interfaces.ifTable.ifEntry.ifAdminStatus.5
C interfaces.ifTable.ifEntry.ifOperStatus.5
G ucdavis.memory.memTotalReal.0 90000
G enterprises.ucdavis.memory.memAvailReal.0 3000
L loadTable.laEntry.laLoad.1 1.20
L loadTable.laEntry.laLoad.2 1.50
L loadTable.laEntry.laLoad.3 2.00
V snmp.snmpInPkts.0
The SNMP variable names used in a configuration file can be any string that Scotty will recognize as a particular MIB variable. They can be fully-qualified names, such as iso.org.dod.internet.mgmt.mib-2.system.sysUpTime.0, unique substrings with common prefix elements removed, as in system.sysUpTime.0, or SNMP object identifiers (OIDs), such as 1.3.6.1.2.1.1.3.0. OIDs aren't always the best choice, as they are typically somewhat more cryptic for the casual reader.
The first letter on each configuration line, C, G,
I, L, S, or V, indicates the
comparison to be made:
The final field on some lines is the threshold value to compare against - a threshold value is required for G and L, optional for S, and not allowed for C, I and V. If a value for an S comparison is not provided in a configuration file, the first value for that SNMP variable retrieved from the device is saved and used as the ``normal'' value for the variable.
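The rules for which comparison types take a value can be expressed as a small line parser. The following Python sketch is only an illustration: parse_data_line is a hypothetical name, and raising an error on a malformed line is my assumption, since how thresh itself reacts to such lines is not described here.

```python
VALUE_REQUIRED = {"G", "L"}
VALUE_OPTIONAL = {"S"}
VALUE_FORBIDDEN = {"C", "I", "V"}
KNOWN_TYPES = VALUE_REQUIRED | VALUE_OPTIONAL | VALUE_FORBIDDEN


def parse_data_line(line):
    """Return (type, variable, value), or None for blank and comment lines."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    # Split into at most three fields so a value may contain blanks.
    fields = text.split(None, 2)
    if len(fields) < 2:
        raise ValueError(f"missing SNMP variable: {line!r}")
    kind, variable = fields[0], fields[1]
    value = fields[2] if len(fields) == 3 else None
    if kind not in KNOWN_TYPES:
        raise ValueError(f"unknown comparison type {kind!r}")
    if kind in VALUE_REQUIRED and value is None:
        raise ValueError(f"{kind} requires a threshold value")
    if kind in VALUE_FORBIDDEN and value is not None:
        raise ValueError(f"{kind} does not take a value")
    return kind, variable, value


print(parse_data_line("G ucdavis.memory.memTotalReal.0 90000"))
print(parse_data_line("S system.sysContact.0"))
```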
The configuration mechanism has proven to be quite flexible and easy enough to deal with, though some form of file inclusion mechanism would make some configurations simpler to create and maintain.
Thresh provides a flexible mechanism for notifications. For each device described in the configuration hierarchy, if thresh determines that something needs to be reported, it formats a message and pipes it into whatever ``notifier'' program has been specified by the configuration variables. Thresh also keeps a copy of the message for internal reference when it is next run. If thresh observes the same problems (i.e., an identical notification message) the next time it is run, it will only send another notification if the specified frequency time has passed. If it sees a different set of problems when next run, it will report the complete current set of problems, regardless of whether or not the normal delay time has passed.
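The decision about whether to send a notification can be summarized in a few lines. This Python sketch restates the logic described above and is not taken from thresh's Tcl source; the function and parameter names are hypothetical, and the default of 30 minutes is the documented default of the frequency variable.

```python
def should_notify(message, last_message, minutes_since_last, frequency=30):
    """Decide whether a freshly formatted message should be sent.

    message is the current notification text (empty when nothing is wrong),
    last_message is the saved copy of the previous notification, or None.
    """
    if not message:
        return False  # nothing to report
    if message != last_message:
        return True  # a new or changed set of problems is always reported
    # Identical problems: repeat only once the frequency interval has passed.
    return minutes_since_last >= frequency


print(should_notify("ifOperStatus.5 changed", None, 0))
print(should_notify("ifOperStatus.5 changed", "ifOperStatus.5 changed", 10))
```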
If the syslog variable is set, thresh will also generate
syslog messages for SNMP variables that are out of range, in
addition to the normal notifications. Messages formatted for
syslog are deliberately terse, and include the node name,
comparison type, SNMP variable, the normal or threshold value, and the
current value.
Status, History and Logging
For each device in the hierarchy, thresh will create a configuration subdirectory named .thresh, as a convenient location to store the data it generates. Thresh maintains status, history and logging information, both for internal use and to make it possible to review past changes in state. It tracks the previous values of the SNMP variables (in .thresh/status_*), the last notification message sent (in .thresh/last_complaint), the date and time of last contact with the device (in .thresh/last_response), and a log of when variables were found to be outside their ``normal'' ranges (in .thresh/log_*).
The status and log files are named for the configuration files that they are related to, with a prefix of status_ or log_ added to the configuration file's name.
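As an illustration of the naming scheme, the following Python sketch computes where the state for a given configuration file would live. The function state_paths and the configuration file name system are hypothetical; only the .thresh directory and the status_, log_, last_complaint, and last_response names come from the description above.

```python
from pathlib import PurePosixPath


def state_paths(config_file):
    """Return the state files associated with one configuration file."""
    config = PurePosixPath(config_file)
    state_dir = config.parent / ".thresh"
    return {
        "status": state_dir / f"status_{config.name}",
        "log": state_dir / f"log_{config.name}",
        # These two are kept once per device rather than per file.
        "last_complaint": state_dir / "last_complaint",
        "last_response": state_dir / "last_response",
    }


for label, path in state_paths("com/whizbang/admin/printer1/system").items():
    print(label, path)
```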
Currently, there are no really interesting ways to access and use this data - you're more or less stuck with using some paginator program to view the files. I should note that there is no built-in mechanism for log file rotation.
Thresh can be used effectively as a standalone, isolated
monitoring tool, but it can also be integrated with other logging or
reporting systems. Thresh's core functionality is polling SNMP
variables, comparing against pre-determined thresholds, and generating
messages for distribution. Integration with other tools can be
accomplished in two ways:
Syslog
Thresh can be configured to record out of range values to
syslog, which provides an easy interface to any existing
syslog watcher.
Custom Notifiers
By setting the notifier configuration
variable, thresh's alert messages can be trivially piped through
arbitrary custom processes, that can record, mail, or dispatch as
appropriate.
The message format is fairly consistent, simple, and relatively
easy to parse. Additionally, the internal thresh code which
actually generates the messages would be easy to change if a
specific output format was required. [Note 5]
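A custom notifier only needs to read a message from its standard input. The Python sketch below shows one possible shape for such a program, treating the message as opaque text because the exact message format is not reproduced here. The name record_notification and the timestamp separator line are my own choices for illustration.

```python
import io


def record_notification(source, sink, stamp):
    """Copy one notification from source to sink, preceded by a timestamp line."""
    body = source.read()
    if not body.strip():
        return False  # nothing was piped in
    sink.write(f"--- {stamp}\n{body}")
    if not body.endswith("\n"):
        sink.write("\n")
    return True


# Demonstration with in-memory streams. A real notifier script would pass
# sys.stdin and a log file opened for appending instead.
incoming = io.StringIO("example notification text\n")
archive = io.StringIO()
record_notification(incoming, archive, "2000-12-03T10:15:00")
print(archive.getvalue(), end="")
```

A script built around this function could be named in the notifier variable in place of threshmail, or could forward the text to a mailer after recording it.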
Scalability
Thresh is not arbitrarily scalable to huge numbers of devices and variables being monitored. Beyond a certain point, you will start to run into problems such as:
Installation of thresh is very straightforward.
thresh topdir=/path/to/data
The last step is admittedly somewhat more complicated and time-consuming than the others, but that is pretty much unavoidable. The distribution will include some sample configurations.
There are a number of enhancements to thresh that should probably be made:
Thresh seems to serve to illustrate a few useful lessons:
Some of these also illustrate problems, the most obvious being that you probably need more than one monitoring tool, since no one tool is likely to do everything that you need done.
Generally, thresh has proven useful and fits nicely into an effective monitoring toolkit.
Thresh is ``freely'' available through https://thresh.sourceforge.net/ or https://www.generalconcepts.com/resources/.
John Sellens is the General Manager for Certainty Solutions in Canada, based in Toronto. (Certainty Solutions was previously known as GNAC - Global Networking and Computing.) Prior to joining Certainty Solutions, he was Director of Network Engineering at UUNET Canada, and was a system administrator at the University of Waterloo for 11 years. He has a master's degree in Computer Science from the University of Waterloo, is a Chartered Accountant, and is a semi-regular contributor to ;login:. John, Joanne, and their delightful children live in Unionville, Ontario. Contact him at jsellens@certaintysolutions.com.
[1] J. Case, M. Fedor, M. Schoffstall, and J. Davin, A Simple
Network Management Protocol (SNMP), Network Working Group, May
1990. RFC 1157, STD 15
[2] J. Schönwälder and H. Langendörfer, ``Tcl Extensions
for Network Management Applications,'' in Tcl/Tk Workshop,
pp. 279-288, USENIX and Unisys, Inc., Toronto, Canada, July 6-8,
1995.
[3] John K. Ousterhout, ``Tcl: An Embeddable Command Language,'' in
USENIX Conference Proceedings, pp. 133-146, USENIX,
Washington, D.C., January 22-26, 1990.
[4] Sean MacGuire and Robert-Andre Croteau, Big Brother FAQ.
https://www.bb4.com/
[5] Jeff R. Allen, ``Driving by the Rear-View Mirror: Managing a
Network with Cricket,'' in First Conference on Network
Administration (NETA '99), pp. 1-10, USENIX, Santa Clara,
California, April 7-10, 1999.
[6] Vikas Aggarwal, NOCOL - Network Operation Center On-Line.
https://www.netplex-tech.com/software/nocol/
[7] Stephen L. Johnson, Spong - Systems and Network Monitoring.
https://spong.sourceforge.net/
[8] Tobias Oetiker, ``MRTG - The Multi Router Traffic Grapher,'' in
Twelfth Systems Administration Conference (LISA '98), p.
141, USENIX, Boston, Massachusetts, December 6-11, 1998.
[9] Aprisma Technologies Inc., Spectrum Network Monitoring and
Management System. https://www.aprisma.com/
[10] Hewlett-Packard Company, OpenView Monitoring and Management
Software. https://www.openview.hp.com/
[11] Micromuse Inc., Netcool Monitoring and Reporting Suite.
https://www.micromuse.com/
[12] University of California, Davis, UCD-SNMP distribution.
https://ucd-snmp.ucdavis.edu/
NAME
thresh - a data-directed SNMP threshold
poller
SYNOPSIS
thresh
[ varname=value ... ]
DESCRIPTION
thresh
is a data-directed SNMP threshold poller, and uses the file
system for configuration, status, and logging. Each host or
device to be monitored is configured in a separate directory,
using files listing SNMP variables, values, and a comparison
indicator.
In normal operation, thresh starts scanning a data
hierarchy (as described in threshdata (5)) at a particular
directory (set by the topdir variable), reading
DEFAULTS files, variable files, querying hosts and
devices, and recording and reporting the results.
thresh variables, as described in threshvars
(5), and set in DEFAULTS files or on the command line,
change thresh's default behaviour and notification
mechanisms. Any varname=value settings on the command
line override both the built-in defaults and the settings in any
DEFAULTS files encountered during processing.
thresh would typically be called periodically by
cron (8).
EXAMPLES
In normal use:
% thresh
To use a non-default start directory:
% thresh topdir=/some/other/place
To traverse the data hierarchy and provide information on what would
normally be queried:
% thresh walkonly=true
To do almost nothing:
% thresh 'ignore=*'
NOTE
thresh
is written in Tcl (n), using scotty (1) and the
Tnm (n) network management extensions.
FILES
thresh
uses just about any file and directory names. The name
.thresh is reserved for naming the subdirectories used by
thresh to store status and logging information.
Any files matching .thresh/log_* are log files, which will
grow without bound, and which you should arrange to rotate,
archive, or truncate periodically.
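One way to keep the log files in check is to copy large ones elsewhere and then truncate them in place. The Python sketch below is an assumption-laden example, not part of the thresh distribution: archive_logs, the size limit, and the date stamp suffix are all hypothetical, and the archive directory is assumed to lie outside the data hierarchy. It should be run when thresh itself is not running.

```python
import shutil
import tempfile
from pathlib import Path


def archive_logs(topdir, archive_dir, max_bytes, stamp):
    """Copy oversized .thresh/log_* files into archive_dir, then truncate them.

    Returns the list of archive copies that were written.
    """
    top = Path(topdir)
    archived = []
    for state_dir in sorted(top.rglob(".thresh")):
        for log in sorted(state_dir.glob("log_*")):
            if not log.is_file() or log.stat().st_size <= max_bytes:
                continue
            # Mirror the hierarchy so logs for different devices cannot collide.
            target = Path(archive_dir) / log.relative_to(top).parent / f"{log.name}.{stamp}"
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(log, target)
            log.write_text("")  # truncate the live log in place
            archived.append(target)
    return archived


# Demonstration against a throwaway hierarchy.
with tempfile.TemporaryDirectory() as top, tempfile.TemporaryDirectory() as archive:
    state = Path(top, "com", "whizbang", "printer1", ".thresh")
    state.mkdir(parents=True)
    (state / "log_system").write_text("x" * 2048)
    (state / "log_small").write_text("x")
    copies = archive_logs(top, archive, max_bytes=1024, stamp="20001203")
    archived_names = [copy.name for copy in copies]
    remaining_size = (state / "log_system").stat().st_size
    print(archived_names, remaining_size)
```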
BUGS
thresh
currently only works with SNMP V1. The logging to syslog
(3) should be internalized in some way, and not depend on
logger (1). There should be some mechanism for
"including" one file from another, to reduce the dependence on
symbolic links for sharing files. thresh is unlikely
to scale to handle arbitrarily large networks.
SEE ALSO
threshdata (5)
threshvars (5)
scotty (1)
Tcl (n)
Tnm (n)
AUTHOR
John Sellens
NAME
threshdata - thresh data hierarchy
description
DESCRIPTION
The thresh(1) SNMP poller uses a
configuration hierarchy to direct its actions, maintain its
status information, and store its logs.
Each directory in the configuration hierarchy (under the
topdir directory) is assumed to relate to a network host
or device, or to an intermediate name in a DNS naming hierarchy.
NAMING
By default, the topdir directory is
assumed to refer to a device named "" - the empty string. Each
directory below topdir normally adds one more element on
the left hand end of a DNS name. For example, below
topdir , the directory org/usenix/conference is related to
the DNS sub-domain "conference.usenix.org", and the directory
org/usenix/conference/ts1 is related to the device
"ts1.conference.usenix.org". This naming relation can be
overridden by the use of the name variable.
FILES
Each directory may contain a DEFAULTS
file, which contains variable settings (see threshvars
(5)) that apply to that directory, and to all directories below
that point, unless overridden on the command line or by other,
lower, DEFAULTS files.
Any other files found in a directory (other than those ignored by
the baseignore and ignore variables) are assumed to
contain a list of SNMP variables to monitor. thresh
uses sub-directories named .thresh to store status and log
information.
thresh data files consist of zero or more lines in
the following format:
<ws>type<ws>variable-or-OID<ws>value<ws>
<ws># comment ...
<ws>
where "<ws>" indicates white space.
The data file elements are defined as follows:
type A single capital letter indicating the comparison to be
made in determining "normal".
C Changeable - the variable's value may change, but
should be reported each time it changes. This is
useful for semi-static data, or for monitoring
things such as device interface status changes.
G Greater than - the variable is reported if its
current value is less than or equal to the
specified value.
I Increasing - the variable is reported if its
current value is less than its previous value.
This is handy for watching for reset times, such
as the "system.sysUpTime.0" variable resetting
when a device reboots.
L Less than - the variable is reported if its
current value is greater than or equal to the
specified value.
S Static - the variable is reported if its current
value is not equal to the specified value. If no
value is specified, then it is compared against
the first-retrieved value of the variable. This
is useful for monitoring things that should never
change, such as "system.sysName.0".
V Variable - the value can be anything, but it is
queried and tracked to allow for later
investigation or review.
variable-or-OID
An SNMP variable name (or name fragment that
scotty (1) can interpret) or SNMP OID (e.g.
1.3.6.1.2.1.1.3.0) to be monitored.
value
A value for the comparison. Required for G and L,
optional for S, and not allowed for C, I, and V.
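The comparison types can be summarized as a single decision function. This Python sketch is an illustration only and does not reproduce thresh's Tcl implementation: the name out_of_spec is hypothetical, values are assumed to arrive as strings, and G, I, and L values are assumed to be plain numbers. The saved argument is the stored value for the variable: the previous value for C and I, the first-retrieved value for S.

```python
def out_of_spec(kind, current, threshold=None, saved=None):
    """Return True when a value should be reported for the given type."""
    if kind == "C":  # changeable: report each change
        return saved is not None and current != saved
    if kind == "G":  # expected to stay greater than the threshold
        return float(current) <= float(threshold)
    if kind == "I":  # expected never to decrease
        return saved is not None and float(current) < float(saved)
    if kind == "L":  # expected to stay less than the threshold
        return float(current) >= float(threshold)
    if kind == "S":  # static: compare with the given or first-seen value
        expected = threshold if threshold is not None else saved
        return expected is not None and current != expected
    if kind == "V":  # tracked only, never reported
        return False
    raise ValueError(f"unknown comparison type {kind!r}")


# Available memory has dropped below a threshold of 3000.
print(out_of_spec("G", "2500", threshold="3000"))
# An uptime counter that has gone backwards suggests a reboot.
print(out_of_spec("I", "120", saved="86400"))
```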
SEE ALSO
thresh (1)
threshvars (5)
AUTHOR
John Sellens
NAME
threshmail - mail notifier for thresh
messages
SYNOPSIS
threshmail
recipient ...
DESCRIPTION
threshmail
expects notification messages from thresh (1) on its
standard input, which it appropriately reformats into a mail
message, and sends to every recipient given on the command line.
threshmail would typically be set as the
notifier in a thresh DEFAULTS file.
AUTHOR
John Sellens
NAME
threshvars - configuration
variables understood by thresh
DESCRIPTION
thresh (1) understands and
observes certain configuration variables. Those
variables can be provided on the command line or
in files named DEFAULTS within the
thresh data hierarchy.
Variable names are case sensitive and are expected to be in lower case letters.
DEFAULTS FILES
DEFAULTS
files consist of zero or more lines in the
following format:
<ws>VARNAME<ws>=<ws>VALUE<ws>
<ws># comment ...
<ws>
where "<ws>" indicates white space. Values can contain embedded blanks.
Variables set by a DEFAULTS file apply at that level of the data hierarchy and below, unless overridden on the command line or in a DEFAULTS file further down the tree.
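A reader for this file format is short. The following Python sketch is an illustration, not the parser used by thresh: parse_defaults is a hypothetical name, all values are kept as strings, and rejecting a line without an equals sign is my assumption.

```python
def parse_defaults(text):
    """Parse the contents of a DEFAULTS file into a dictionary of strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        name, sep, value = line.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected VARNAME = VALUE: {raw!r}")
        # Only the surrounding white space is removed, so embedded
        # blanks in the value are preserved.
        settings[name.strip()] = value.strip()
    return settings


sample = """\
verbose = true
name = mydomain.net
# big network, long timeout
timeout = 20
notifier = threshmail jsellens
"""
print(parse_defaults(sample))
```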
GENERAL VARIABLES
debug Generate debugging output.
Boolean. Default: true
verbose Generate informational
messages.
Boolean. Default: true
walkonly Walk the data tree,
describing the hierarchy, but not querying,
reporting, or logging.
Boolean. Default:
false
topdir The top of the data
hierarchy.
Default: /usr/local/thresh
name The DNS name or partial name
of the device or hierarchy represented by the
current directory in the data hierarchy. Gets
extended by the name of each directory as
thresh descends down the hierarchy,
but can be overridden in a DEFAULTS
file.
Default:
baseignore
The base list of "glob" patterns of file and
directory names to ignore in the data hierarchy.
If you override this variable, make sure that you
end up ignoring the .thresh status
directories.
Default: . .. .* CVS RCS README
README.* DEFAULTS core *.core
ignore The extended list of
"glob" patterns of file and directory names to
ignore in the data hierarchy. Having two
variables makes it easy to augment the default
list of names to ignore without overriding the
base list. Note that setting
ignore = *
will cause the hierarchy
rooted at that location to be excluded from all
processing.
Default:
prune
If true, do not process further down this
hierarchy if the current node is unreachable.
This is essentially the equivalent of setting
ignore = *
if the current node is
unreachable. This is useful, for example, for
limiting the error messages that are generated if
a gateway router is unreachable.
Boolean.
Default: false
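The interaction of baseignore and ignore can be sketched with Python's fnmatch module. This is an approximation for illustration: thresh uses Tcl's own glob matching, which may differ in detail, and the function is_ignored is hypothetical. The base list is the documented default.

```python
from fnmatch import fnmatchcase

BASEIGNORE = ". .. .* CVS RCS README README.* DEFAULTS core *.core"


def is_ignored(filename, baseignore=BASEIGNORE, ignore=""):
    """Return True if filename matches any pattern in either list."""
    patterns = baseignore.split() + ignore.split()
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


print(is_ignored(".thresh"))               # hidden names match ".*"
print(is_ignored("printer1"))              # an ordinary device directory
print(is_ignored("printer1", ignore="*"))  # "ignore = *" excludes everything
```

Because the two lists are simply concatenated, setting ignore never weakens the base list, which is the reason for having two variables.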
SNMP VARIABLES
community The SNMP V1 read
community string to use to query hosts and
devices.
Default: public
mib Specifies a file name that
contains
an SNMP MIB that will immediately be read and
compiled into the running program.
Default:
timeout How long to wait for a
response to an SNMP get request, in seconds.
Default: 10
retries Number of
times to retransmit an SNMP get request during the
timeout interval.
Default: 5
NOTIFICATION VARIABLES
notifier Pipe notification
messages to this program, often a mailer or mail
interface.
Default: /bin/cat
frequency Minimum time before
sending another identical notification message, in
minutes.
Default: 30
describe Whether or not to
include a variable's MIB description field in
notification messages.
Boolean. Default:
true
msgformat A printf-style
format string used to print notification messages,
with the (cryptically named) variables smnpvar,
message, complabel, compval, newlabel, newval,
desc. This could use a little more
sophistication.
Default: %s: %s\n %s %s\n
%s %s%s\n
LOGGING VARIABLES
log Write out of spec entries to
a log file.
Boolean. Default: true
logger A command like
logger (1) that writes to syslog
(3).
Default: /usr/bin/logger
syslog A syslog facility.level
pair, as accepted by the logger (1)
command, such as "local1.info". If set, out of
spec entries are piped to the logger
command.
Default:
syslogtag
Tag to use on syslog'd entries.
Default:
thresh
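The way these three variables might combine into a command can be sketched as follows. This Python example is an assumption about the invocation, not a transcription of thresh's Tcl code: it supposes that the facility.level pair is passed with the -p option and the tag with the -t option of logger (1), with the entry itself supplied on standard input. The function names are hypothetical.

```python
import shlex
import subprocess


def logger_argv(syslog, logger="/usr/bin/logger", tag="thresh"):
    """Build the argument list for the logging command."""
    # shlex.split allows the logger setting to carry its own arguments.
    return shlex.split(logger) + ["-t", tag, "-p", syslog]


def log_entry(entry, syslog, **options):
    """Pipe one out of spec entry to the logging command."""
    subprocess.run(logger_argv(syslog, **options), input=entry, text=True, check=True)


print(logger_argv("local1.info"))
```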
SEE ALSO
thresh
(1)
threshdata (5)
AUTHOR
John Sellens