################################################ # # # ## ## ###### ####### ## ## ## ## ## # # ## ## ## ## ## ### ## ## ## ## # # ## ## ## ## #### ## ## ## ## # # ## ## ###### ###### ## ## ## ## ### # # ## ## ## ## ## #### ## ## ## # # ## ## ## ## ## ## ### ## ## ## # # ####### ###### ####### ## ## ## ## ## # # # ################################################ The following paper was originally published in the Proceedings of the Tenth USENIX System Administration Conference Chicago, IL, USA, Sept. 29 - Oct. 4,1996. For more information about USENIX Association contact: 1. Phone: (510) 528-8649 2. FAX: (510) 548-5738 3. Email: office@usenix.org 4. WWW URL: https://www.usenix.org Automatic and Reliable Elimination of E-mail Loops Based on Statistical Analysis E. Solana, V. Baggiolini, M. Ramluckun and J. Harms - University of Geneva (Switzerland) ABSTRACT One of the major threats to the E-mail service are message loops, mail messages which endlessly circulate between a number of E-mail relays and severely deteriorate or even interrupt the messaging service. Although the standards ruling E-mail communications include mechanisms to prevent message loops, these mechanisms are often unimplemented and, if they are, can fail to work reliably under certain circumstances. This paper describes a novel approach to the automatic and reliable elimination of looping messages. Statistical properties of the message traffic are analyzed to detect messages that might be looping. Such messages are earmarked, and if they are observed to actually be looping, they are eliminated. This method has been implemented and successfully applied in the multi-protocol E-mail environment of the University of Geneva. Problem description A message is said to be looping if it is forwarded endlessly in a circu- lar manner among a set of mail relays (Message Transfer Agents or MTAs). Nor- mally, two or three mail relays take part in a message loop, but there can be more than that. The period of a loop (the time it takes to complete) ranges from a few seconds to a few minutes. Looping messages usually grow in size, since mail relays typically add header lines to messages they process. A message that loops during a weekend can be forwarded tens of thousands of times, growing in size to megabytes and producing megabytes of logging data. The situation is even worse if distribution lists are involved (c.f. [1]): the number of messages explodes and soon the mail system crashes under the load. These scenarios illustrate how important it is to detect and elimi- nate looping messages automatically. Loops are caused by wrong addressing configuration on one or more sys- tems: o bad configuration of mail relays, e.g., a relay that does not ``recog- nize'' messages addressed to it and, instead of keeping them for local delivery, forwards them to a central mail relay which sends them back again[1]. ---------------- [1]This is the behavior of many pre-installed sendmails. o automatic forward mechanisms such as aliases, distribution lists, or per- sonal forward mechanisms (.forward files) pointing to invalid addresses. o automatic reply mechanisms, such as vacation programs or non-delivery reports. The mechanisms that are directly under control of a proficient adminis- trator account for only a small part of the errors. The problem lies much more in the forwarding facilities that are under user control, such as ~/.forward files and vacation programs or even mail relays installed without the adminis- trator knowing. Furthermore, even if all the mail relays in a given responsi- bility domain are perfectly configured, a mail transaction with a misconfig- ured or non-compliant site may result in a message loop. Standard Mechanisms For Loop Detection There are two typical countermeasures that are integrated into the mail relays or taken by postmasters: o Counting the Receivedfrom lines in the message header. RFC 821 [2] says every mail relay processing a message must add such a line to the message header and count the number of Receivedfrom lines already contained. When the header encloses more than a pre-defined number (e.g., 25) of such lines, the message is considered to be looping and automatically sent back to the expeditor together with a non-delivery report. Fixing such a limit is quite a delicate task (especially due to the growing complexity of the Internet). o Tracing MessageIDs. MessageIDs allow to uniquely identify E-mail mes- sages. Postmasters can inspect the log-files and check if the same Message ID recurs multiple times. If this is the case, the corresponding message is considered to be looping. Both of these approaches work well if there are only mail relays of one protocol involved. However, they can fail in a multi-protocol environment (e.g., SMTP [2, 3] together with X.400 [4][2]) ---------------- [1]This is the behavior of many pre-installed sendmails. ---------------- [2]Although the standard for protocol conversion (RFC-1327 [5]) de- scribes how trace fields should be converted, a number of mail relay prod- ucts are not compliant to these recommendations. or in case of faulty config- urations or non-compliance to the protocol recommendations. When a message is translated into a different E-mail protocol by a gateway, Receivedfrom lines can get lost; some gateways can even be explicitly configured to eliminate them. The same problem exists for Message IDs: some mail relays change the old Message ID and assign a new one. Furthermore, both methods fail if a new message is generated in the looping process (e.g., vacation or non-delivery reports). Related work Most existing solutions rely on the mechanisms described above and are normally included in mail relay implementations. A recent work being conducted by D. Bernstein [7] proposes a set of recommendations for so-called automail- ers (a generic term designating mail relays, list exploders, vacation pro- grams, etc.). Due to the great number and diversity of existent software implementations, compliance to these recommendations will not be achieved in a foreseeable future (if ever...) and the need of external mechanisms, as the one proposed in this paper, will persist. To our knowledge, there is no independent, commonly agreed-on solution for preventing loops between E-mail systems. The approach adopted by some administrators consists in periodically observing the message traffic for a given period of time in order to find potential loops. Unfortunately, the most common practice consists in reacting ``a posteri- ori'', alerted by the unpleasant effects of a loop in the E-mail system. A more scientific approach to solve the problem of looping messages has been studied by J. F. Paccini in his Ph.D. thesis [8]. He proposes a method to identify a loop by tracing suspect messages amongst the set of involved mail relays, the tracing information being shared by each mail relay by means of a standardized distributed database (X.500, [9]). The two major drawbacks of this approach are the significant management overhead introduced by a dis- tributed database and the security issues related to the sharing of sensitive information (message trace statistics) across administrative domain borders. Our Approach The proposed method is based on a statistical analysis of the messages recently exchanged by a single mail relay. As will be explained, message loops have very particular statistical properties that are visible at a single observation point. We propose to exploit these properties in order to detect suspect messages without having to consider management information from other mail relays. Since message loops occur occasionally, we have chosen to trigger the detection mechanism upon specific conditions directly related to loop prob- lems: abnormal message rate, disk, and/or CPU usage. E-mail related information can be obtained by periodically querying the mail relay console or by means of an SNMP agent that provides monitoring data in the form of a MIB (c.f. [6]). Information concerning system resources can be retrieved by means of standard UNIX commands. The process of loop detection and suppression is composed of two phases: the first focuses on the detection of suspect messages based on the logging information of recent message exchanges and the second aims at the accurate identification and elimination of messages that do actually loop. Detection Of Suspect Messages The aim of this process is to identify those messages that might be loop- ing. To achieve this, we build a statistical database based on the historical information (log-files) generated by the mail relay where the observation point is located. Note that the necessary data is contained in the log-files of about any existing E-mail implementation: o The Sender and Receiver addresses. Since Message ID information is not always reliable, the only invariant of a looping message observed from one specific point are the sender and receiver addresses of the envelope. Fig- ure 3 shows that this assertion does not hold when the observation point changes (address rewritten by involved mail relays). This information, that we denote SR pair, constitutes the search key of our statistical database. o The number of messages exchanged by this SR pair. o Absolute time of last message exchange. This value is used to determine whether the loop is still active or not. o Time lapses between occurrences of the same SR pair: A set of integer val- ues containing the time intervals in seconds between messages with identi- cal SR pairs. Based on this, we pick the most often recurring SR pairs for further analysis. Generally, this is a threshold-based decision which depends on site characteristics and the extent of historical information being considered. If the loop has been running for some time, the number of messages exchanged might be sufficient to identify a potential loop[3]. ---------------- [1]This is the behavior of many pre-installed sendmails. ---------------- [2]Although the standard for protocol conversion (RFC-1327 [5]) de- scribes how trace fields should be converted, a number of mail relay products are not compliant to these recommendations. ---------------- [3]3000 messages are unlikely to be exchanged by two partners in a single day. However, if the loop has just begun and only a few iterations have occurred, a more elaborate analysis is required to more accurately distinguish looping messages from normal traffic. This is done by performing statistical computations over the time lapses set, namely: o Mean value of time lapses: The arithmetic mean value calculated over the above set of message lapses. This value represents the frequency of arrival of suspect messages. A low mean value implies a high frequency of arrival of messages with identical SR pairs and, thus, a high probability of a loop. ---/*------------------------------------------------------------- Variables for each SR Pair: number_of_messages, time_of_last_message, lapses_variance, lapses_mean_value; Thresholds (depending on site characteristics): minimum_amount_of_messages, timefactor, suspect_var_threshold, suspect_mean_value */ function loop_probability_fct { loop_probability = 0; /* initial value */ /* if the number of messages is under a given threshold don't go any further. (statistical computations are not sufficiently accurate for a small amount of msgs.) */ if (number_of_messages < minimum_amount_of_messages) return(loop_probability); if (((current_time - time_of_last_message) / mean_value) > timefactor) /* if the time interval between the last message and the current time is timefactor times higher than the mean value, we assume that the potential loop has been fixed */ return(loop_probability); /* if the variance is below a given threshold, increase probability */ if (lapses_variance < suspect_var_threshold) loop_probability += 1 - lapses_variance / suspect_var_threshold; /* ditto with the mean value */ if (lapses_mean_value < suspect_mean_value) loop_probability += 1 - lapses_mean_value / suspect_mean_value; if (number_of_messages > suspect_number_of_messages) /* if the number of messages exchanged is abnormally high, increase probability */ loop_probability += number_of_messages / total_number_of_exchanged_messages; if (loop_probability > 1) return (1); else return(loop_probability); } /* loop_probability_fct */ Figure 1: Heuristic Function Computing Loop Probability o Variance of time lapses: The variance of time lapses measures the devia- tion from the mean value. This value is very relevant since looping mes- sages are normally processed at uniform time intervals and, as a conse- quence, display a remarkably low variance. On the contrary, normal E-mail traffic has a very irregular time distribution resulting in very high variance values. Based on this statistical information, we have elaborated a heuristic function for computing the loop probability for each SR pair appearing repeat- edly in the log-files. Figure 1 shows the inner workings of the heuristic function that has been used to detect loops in our site. Firstly, the function tests if the number of messages exchanged is suffi- ciently high to ensure accurate statistical computations. The lapse between the last message and the current time is compared to the mean value of previ- ous transactions lapses in order to detect if the potential loop is still active. If both tests yield positive results, the effective loop probability is computed. Low values for either the variance or the mean value increase loop probability. An abnormally high number of messages exchanged will also augment the result returned by this function. It should be noted that a sus- pect value in any of these parameters will have significant impact in the overall loop probability returned by the function. The following example shows the results of the loop probability computa- tions for a message loop (Figure 2a) and for a regular group of messages exchanged by an SR pair (Figure 2b). In the first case, note both the high frequency (approx. one message every 20 seconds) and the uniform distribution of time lapses between message arrival times. ------------------------------------------------------------------ Figure 3: Detection of Suspect Messages This results in very low mean and variance values and, as a consequence, in a very high (close to 1) loop probability value. In the second example, where normal mail traffic is analyzed, the high variance and mean values yield a remarkably low loop probability. Figure 3 shows a typical loop scenario where a message (addressed to loop@mail[123]) circulates endlessly between three mail relays (for instance, due to an incorrect forward address combination for the loop user). The statistics are retrieved at the observation point number 1. Note that due to the high number of messages exchanged and the low mean and variance values, the loop probability (loopprob) for this message is very high compared to the remaining messages. As a consequence, the SR pair [a@mail3 | loop@ mail1] is considered to be suspect and used as input for the next processing stage. -MESSAGES-FROM-solana@cui.unige.ch--------------------------------------------- TO = loop@bad.domain NUMBER OF MESSAGES EXCHANGED = 75 TIMELAST = 05/04 10:29:55 LAPSES = [17, 17, 32, 18, 17, 18, 31, 33, 18, 19, 17, 17, 17, 34, 17, 23, 20, 22, 17, 18, 17, 20, ...] MEAN VALUE = 20 VARIANCE = 66 LOOP PROBABILITY = 0.97 Figure 2a: Statistical Properties of a Loop ------------------------------------------------------------------------------- MESSAGES FROM solana@cui.unige.ch TO normal@cui.unige.ch NUMBER OF MESSAGES EXCHANGED = 70 TIMELAST = 05/04 10:25:53 LAPSES = [205, 1800, 3357, 264, 45, 188, 256, 313, 294, 141, 107, 135, 161, 689, 1267, 691, 5402, 24640, ...] MEAN VALUE = 1125 VARIANCE = 85443 LOOP PROBABILITY = 0.01 Figure 2b: Statistical Properties of Normal Messages ------------------------------------------------------------------------------- The next phase will permit to accurately determine if the selected SR pairs actually correspond to a message loop. If yes, the necessary corrective actions are carried out. Identification and Elimination of a Message Loop If suspicious SR pairs have been identified, the program will actively wait for the corresponding messages to arrive and carry out the following actions for each of them: o Freeze the suspect incoming message in order to enable write operations on the message header. o Mark the message with an Unique Identification Tag (UIT) using a reserved header record[4] ---------------- [1]This is the behavior of many pre-installed sendmails. ---------------- [2]Although the standard for protocol conversion (RFC-1327 [5]) de- scribes how trace fields should be converted, a number of mail relay products are not compliant to these recommendations. ---------------- [3]3000 messages are unlikely to be exchanged by two partners in a single day. ---------------- [4]The header record should be carefully chosen to allow multi-pro- to- col inter-operability. and a well-defined string (containing, for instance, the mail relay identification plus an exact time value). The UIT of each suspicious SR pair will be recorded for further identifi- cation. o Flush the message, i.e., force the message to leave the mail relay again. o Detect the tagged message. Wait for the tagged message to arrive again. The arrival of a message containing both a suspicious SR pair and an UIT means that the message has been processed twice by the mail relay with identical envelope addresses. This fact is an unequivocal loop symptom. If the envelope addresses were different, the presence of an UIT alone would not constitute a sufficient condition for this assertion since regular messages might be processed twice by a mail relay as a result of a user forward to an external address[5]. ---------------- [1]This is the behavior of many pre-installed sendmails. ---------------- [2]Although the standard for protocol conversion (RFC-1327 [5]) de- scribes how trace fields should be converted, a number of mail relay products are not compliant to these recommendations. ---------------- [3]3000 messages are unlikely to be exchanged by two partners in a single day. ---------------- [4]The header record should be carefully chosen to allow multi-pro- to- col inter-operability. ---------------- [5]As a consequence of the address rewriting inherent to the forward process, the message enters and leaves the mail relay with different re- cipient addresses. o Remove the message loop . The immediate action to eliminate a message loop may consist in ``bouncing'' the message back to the sender with an error diagnostic or to discard the message and generate a report directed to the local E-mail administrator. Further actions must then be under- taken to diagnose and fix the actual reasons (configuration errors, proto- col failures, etc.) of the loop. Implementation Experience The proposed loop detection and removal mechanism has been fully imple- mented in two central mail relays of the operational E-mail service in the University of Geneva: the main gateway between X.400 and SMTP and an SMTP relay containing user mailboxes. Apart from a few message loops that were introduced for testing, both mail relays are subject to normal operating con- ditions. During a short calibration period, parameters of the heuristic function were tuned to the particular traffic. After this period, the system could dis- tinguish with remarkable accuracy looping messages from regular traffic. When combined with the mark and remove mechanism described above, the system auto- matically eliminate all possible message loops. The loop detection script has been implemented in PERL and is currently used by postmasters of the Geneva University network and by SWITCH, the major Swiss Internet service provider for academia. The package is freely available at https:// cuiwww.unige.ch/~solana/loop-watch. tar. This work is part of a larger project[6] ---------------- [1]This is the behavior of many pre-installed sendmails. ---------------- [2]Although the standard for protocol conversion (RFC-1327 [5]) de- scribes how trace fields should be converted, a number of mail relay products are not compliant to these recommendations. ---------------- [3]3000 messages are unlikely to be exchanged by two partners in a single day. ---------------- [4]The header record should be carefully chosen to allow multi-pro- tocol inter-operability. ---------------- [5]As a consequence of the address rewriting inherent to the forward pro- cess, the message enters and leaves the mail relay with different recipient addresses. ---------------- [6]Project funded by the ``Fonds National suisse pour la Recherche Scien- tifique'' (FNRS). that focuses on distributed application management with spe- cial focus on E-mail management. Conclusions During several months of utilization, the loop finder script has automat- ically detected and removed several message loops that in the past would have remained undetected for several days and might have caused serious problems to the E-mail service. This shows that even the analysis of basic statistical parameters is sufficient to distinguish looping messages from ordinary traf- fic. As described in the previous section, the loop detection process obtains the message informations from the log-files of mail relays implementations. A version that processes the information provided by an SMTP sniffer - a program capable of analyzing SMTP transactions obtained directly ``on the wire'' - is being finalized. This will notably enhance the portability (no need to con- sider proprietary log-file formats) and the reliability (no influence from mail relay software errors). Although the main objective of our work consisted in detecting and remov- ing message loops, the accurate statistic information obtained, has brought significant insight into the E-mail traffic conditions of our site. In partic- ular, relying on this information, we have operated important changes in the topology of our campus E-mail architecture resulting in an improved load bal- ancing among message relays. Furthermore, the nature of information provided by this script can be easily exploited for accounting and billing issues. Experience shows that in widely distributed and heterogeneous environ- ments such as electronic mail, the recommendations of standard protocols are often not correctly implemented. Mail loops are a good example of a problem that theoretically should not happen but practically does. Furthermore, the incredible growth of E-mail utilization will certainly aggravate this problem. Techniques as the one described in this paper can detect malfunctions before the quality of service provided to end-users degrades. Author Information Eduardo Solana received his M.Sc. degree in Computer Science in 1991 from the University of Geneva. He has been a member of the IBM technical staff for three years. Currently he is a Ph.D. student in the Teleinformatics and Oper- ating Systems (TIOS) group of the University of Geneva. His interests are in the fields of Computer Communications, Management of Distributed Applications, Network Security and Cryptography. Write to him via e-mail at . Vito Baggiolini is a Ph.D. student at the University of Geneva where he has been working since 1993 on the management of distributed applications. He graduated as an electronic engineer from the Federal Institute of Technology in Zurich in 1989. Between 1990 and 1993 he worked at the European Space Agency in Frascati (Italy) in the field of computer networking. Mira Ramluckun received her M.Sc. degree in Computer Science from the University of Compiegne (Paris). Between 1992 and 1994 she worked in the Lin- guistics Department of the University of Geneva in the field of Natural Lan- guage Processing. Since 1994 she is a Ph.D. student in the Teleinformatics and Operating Systems (TIOS) group of the University of Geneva. Her interests are in the fields of Computer Communications, Database Design and Natural Language Processing. Jurgen Harms. Studies in physics at the Polytechnic School of Vienna (Austria) and in electronic engineering at the University of California (USA). Since 1972 professor of computer science at the University of Geneva where he heads the computer department. President of the SWITCH foundation (Swiss national research network). Fields of interest: Operating Systems, Teleinfor- matics. Bibliography [1] J. Udell. ``E-mail adventures''. BYTE magazine, Volume 21, Number 4. April, 1996. [2] J. Postel. ``Simple Mail Transfer Protocol''. Request for Comments 821. August, 1982. [3] D. Crocker. ``Standard for the format of ARPA Internet text messages''. Request for Comments 822. August, 1982. [4] ITU and CIITT - Data Communications Networks. Message Handling Systems. Recommendations X.400-X.420, 1988. [5] S. Hardcastle-Kille. ``Mapping between X.400 (1988) / ISO 10021 and RFC 822''. Request for Comments 1327. May, 1992. [6] S. Kille, N. Freed: ``Mail Monitoring MIB'', Request for Comments 1566. January 1994. [7] D. Bernstein. Tools in the War on Mail Loops. Internet-draft: draft-bern- stein-mail-loops-war-00. txt. April, 1996. [8] J. F. Paccini. De la gestion de reseaux a la gestion d'applications: le role des donnees historiques. Ph. D. Thesis, Universite de Geneve, 1994. [9] ITU and CIITT - Data Communications Networks. Directory. Recommendations X.500- X.521, 1988.