;login: The Magazine
of USENIX & SAGE

 

system administration research

by Mark Burgess
<Mark.Burgess@iu.hioslo.no>

Mark is an associate professor at Oslo College, the author of cfengine, and winner of the best paper award at LISA 1998.

Part 2: Analytical System Administration

In my previous article (;login:, June 2000) I argued in favor of a more scientific approach to system administration. The key point was that we should be careful in making assertions without having something concrete to back them up. Also, I wanted to encourage more research of a scientific nature for LISA. In this follow-up article I want to look at some more concrete examples of how system administrators can get involved in research, both for the good of the community and as a discipline for making judgments on the job.

Research begins with a question: I wonder if . . . ? Or: Is it true that . . . ? Without such a question, you don't know what you are looking for. The danger of beginning with a question is that you then just set out to prove what you think is true, or disprove what someone else has said. The point of research is to get objective results, or report on ideas that stimulate progress in the field. We also need to take steps to check ourselves.

Everyone would like to think that they have an open mind, but that is not the way science works. When I was a physics student at university in England, I wrote in an essay that scientists are not really objective, but are human beings driven by opinions and ideas. The essay got a lousy grade from one of the lecturers: he proclaimed that scientists must be among the most objective people around. He was making a fatal mistake.

Self-made experts are almost always the least objective people around. They have worked hard to build up their knowledge, they have opinions, and they are pretty sure they know right from wrong without having to check. This kind of knowledge is based on experience, but take care not to be seduced by the dark side of the force! When you believe you know the answer, you might never find out that you are wrong. Science is not about being an expert; it is about being stupid, i.e., never assuming, always feigning ignorance, always being critical. Clearly it is easier to criticize or debunk than it is to make a constructive contribution, and many scientists have been seduced by that dark side (standing in the way of progress and confusing the issue with opinions rather than facts), but this is the challenge for a scientific community. The essence is to have a discipline that will lead to the right conclusion whether your mind is initially open or not.

Asking the Question

What kind of questions would a system administrator ask? Here are some examples:

  • Which is better, static mounting or auto-mounting?
  • In firewalls, how much does a proxy delay service availability?
  • Given identical hardware, which combination of Web server and OS is most efficient: Apache or Microsoft IIS, on GNU/Linux, NT, or FreeBSD?
  • Why is my system running more slowly than it used to?
  • Why does program X dump core every time it starts on one host, but not on another?

As you read these questions, you are probably already forming your own opinion about what the correct answer is. But what evidence do you have for your opinion? Long experience? A gut feeling? Hearsay? Let us look at the first of these. Which is better, static mounting or auto-mounting? On reflection, we realize that this question has too many unknowns. So we try again:

Given a particular operating system, which is better?
But restricting to a single OS is not very interesting, since the results from one OS might actually be different from the results from another. A comparison of two would be the minimum we would accept as indicative.

When is static mounting more efficient than auto-mounting, or vice versa?
We now realize that we are in far more trouble than we thought. The question is potentially very complex. It will be necessary to consider a variety of cause-effect correlations in order to get a reasonable answer to our initially vague question. Then as we continue, we think of more things: What effect does environment have on whether the auto-mounter is better than static mounting? Which auto-mounter? Which static NFS implementation, TCP or UDP? Most important of all: What are we going to measure to find out the answer?

The Research Loop

Answering questions is a difficult business. It is a process:

#!/usr/bin/findout
#
#
ignorant=true;
while (ignorant || alive)
{
 Assess Motivation and Subject;
 Do Measurements/Experimentation;
 Interpret results;
 Criticize interpretation;
 if (results interesting)
 {
  Communicate results;
 }
}

Notice how the loop doesn't end. Why not? Things change. An answer one day is not the right answer on a different day. Come to think of it: What constitutes an answer at all?

Take NFS: Does the question about static or auto-mounting have an answer? Does it have many answers, depending on time, place, environment, and so on? Can we ever say that we have found the "right answer"? In the NFS example, there might not be a correct answer to the problem. Unless one can prove that one method is intrinsically more efficient than the other, and intrinsically more elegant, most people would not care to ask such a broad question. The question should be restricted to the type: Under what circumstances is X more efficient than Y? This can be answered by measuring numbers.

Questions that ask us to make value judgments are very difficult to answer. Questions of this type can be discussed (this might be useful), but there is no right or wrong answer. The result of such a discussion can be used to motivate other studies, provide background knowledge for another study, or even demonstrate that a simple claim is in fact not true in general. However, nothing is proven to be true in general.

The most valuable kind of knowledge is the deeper understanding of why and how things happen. Understanding is usually about figuring out mechanisms that relate cause to effect. Papers that increase understanding or awareness of actual phenomena are useful. Some questions are easy to answer, because the distance between cause and effect is small, e.g., Why does the file disappear when I type rm? Here the link is an atomic operation: delete. Other questions are harder to answer, e.g., Why does the system run slowly at certain times? Now there might be several causes of the observed effect. Elucidating the causal connection could be difficult. There are several approaches for doing this:

  • Inspired guessing (followed by verification)
  • Recognizing the signatures of known effects
  • Gradual elimination of possibilities
  • Statistical analyses (to test ideas or separate overlapping signals)

What Is Already Known?

There is no need to reinvent the wheel: it is both a waste of time and annoying to the original inventor. It is good practice to look at what has been done before, and it is important to refer to it. The purpose of giving references is to place work in a context, to allow others to make the connections for themselves (for pedagogical reasons, and to verify your conclusions), and to avoid repeating what is already known. Tracking down references can be hard, and most work gets repeated in different contexts due to ignorance.

Naturally the inventor of a triangular wheel would like to be recognized for his/her work, but it might just confuse the issue. The competition in research can be so intense that it becomes absurd. Researchers in some environments are notorious for actively writing to authors of other papers telling them that they should be referred to. You must make a judgment about the importance of earlier work. On the one hand, you are not obliged to be the historian, summarizing the entire history of a subject on every occasion, but, on the other hand, you do need to tell the readers where you have come from and where you are going. Also, if others suggested the study you are making, it is important to refer to them: they thought of it first, and there was probably a reason why, relating to their own research.

For the NFS question, a quick search through LISA proceedings revealed three papers on NFS measurements and one on the auto-mounter. However, none of these dealt with comparing the efficiency of auto-mounted filesystems with static-mounted filesystems. This indicates that a study of this kind might be worthwhile. To confirm this hypothesis, I would then need to search through other sources, such as the ACM library or IEEE journals.

Measurements and Scales

Getting numbers is the most convincing way of making a point. A numerical value is less open to woolly interpretation than a descriptive result. To get numbers, you first have to find out what can be measured and which of those measurements, if any, are relevant to what you are trying to discover.

There are many sources of numbers. Typical monitoring commands for UNIX-like hosts, for instance, include:

  • ps
  • top
  • netstat
  • iostat (Solaris)
  • xload
  • perfmeter (Solaris)

These take snapshots of kernel values. The values need to be collected over time under similar conditions. If measurements cannot be made under similar conditions, the result will contain overlapping signals: one is the signal you are trying to measure; the other signals are background noise. So one thing to be cautious of in a multitasking system is that the existence of multiple processes implies multiple overlapping signals. If you are studying a single process, how are you going to separate the effect of the one process you are looking at from all of the others?
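Collecting snapshots over time can be sketched in a few lines of Python. This is only an illustration, not a tool from the article: the names `collect_samples` and `one_minute_load` are invented for the example, the default probe is the one-minute load average from `os.getloadavg` (available on UNIX-like hosts), and the sample count and interval are arbitrary assumptions.

```python
import os
import time


def one_minute_load():
    """Return the 1-minute load average (UNIX-like hosts only)."""
    return os.getloadavg()[0]


def collect_samples(probe=one_minute_load, count=5, interval=1.0, sleep=time.sleep):
    """Call `probe` `count` times, `interval` seconds apart.

    Returns a list of (timestamp, value) pairs so that each value can
    later be related to the time at which it was observed.
    """
    samples = []
    for i in range(count):
        samples.append((time.time(), probe()))
        if i < count - 1:          # no need to wait after the last sample
            sleep(interval)
    return samples


if __name__ == "__main__":
    for stamp, value in collect_samples(count=3, interval=2.0):
        print(f"{stamp:.0f} {value:.2f}")
```

Any function that returns a number can be passed as `probe`, so the same loop could record a different kernel value under the same conditions.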

Several techniques are available. An understanding of scale can help here. One of the most important things to understand about dynamic systems, whose measurements change over time, is that very different things are going on at different scales. For example, put your hand in front of you and hold it still. At the scale of 1 cm your hand appears solid and still. If you swap your eye for a bionic appendage and zoom in (don't try this at home, kids) down to the sub-millimeter level, you see that there are all kinds of cellular things bubbling around and moving. Zooming in even further to the nanometer level, you see atoms flying around like crazy; further still, electrons going around in circles. Which picture is correct? Clearly there is information contained at every level, but the information concerns different aspects of the whole. The same phenomenon applies to any complex system. A computer is such a complex system.

For example, suppose we choose to measure disk usage. On a particular machine with no users but with some network services running, many temporary files are being created and destroyed, but on average nothing much happens. On average, the number of files does not change, since as many files are destroyed as are created. On the other hand, the number of bytes grows steadily. The astute experimenter determines that this is due to quietly growing log files, not to a leak in the network services.

If scale is important, how long do we have to measure something until we can be sure that we have seen what is really going on at all levels? How frequently do we have to measure in order to resolve an effect? Nyquist's sampling theorem says that, if we want to see effects on a time scale of t seconds, we need to sample at least every t/2 seconds. That is why CD recordings sample at 44.1 kHz, when humans can only hear up to about 20 kHz at best. Similarly, if we want to see an effect at scale t seconds, we need to sample for at least 4t seconds in order to be sure of seeing a whole cycle. These values are very rough; in general, you can never have enough data. You should sample much more than you think you need, just to be sure.
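The two rules of thumb above are simple enough to write down as arithmetic. The following Python sketch does only that; the function name `sampling_plan` is invented for the example, and the factors 1/2 and 4 are the rough values quoted in the text, not precise requirements.

```python
def sampling_plan(t_seconds):
    """Return (max_interval, min_duration) for an effect of time scale t.

    Follows the rough rules quoted in the text: sample at least every
    t/2 seconds, and keep sampling for at least 4t seconds.
    """
    if t_seconds <= 0:
        raise ValueError("time scale must be positive")
    max_interval = t_seconds / 2.0
    min_duration = 4.0 * t_seconds
    return max_interval, min_duration


if __name__ == "__main__":
    # A daily rhythm has a time scale of 24 hours = 86400 seconds.
    interval, duration = sampling_plan(86400)
    print(f"sample at least every {interval / 3600:.0f} hours")
    print(f"for at least {duration / 86400:.0f} days")
```

For a daily cycle this gives a sample at least every 12 hours, for at least 4 days. In practice one would sample far more often and for far longer, as the text advises.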

What overlapping signals might we see? Many measurements made on a computer have a daily rhythm that is caused by the working pattern of its users. The signal peaks around midday and is lowest during the night. This is a periodic signal that is mixed in with the general chatter of system behavior. Here cause and effect are easy to identify by plotting a graph of the data. It might be possible to see when users have their lunch break, just by measuring process behavior.
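One simple way to bring out such a daily rhythm is to average all measurements that were taken at the same hour of the day. The Python sketch below assumes the data are already available as (hour, value) pairs; the function names `hourly_profile` and `busiest_hour` and the sample numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def hourly_profile(samples):
    """Average values by hour of day.

    `samples` is an iterable of (hour, value) pairs, with hour in 0..23.
    Returns a dict mapping each hour seen to the mean value at that hour.
    """
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour % 24].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def busiest_hour(profile):
    """Return the hour whose average value is largest."""
    return max(profile, key=profile.get)


if __name__ == "__main__":
    # Invented process counts observed on two different days.
    observed = [(3, 2), (12, 40), (20, 10), (3, 4), (12, 60)]
    profile = hourly_profile(observed)
    print(profile, busiest_hour(profile))
```

Averaging over many days suppresses the chatter that differs from day to day and leaves the part of the signal that repeats every 24 hours.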

Be aware that the act of measurement can affect the measurement you are making. In order to make a measurement, you have to start a program that measures the system, but this uses resources too. Are those resources significant? For instance, if you run UNIX top to look at which process consumes most resources, and consistently see that top itself is the program that features highest in the readout, then you know that you are disturbing the system too much. This is a problem with any finite system. It is like the "observer effect" in physics, which is often loosely identified with Heisenberg's "uncertainty principle." The act of measurement might be the very thing that disturbs the system. To subtract the effects of measurement, we need to be able to control or predict their effect on the system.
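A crude first check on whether a measurement is too intrusive is to time the measurement itself and compare that with the gap between samples. The Python sketch below does this; the name `probe_overhead` is invented for the example, and elapsed wall-clock time is used as a stand-in for "resources", which is an assumption: it says nothing about memory or I/O consumed by the probe.

```python
import time


def probe_overhead(probe, interval, repeats=10, clock=time.perf_counter):
    """Estimate the fraction of each sampling interval spent inside `probe`.

    The probe is timed `repeats` times and the average cost is divided by
    the sampling interval. A value close to 1.0 means that measuring takes
    almost as long as the gap between measurements.
    """
    start = clock()
    for _ in range(repeats):
        probe()
    average_cost = (clock() - start) / repeats
    return average_cost / interval


if __name__ == "__main__":
    # Hypothetical probe: summing a list stands in for a real measurement.
    fraction = probe_overhead(lambda: sum(range(100000)), interval=1.0)
    print(f"probe uses about {fraction:.1%} of each interval")
```

If the fraction is not small, the probe is a significant part of what is being measured, and either the interval or the probe needs to change.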

Any meaningful result must be repeatable. If the result is not reproducible, it is of no value to science. Sometimes it is necessary to transform or manipulate data in order to find the features that are reproducible. It is not always the numbers that are reproducible, but their distribution or pattern of change. Finding the right variable or representation of data is a challenge for a researcher. This is part of what makes science fun. Even in something as simple as a desktop PC, there are things going on that are not easily appreciated without a little analysis. Finding out something that you hadn't realized, or something that confirms your suspicions, is a fantastically satisfying experience.

Resorting to Statistics

Statistics is about making the best of a bad lot. You don't have any clear idea what makes something happen, so you look to see what the laws of numbers can tell you. Statistics are used, broadly speaking, to separate signal from noise. More advanced notions of statistics have to do with determining relationships between cause and effect, subject to (i.e., filtered according to) certain conditions.

Statistical averaging is a little bit like half-closing your eyes to look at the data. When you deliberately blur an image, you see the main features more clearly, since distracting minor variations are blurred out. The technical term for this is coarse graining. The aim of averaging is to separate signal from noise. It is like the example of a hand, discussed above. If you always looked at your hand at the level of atoms, you would find it very confusing. However, if you change glasses and blur out the effects of individual atoms, averaging over individual cells so that your hand looks like a solid continuum, then it starts to make more sense as a hand. It becomes possible to understand its function.
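Coarse graining can be illustrated by replacing each run of consecutive measurements with its average. This Python sketch is one simple way of doing it (block averaging), chosen for brevity; the function name `coarse_grain` and the block size are assumptions made for the example.

```python
from statistics import mean


def coarse_grain(values, block):
    """Replace each run of `block` consecutive values by its average.

    `values` must be a sequence such as a list. A trailing run shorter
    than `block` is averaged as well, so no data are silently dropped.
    """
    if block < 1:
        raise ValueError("block size must be at least 1")
    return [mean(values[i:i + block]) for i in range(0, len(values), block)]


if __name__ == "__main__":
    noisy = [1, 3, 2, 4, 10]
    print(coarse_grain(noisy, 2))
```

The larger the block, the more detail is blurred out, so the block size decides which scale of behavior remains visible.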

In the example of temporary files above, we can safely say that the average number of files is a constant over long periods of time. Over short periods there are changes in the number of files. The average amount of disk space used rises gradually, however, due to the log files. Here we see how the process of averaging separates out behavior at different scales. The "error" or standard deviation is a measure of the average size of a short-time signal (often called a fluctuation).
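The temporary-file example can be made concrete with a few invented numbers. The Python sketch below computes the average, the standard deviation (the size of the fluctuations), and a least-squares slope (the long-term drift) for a series. The function name `summarize` and both data sets are made up for illustration; they are not measurements from a real host.

```python
from statistics import mean, pstdev


def summarize(values):
    """Return (mean, standard deviation, slope per sample) of a series.

    The slope is a least-squares straight-line fit against the sample
    index, so it is measured in "units per sample".
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    avg = mean(values)
    x_avg = (n - 1) / 2                      # mean of 0, 1, ..., n-1
    covariance = sum((i - x_avg) * (v - avg) for i, v in enumerate(values))
    x_variance = sum((i - x_avg) ** 2 for i in range(n))
    return avg, pstdev(values), covariance / x_variance


# Invented numbers: file counts fluctuate around a constant level,
# while bytes used creep upwards as a log file grows.
file_counts = [100, 103, 98, 99, 99, 98, 103, 100]
bytes_used = [5000, 5100, 5200, 5300, 5400, 5500, 5600, 5700]

if __name__ == "__main__":
    print(summarize(file_counts))
    print(summarize(bytes_used))
```

With these invented numbers the file counts have a mean of 100 and no drift, only fluctuations, while the bytes used drift upwards by 100 per sample.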

Correlations can also be used to link cause and effect. An auto-correlation is the correlation of a single measured quantity with itself at different times; it is a measure of how similar a value is at different periods of time. The correlation length (or time) is a measure of the distance over which the system seems to look uniform. Sudden changes in correlation length are referred to as phase transitions, after the same phenomenon in physics. They signal dramatic changes (called catastrophic changes) that imply a significant change in behavior. Cross-correlations measure how similar two separate sets of measurements are, i.e., whether it is likely that one measurement is affecting the other, or whether both signals have a common explanation. These are some of the tools that statistics has to offer for analyzing data.
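Both kinds of correlation can be computed with very little code. The Python sketch below uses the Pearson correlation coefficient, which is one common choice among several, and estimates the auto-correlation by correlating a series with a shifted copy of itself. The function names `pearson` and `autocorrelation` and the test signal are assumptions made for the example.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_avg, y_avg = mean(xs), mean(ys)
    covariance = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    x_spread = sum((x - x_avg) ** 2 for x in xs) ** 0.5
    y_spread = sum((y - y_avg) ** 2 for y in ys) ** 0.5
    if x_spread == 0 or y_spread == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (x_spread * y_spread)


def autocorrelation(values, lag):
    """Correlate a series with a copy of itself shifted by `lag` samples."""
    if not 0 < lag < len(values) - 1:
        raise ValueError("lag must leave at least two overlapping samples")
    return pearson(values[:-lag], values[lag:])


if __name__ == "__main__":
    signal = [0, 1, 0, -1] * 4               # invented signal with period 4
    for lag in (1, 2, 4):
        print(lag, round(autocorrelation(signal, lag), 3))
```

A value near +1 at some lag suggests that the series repeats with that period, a value near -1 that it is out of step with itself, and a value near 0 that the two ends of the lag have little to do with each other. A high cross-correlation between two series does not by itself say which one is the cause.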

Trial and Error

An important part of research is the ability to try and fail. One has to be willing to fail maybe 90% of the time in order to produce 10% of stuff worth telling someone about. The art of research is in channelling that 90% of failure back into the 10% of success, i.e., not just giving up on something interesting, but persevering until real progress is made.

System administrators have many research skills already. It is a part of the job to fiddle with stuff until the answer pops out. Turning these skills into a research project takes only a little discipline. The discipline is not wasted, even if it does not always amount to a published paper.

 

Last changed: 25 nov. 2000 ah