Compared to the traditional method of mailing printed propaganda, spam costs almost nothing. No mailing infrastructure is needed beyond an Internet account and a personal computer. Even a one or two percent return, applied to millions of emails, yields a number large enough to justify the approach for many senders.
The first ingredient is a large number of email addresses, usually obtained by harvesting addresses off the Internet. A study conducted by the Federal Trade Commission concluded that 86% of all addresses posted to Web pages and to the newsgroups received spam [5]. Anything resembling an email address is collected and used. Those without the resources to mine the addresses themselves can purchase CDs containing ready-to-use lists of addresses.
Next, a mass emailer such as DynamicMailer [3] is used to deliver the message to the list of addresses. These programs normally provide facilities such as header forging and sending email directly from the user's workstation (bypassing the need for an SMTP server), among others.
Spammers must always act as inconspicuously as possible to avoid account cancellation, revenge from angry users, and legal problems. The sender's email address is often forged to an invalid address or to a throw-away account from a free web-mail provider. These accounts rapidly fill with bounce messages caused by outdated and invalid addresses in the list.
Instructions on how to obtain the product or service offered are normally contained in the body of the mail, with URLs pointing to the Web page of the seller.
Next, we outline the most common anti-spam techniques in use today.
Misconfigured mail servers will relay any incoming message to any destination without control. These servers can be used to deliver mail indiscriminately and in many cases, anonymously.
Realtime Blackhole Lists (RBLs) are online databases containing the IP addresses of such servers. These databases are queried by Mail Transfer Agents (MTAs) upon receipt of a new message. If the IP address originating the connection is listed in the database, the connection is terminated and the appropriate error code is sent. Some RBLs also list dial-in IP addresses in an attempt to detect messages sent directly from dial-up workstations.
Many such databases exist today, with MAPS [9] being a well known provider of such services. Most MTAs support this feature with minimal configuration effort.
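The lookup described above can be sketched as follows. This is a minimal illustration that assumes the list is queried over DNS, a common convention in which the octets of the connecting IPv4 address are reversed and prepended to the list's zone; the text above does not specify the query mechanism. The zone name `rbl.example.org` and the function names are hypothetical.

```python
import ipaddress
import socket


def rbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the name resolves, i.e. the list has an entry for this IP.

    `resolve` is injectable so the logic can be exercised without network access.
    """
    try:
        resolve(rbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # Resolution failed: this sketch treats the address as not listed.
        return False


# Hypothetical zone; a real MTA would use the zone published by its RBL provider.
print(rbl_query_name("192.0.2.99", "rbl.example.org"))
```

An MTA using such a check would reject the connection when `is_listed` returns True and continue with normal delivery otherwise.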
RBLs are not very effective as the only spam prevention method, since they can only block mail (spam or not) coming from spammers who use open relays to send their messages. Even though some RBLs specifically list dial-in lines, using those lists is often impractical because many legitimate users send email directly from their workstations over dial-up lines.
Some RBL providers have complicated procedures to remove entries from their databases, leading to a substantial number of false positives (see Section 4.1). Also, there are normally no whitelist mechanisms available to regular users, meaning that it is impossible to grant access to a known valid email address or IP address without supervisory privileges.
Tools in this category try to detect spam by investigating the content of received emails.
Early attempts used simplistic keyword filters, normally implemented with generic mail processing tools. Specific strings like ``Dear Sir'' in the body of the email would flag the message as spam. This produced inaccurate results and a fairly large number of false positives.
A more sophisticated approach to the problem employs scoring of certain keywords, sentences and characteristics pertinent to spam mail. A simple ``Dear Sir'' might indicate that we are dealing with spam, but ``Dear Sir,'' ``Click Here,'' and ``Call now'' in the same email message are a very clear indication. SpamAssassin [20] utilizes this technique.
Usually, a complex set of rules exists with a positive or negative numeric score assigned to each rule. Rules that clearly indicate spam receive high values. Negative values reduce the probability of the email being classified as spam (even in the presence of other clear indications). A rule to match a string like ``Dear Sir/Madam'' will have a lower score than ``Work from home'' as the latter is a clearer indication of the common home employment Internet scams. Expressions like ``Usenix'' and ``Algorithm'' would receive a negative value as they strongly suggest that this is not a spam mail. The mail will be marked as spam if the sum of all matching scores exceeds a user-defined threshold.
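The scoring scheme just described can be sketched in a few lines. The rules, scores, and default threshold below are illustrative assumptions chosen to match the examples in the text; they are not taken from SpamAssassin or any other tool.

```python
import re

# Hypothetical rule set: (pattern, score). Positive scores suggest spam,
# negative scores suggest legitimate mail.
RULES = [
    (re.compile(r"dear sir(/madam)?", re.I), 1.0),
    (re.compile(r"work from home", re.I), 3.0),
    (re.compile(r"click here", re.I), 2.0),
    (re.compile(r"call now", re.I), 2.0),
    (re.compile(r"\busenix\b", re.I), -3.0),
    (re.compile(r"\balgorithm\b", re.I), -2.0),
]


def spam_score(text, rules=RULES):
    """Sum the scores of every rule whose pattern occurs in the text."""
    return sum(score for pattern, score in rules if pattern.search(text))


def is_spam(text, threshold=4.0, rules=RULES):
    """Mark the mail as spam when the total exceeds the user-defined threshold."""
    return spam_score(text, rules) > threshold
```

With these assumed values, a message containing ``Dear Sir,'' ``Click Here,'' and ``Call now'' scores 5.0 and is flagged, while a message that also mentions ``Usenix'' drops back below the threshold.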
There are specific and distinct rules for the email headers and body. The rules applied to the headers are equally effective regardless of the language used to write the email. Body analysis, however, is greatly affected by language: a set of rules that perfectly detects emails in English could completely miss emails written in Italian or Korean.
A new variation of this technique, employing a naïve Bayes classifier to isolate junk mail, has become quite common [17]. Bogofilter [15] is a popular tool in this category.
These tools can be trained to learn about good and bad combinations of tokens and their probabilities of occurring together in incoming mails. As they learn more about what is spam and what is not, the chances of correctly detecting junk emails increase.
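The training idea can be illustrated with a small naive Bayes sketch. It assumes a simple lowercase word tokenizer and add-one smoothing, and it treats tokens as independent; it is not Bogofilter's algorithm, and the class and method names are hypothetical.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Toy token-based classifier trained on messages labelled 'spam' or 'ham'."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        self.tokens[label].update(self.tokenize(text))
        self.messages[label] += 1

    def _log_score(self, words, label, vocab_size):
        counts = self.tokens[label]
        total = sum(counts.values())
        # Smoothed class prior plus smoothed per-token log likelihoods.
        score = math.log((self.messages[label] + 1) / (sum(self.messages.values()) + 2))
        for word in words:
            score += math.log((counts[word] + 1) / (total + vocab_size))
        return score

    def spam_probability(self, text):
        words = self.tokenize(text)
        vocab_size = len(set(self.tokens["spam"]) | set(self.tokens["ham"])) or 1
        diff = self._log_score(words, "ham", vocab_size) - self._log_score(words, "spam", vocab_size)
        diff = max(min(diff, 700.0), -700.0)  # keep math.exp within float range
        return 1.0 / (1.0 + math.exp(diff))
```

Each call to `train` shifts the token counts, so the estimate returned by `spam_probability` changes as more labelled mail is seen.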
Bayesian classifiers suffer from the same problem as the other content filtering approaches: a well-crafted message body is likely to pass unharmed, as few identifiable elements are present. Messages without a text body (for example, advertising as an attached image file) might pose a problem too since very little textual content is available for analysis [10].
To illustrate how difficult it is to classify e-mails as spam based solely on content, consider the following email:
From: abcd12345@domain.com
To: yourname@yourdomain.com
Subject: Check this out.

Hi, I found a tool called ``CleanUp'' at (https://www.example.com). It clears the contents of the browser cache and history lists so other users won't know the websites you have been browsing. Very useful!
This could represent either a perfectly valid email (if coming from a known sender) or a clear indication of spam (if coming from an unknown sender). Analysis based solely on content is very difficult in this case, especially for someone without knowledge of any previous association between the recipient and the sender.
Spam messages are typically sent unaltered to a large number of recipients. Distributed Anti-Spam Networks attempt to curb spam by preventing its propagation. Vipul's Razor [14] and the Distributed Checksum Clearinghouse (DCC) [18] are examples of programs using this technique.
Agents installed on the user's workstation will generate fuzzy signatures of every incoming mail. These signatures ignore small changes in the text so that slight variations in the spam body or headers will still generate similar signatures.
Next, a centralized database is queried looking for the signature computed from the incoming mail. If a match is found, the email is assumed to be spam and is discarded.
Users are responsible for reporting spam to the database (normally by means of a special account where the spam mail should be forwarded). Once spam has been reported by one user, all others will be protected against that specific spam mail or similar ones.
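The signature, lookup, and report steps can be sketched as below. The normalization is an assumption made for illustration (lowercase the body and collapse every run of non-letters), and the in-memory set stands in for the centralized database; Vipul's Razor and DCC use their own signature schemes, which are not reproduced here. Unlike the similar signatures described above, this simplification maps small variants to an identical hash.

```python
import hashlib
import re


def fuzzy_signature(body: str) -> str:
    """Hash a normalized body so that trivial edits yield the same signature."""
    # Keep only ASCII letters; digits, punctuation, and whitespace runs
    # (and any non-ASCII characters) collapse to a single space.
    text = re.sub(r"[^a-z]+", " ", body.lower()).strip()
    return hashlib.sha1(text.encode("ascii")).hexdigest()


class SignatureDatabase:
    """In-memory stand-in for the centralized signature server."""

    def __init__(self):
        self._known = set()

    def report(self, body: str) -> None:
        """Called when a user reports a message as spam."""
        self._known.add(fuzzy_signature(body))

    def is_spam(self, body: str) -> bool:
        """Called by the agent for every incoming message."""
        return fuzzy_signature(body) in self._known
```

Once one user reports a message, `is_spam` returns True for other recipients of the same body, even if the spammer changed a serial number or the punctuation.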
Under normal circumstances, the chance of a false positive is very small, as actual users report spam to the system. Also, the chance of catching spam that is already in the database is quite good.
This approach, however, can only identify spam once a previous (or similar) case exists, meaning that newly sent spam will not be detected. A reasonable number of spam reporters is required to make the system work reliably and the quality of the database is largely dictated by the quality of the reports. Invalid or incorrect reports could poison the database, causing valid emails to be reported as spam. There is also a scalability concern, given the centralized nature of the database servers containing the signatures.
Another point to note is that an extra TCP connection is needed to contact the database servers, making this a non-viable alternative for users behind restrictive firewalls.
Challenge-authentication agents work on the premise that mail should only be delivered after senders identify themselves. ASK, TMDA [11], and QConfirm [13] are examples of programs that employ such a technique (with variations). When a new mail is received, the sender is checked against a database of known addresses. If the sender is known, the email is delivered immediately. Otherwise, a challenge mail will be sent back to the originator requesting confirmation. Once the sender becomes known to the system, further messages coming from the same address will be immediately accepted.
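The delivery decision described above can be sketched as a small state machine. This is an illustrative assumption, not the implementation of ASK, TMDA, or QConfirm: the random token, the queueing of the message until confirmation, and all names are hypothetical.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class ChallengeGate:
    known_senders: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)  # token -> (sender, queued message)

    def receive(self, sender: str, message: str):
        """Return ('deliver', message) for known senders, else ('challenge', token)."""
        sender = sender.lower()
        if sender in self.known_senders:
            return ("deliver", message)
        # Unknown sender: hold the message and issue a challenge token that
        # would be mailed back to the originator.
        token = secrets.token_hex(8)
        self.pending[token] = (sender, message)
        return ("challenge", token)

    def confirm(self, token: str):
        """Process a reply to a challenge; return the released message or None."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None
        sender, message = entry
        self.known_senders.add(sender)  # later mail from this sender is accepted
        return message
```

A forged or invalid return address never produces a call to `confirm`, so the queued message is never released.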
This method exploits the fact that most spammers use invalid return addresses in their messages. Since no (or limited) text analysis is performed, it is impossible to force delivery of the message by crafting the message body to look like an innocent message.
Unlike the previous alternatives, challenge-response systems require subsequent action from the sender before the email is delivered. Spammers using valid email accounts could reply to the challenge and become authenticated to the system. Special provisions exist to avoid mail loops and to avoid sending challenges to mailing lists.