|
Perl Practicum: The Email of the Species
by Hal Pomeranz
A lot of people seem to be interested in writing Perl scripts that
either send email or parse email messages. This may have something
to do with the growth of spamming software, or it may simply be a
symptom of the growth of the Web/CGI and the Internet in general.
This column presents some sample code to make handling email in
Perl relatively painless.
Is This Address Valid?
There are a couple of ways to validate email addresses. The
simplest validation is to compare the address against a regular
expression. One possibility is:
|
/^[^@]+@[^@]+\.[A-Za-z]{2,4}$/
|
Translate this to "some stuff followed by `@', followed by more
stuff, and ending with a literal dot and two to four letters" (four
letters for the ".arpa" and ".nato" domains). This still permits
invalid email addresses like
|
foo@..com
foo@bar.baz
|
and addresses with various special characters that are not
generally permitted in usernames or domain names. It is probably
dangerous to constrain the username portion of the regexp given the
proliferation of X.400 and various PC email packages that allow all
manner of strange characters on the lefthand side of the address.
However, the righthand side could be tightened up:
|
/^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$/
|
The righthand side now requires one or more subdomains, each
followed by a dot, before the top level domain specifier. This
rejects "foo@..com" but still accepts "foo@bar.baz". You are welcome
to list out all the valid three- and four-letter domain names, if
you know them all (there are seven valid three-letter domains;
prizes to the first person to send me the correct list in email).
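As a rough illustration of the difference between the two patterns,
the sketch below runs both against a few sample addresses. The
function name check_address and the variable names are made up for
this example; only the two regular expressions come from the text.

```perl
use strict;
use warnings;

# The two patterns discussed above, precompiled.
my $loose = qr/^[^@]+@[^@]+\.[A-Za-z]{2,4}$/;
my $tight = qr/^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$/;

# Hypothetical helper: returns a pair of flags (loose match, tight match).
sub check_address {
    my($addr) = @_;
    return(($addr =~ $loose) ? 1 : 0, ($addr =~ $tight) ? 1 : 0);
}

foreach my $addr ('foo@bar.com', 'foo@..com', 'foo@bar.baz', 'foo.bar.com') {
    my($l, $t) = check_address($addr);
    printf("%-12s loose=%d tight=%d\n", $addr, $l, $t);
}
```

Only the tightened pattern turns away "foo@..com"; neither one can
tell that ".baz" is not a real top level domain.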
Another alternative is to interrogate the domain name service about
the domain portion of the email address. The difficulty is that you
have to check for either a mail exchanger (MX) record for the
domain or an Internet address (A) record. Here is some sample
output from the nslookup command:
|
% nslookup -q=any cc.swarthmore.edu
Server: localhost
Address: 127.0.0.1
Non-authoritative answer:
cc.swarthmore.edu preference = 0, mail exchanger = cc.swarthmore.edu
cc.swarthmore.edu internet address = 130.58.64.20
Authoritative answers can be found from:
swarthmore.edu nameserver = CS.swarthmore.edu
swarthmore.edu nameserver = DNS-EAST.PREP.NET
CS.swarthmore.edu internet address = 130.58.68.10
DNS-EAST.PREP.NET internet address = 129.250.252.10
|
If any line starts with the domain name we are querying
("cc.swarthmore.edu") and contains either "mail exchanger" or
"internet address" information (the above domain happens to have
both), then the domain name is valid from an email perspective. We
can codify this into the following function:
|
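sub valid_address {
    my($addr) = @_;
    my($domain, $valid);
    # Check the syntax first; this also limits $domain to letters,
    # digits, "_", "-", and "." before it goes on a command line.
    return(0) unless ($addr =~ /^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$/);
    $domain = (split(/@/, $addr))[1];
    $valid = 0;
    open(DNS, "nslookup -q=any $domain |") || return(-1);
    while (<DNS>) {
        # \Q...\E makes the dots in $domain literal; /i ignores case.
        $valid = 1 if (/^\Q$domain\E.*\s(mail exchanger|internet address)\s=/i);
    }
    close(DNS);
    return($valid);
}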
|
The function returns "-1" on error, "0" if the address is invalid,
and "1" if the address is valid. Note that we verify the address
against the regular expression first, before paying the cost of
invoking another process.
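The line test inside the loop can be tried out without running
nslookup at all. The sketch below applies the same kind of match to
lines held in a list, quoting the domain with \Q...\E so that its
dots match literally. The helper name dns_lines_valid is
hypothetical, and the sample lines are taken from the nslookup
output shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if any line starts with $domain and
# carries mail exchanger or internet address data, 0 otherwise.
sub dns_lines_valid {
    my($domain, @lines) = @_;
    foreach my $line (@lines) {
        return(1) if ($line =~ /^\Q$domain\E.*\s(mail exchanger|internet address)\s=/);
    }
    return(0);
}

my @answer = (
    "cc.swarthmore.edu preference = 0, mail exchanger = cc.swarthmore.edu\n",
    "cc.swarthmore.edu internet address = 130.58.64.20\n",
    "swarthmore.edu nameserver = CS.swarthmore.edu\n",
);
print dns_lines_valid("cc.swarthmore.edu", @answer), "\n";
print dns_lines_valid("swarthmore.edu", @answer), "\n";
```

The second query finds only a nameserver line for "swarthmore.edu"
in this sample, so it is reported as not valid for email.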
The function still does not verify the user portion of the address,
but this is essentially an intractable problem. With most
organizations installing firewalls between their machines and the
Internet, it is unlikely that your machine could discover, much
less contact, the machine where final delivery will take place.
Only at this machine, however, can you verify the authenticity of
the user portion of the address.
Sending Email
The preferred mechanism for sending email from a program is to
invoke sendmail directly, because this gives the program full
control over the header information. Besides the "To:", "From:", and
"Subject:" headers, consider using "Reply-to:", "Errors-to:", and
"Precedence:", particularly if you are sending out a mass mailing
of some sort.
Here is a simple function for sending email to a list of recipients:
|
sub send_email {
    my($recip_ref, $header_ref, $body_ref) = @_;
    open(MAIL, "| /usr/lib/sendmail @{$recip_ref}") || return(undef);
    foreach my $key (keys(%{$header_ref})) {
        print MAIL "$key: $$header_ref{$key}\n";
    }
    print MAIL "\n";            # blank line separates headers from body
    print MAIL @{$body_ref};
    # close() on a pipe is false if sendmail exited with an error.
    close(MAIL) || return(undef);
    return(1);
}
|
The function expects three references as arguments: a list
reference containing the list of actual recipients, a hash
reference containing the header information, and a list reference
containing the lines of the body of the message. The hash reference
should look like this:
|
{'To'         => 'foo@bar.com, baz@bar.com',
 'From'       => 'Mail Program <you@yourdomain.com>',
 'Subject'    => 'This here is some mail',
 'Precedence' => 'bulk',
...
}
|
Lines in the body should have trailing newlines (or you will have
to modify the function to insert them). The function returns
nonzero on success and undef on failure.
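To see what actually travels down the pipe, it can help to build the
message as a string first. The sketch below is one way to do that,
not part of the function above: format_message is a hypothetical
helper, and it sorts the header names only so that the output is
predictable.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the header block, a blank line, and the
# body lines as a single string.
sub format_message {
    my($header_ref, $body_ref) = @_;
    my $text = '';
    foreach my $key (sort(keys(%{$header_ref}))) {
        $text .= "$key: $$header_ref{$key}\n";
    }
    $text .= "\n";                      # blank line ends the headers
    $text .= join('', @{$body_ref});    # body lines carry their own newlines
    return($text);
}

my %headers = ('To'      => 'foo@bar.com',
               'From'    => 'Mail Program <you@yourdomain.com>',
               'Subject' => 'This here is some mail');
my @body = ("First line of the body.\n", "Second line.\n");
print format_message(\%headers, \@body);
```

Printing the result to the screen is a cheap way to check the
headers before any real mail goes out.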
Note that the function has the path to sendmail hard-coded. Change
this if your sendmail binary is not in /usr/lib. If you are sending
a large number of mail messages, be sure to put a sleep() statement
between batches of email, or you will be responsible for a denial
of service attack on your own machine and your organization's mail
gateway.
Each call to the function starts a new sendmail process, so if you
send a large number of email messages in a short period of time,
you will surely run into your OS's limit on processes and open()
will start to fail. Be sure to check the function's return value if
you think you will be starting more than a few dozen processes.
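One possible shape for the batching advice is sketched below. The
name send_in_batches, the batch size, and the pause length are all
assumptions for illustration. The actual sending is left to a code
reference supplied by the caller, which would normally wrap a
mail-sending function and return true on success.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $send_ref once per batch of at most $size
# recipients, sleeping $pause seconds between batches. Returns the
# number of recipients handed off before the first failure.
sub send_in_batches {
    my($recip_ref, $size, $pause, $send_ref) = @_;
    my @todo = @{$recip_ref};       # copy so the caller's list is untouched
    my $sent = 0;
    $size = 1 if ($size < 1);       # guard against an endless loop
    while (@todo) {
        my @batch = splice(@todo, 0, $size);
        # Stop at the first failure rather than starting more processes.
        return($sent) unless ($send_ref->(\@batch));
        $sent += @batch;
        sleep($pause) if (@todo && $pause > 0);
    }
    return($sent);
}

# Stand-in sender that only records the size of each batch.
my @batches;
my $count = send_in_batches(['a@example.com', 'b@example.com', 'c@example.com'],
                            2, 0,
                            sub { push(@batches, scalar(@{$_[0]})); return(1); });
print "sent $count in batches of @batches\n";
```

Stopping at the first failed batch means a process shortage halts
the run instead of making the shortage worse.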
Receiving Email
Parsing an email message is a little more tricky. A typical email
message looks like this:
|
From somebody@somedomain.com Thu Feb 6 15:19 PST 1997
<header1>: <stuff>
<header2>: <more stuff>
    <more stuff for header2>
...
<headerN>: <stuff>

<line1>
...
<lineN>
|
In UNIX mailboxes, messages always begin with "\nFrom " (note the
trailing space). That line is followed by one or more lines of
header information, each a header name and its data separated by a
colon. A header may continue onto two or more lines, but
continuation lines must begin with whitespace. The headers are
terminated by a blank line. The body is everything else until the
next "\nFrom ".
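A minimal sketch of splitting a mailbox along those boundaries
follows. The name split_mailbox is hypothetical, and the code
assumes the mailbox has already been read into a list of lines. It
treats the start of the file as if it followed a blank line, so the
first message is found as well.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of array references, one per
# message, each holding that message's lines ("From " line included).
sub split_mailbox {
    my(@lines) = @_;
    my(@messages, $prev_blank);
    $prev_blank = 1;                # start of file counts as a blank line
    foreach my $line (@lines) {
        if ($prev_blank && $line =~ /^From /) {
            push(@messages, []);    # start a new message
        }
        # Lines before the first "From " line are dropped.
        push(@{$messages[-1]}, $line) if (@messages);
        $prev_blank = ($line =~ /^\s*$/) ? 1 : 0;
    }
    return(@messages);
}

my @mbox = (
    "From a\@example.com Thu Feb  6 15:19 PST 1997\n",
    "Subject: one\n",
    "\n",
    "Body of the first message.\n",
    "\n",
    "From b\@example.com Thu Feb  6 15:20 PST 1997\n",
    "Subject: two\n",
    "\n",
    "Body of the second message.\n",
);
my @messages = split_mailbox(@mbox);
print scalar(@messages), " messages\n";
```

A body line that follows a blank line and starts with "From " would
fool this test, which is why mail software commonly stores such
lines as ">From ".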
Suppose we have the lines from a single email message broken out
into a list. We need a function to break the message out into a
hash structure for easy manipulation. The keys of the hash will be
the various headers, and the corresponding values will be the
associated data.
|
sub parse_email {
    my(@lines) = @_;
    my($line, $header, $val, %hash);
    shift(@lines);                      # discard the "From " line
    while (@lines) {
        $line = shift(@lines);
        last if ($line =~ /^\s*$/);     # blank line ends the headers
        $line =~ s/\s*$//;
        if ($line =~ /^\s+/) {          # continuation line
            $line =~ s/^\s+//;
            $hash{$header} .= " $line";
        }
        else {
            # \s* rather than \s+ so a header with no data still splits
            ($header, $val) = split(/:\s*/, $line, 2);
            $hash{$header} = $val;
        }
    }
    @{$hash{"BODY"}} = @lines;
    return(%hash);
}
|
First the function throws away the initial "From " line. Then the
function eats lines out of the list until it encounters a blank
line marking the end of the headers. For each header line, the
function checks to see whether the line is a continuation line
(starts with whitespace) or a new header. Continuation lines are
appended to the previous header value. New header lines are split
in two on the first colon and stuffed into the hash. Once the
headers are dispensed with, the remaining body lines are stuffed
into a list reference in the hash.
The only difficulty is that certain headers, e.g., "Received:", can
appear more than once. To resolve this problem, change all of the
values in the hash to list references in order to accommodate the
extra data:
|
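sub parse_email {
    my(@lines) = @_;
    my($line, $header, $val, %hash);
    shift(@lines);                      # discard the "From " line
    while (@lines) {
        $line = shift(@lines);
        last if ($line =~ /^\s*$/);     # blank line ends the headers
        $line =~ s/\s*$//;
        if ($line =~ /^\s+/) {          # continuation line
            $line =~ s/^\s+//;
            $val .= " $line";
            next;
        }
        # A new header begins: store the one just completed.
        push(@{$hash{$header}}, $val) if ($header);
        # \s* rather than \s+ so a header with no data still splits
        ($header, $val) = split(/:\s*/, $line, 2);
    }
    push(@{$hash{$header}}, $val) if ($header);   # store the final header
    @{$hash{"BODY"}} = @lines;
    return(%hash);
}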
|
The algorithm has been modified slightly: instead of stuffing new
header information into the hash immediately and then appending
continuation lines, the entire header is pulled together and
stuffed into the hash only when a new header is encountered. The
alternative, an expression for appending continuation lines to the
last element of an anonymous list reference in the hash, was nearly
gibberish.
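To make the resulting structure concrete, the sketch below feeds a
small made-up message through a copy of the revised function and
looks at a header that appears twice. The copy splits on /:\s*/ so
that a header with no space after the colon still parses; the sample
message and its host names are invented for this example.

```perl
use strict;
use warnings;

# Copy of the revised parse_email, so the example runs on its own.
sub parse_email {
    my(@lines) = @_;
    my($line, $header, $val, %hash);
    shift(@lines);                      # discard the "From " line
    while (@lines) {
        $line = shift(@lines);
        last if ($line =~ /^\s*$/);     # blank line ends the headers
        $line =~ s/\s*$//;
        if ($line =~ /^\s+/) {          # continuation line
            $line =~ s/^\s+//;
            $val .= " $line";
            next;
        }
        push(@{$hash{$header}}, $val) if ($header);
        ($header, $val) = split(/:\s*/, $line, 2);
    }
    push(@{$hash{$header}}, $val) if ($header);
    @{$hash{"BODY"}} = @lines;
    return(%hash);
}

my @msg = (
    "From somebody\@somedomain.com Thu Feb  6 15:19 PST 1997\n",
    "Received: from a.example.com\n",
    "\tby b.example.com\n",
    "Received: from c.example.com\n",
    "Subject: hello\n",
    "\n",
    "Body line.\n",
);
my %mail = parse_email(@msg);
print scalar(@{$mail{'Received'}}), " Received headers\n";
print "first: $mail{'Received'}[0]\n";
print "subject: $mail{'Subject'}[0]\n";
```

Every header value is now a list reference, so even a header that
appears once, such as "Subject:", is read as element zero.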
Be Good
You now have more than enough rope to hang yourself, so let us
close with a couple of admonishments. First, do not use this code
to send unsolicited email or spam to anybody; you are only stealing
from your potential customers and/or targets. Second, if you plan
on writing your own version of the "vacation" program (why? yet
people seem to do this all the time), make sure you pay attention
to the "Precedence:" header and do not send responses to any
message marked "Precedence: bulk". If you do, you will possibly be
spamming an entire mailing list. Thus ends the airing of your
humble author's pet peeves.
Reproduced from ;login: Vol. 22 No. 2, April 1997.
|