;login: The Magazine of USENIX & SAGE

 

speaking http

A File-Uploader Tool

by Oleg Kiselyov
<oleg@pobox.com>

Oleg Kiselyov is a computer scientist with Computer Science Corporation, in Monterey, California. Soon it will be twenty years since he started using computers to solve somebody else's problems.

 

This article describes a simple HTTP uploading tool. HTTP is commonly viewed as something that happens between a browser and a Web server. However, HTTP is useful in its own right, for example, as a good file-distribution protocol with a number of important advantages over ftp. This article gives an example of how to speak HTTP and be understood.

The HTTP uploader is somewhat reminiscent of Microsoft FrontPage's server extensions. It lets you push binary or text content (Web pages, images, binary files) from one computer to another. If the source platform is Winxx/WinNT, you can make a shortcut to a script that will let you upload files just by dragging and dropping them onto an icon. The tool works through Web proxies and gateways. If you can download Web pages, you should be able to upload files as well.

The uploader tool works on various versions of UNIX and WinNT/Winxx with different HTTP servers: I tried Apache, Netscape, and IIS. The tool is made of two Perl scripts, one of them being a CGI script. The choice of the implementation language is accidental and irrelevant. What deserves admiration is the HTTP protocol, whose power and simplicity make even far more complex applications possible.

HTTP Protocol
By definition[1], HTTP is a request/response protocol that exchanges messages in a format similar to that used by Internet mail (MIME). An HTTP transaction is essentially a remote procedure call. It is usually a blocking call, although HTTP/1.1 provides for asynchronous and batch modes. HTTP allows intermediaries (caches, proxies) to cut into the request-response chain.

An operation to execute remotely is expressed in HTTP as an application of a request method to a resource. Additional parameters, if needed, are communicated via request headers or a request body. The request body may be an arbitrary octet-stream. The HTTP/1.1 standard defines methods GET, HEAD, POST, PUT, DELETE, OPTIONS, TRACE, and CONNECT. A particular server may accept many others. This extensibility is a rather notable feature of HTTP. The parties can use not only custom methods but custom request and reply headers as well. In addition, a client and a server may exchange meta-information via "name=value" attribute pairs of the standard "Content-Type:" header.

Most of the HTTP transactions performed every day are done behind the scenes by browsers, proxies, robots, and servers. Yet the protocol is so simple that one can easily speak it oneself. The only requirement is a language or tool that is able to manipulate text strings and establish TCP connections. Even a simple telnet application may do in a pinch, which is often useful for debugging. Server-side programming is less demanding: a servlet or a scriptlet does not need to bother with the network connectivity, authentication, access restrictions, SSL, and other similar chores. Server modules or FastCGI give a server-side programmer even more tools: load-balancing, persistence, database connectivity, etc. This article demonstrates how to use Perl scripts to speak and respond HTTP directly.

Making an Upload Request — An uptow Script
An uptow is a Perl script that speaks the client part of HTTP. It asks a server to perform a remote operation: store submitted data in a desired location. The server will respond with the result code or an error message. The script is called as follows:

uptow dest-directory local-file-path

It will copy the file specified by local-file-path to a remote site. The data will be placed into a specified dest-directory on the remote site under the same (base) filename. The server will typically prepend a predefined path to this dest-directory (e.g., /usr/local/htdocs or /w/data) to confine file updates to that part of its file system. This script publishes the files synchronously and always tells the result of the transfer.

The remote site to publish to is identified by a number of configuration parameters: $REMOTE_HOST, $REMOTE_PORT, and $TAKER_URI. It is trivial to modify the script to get these parameters from environment variables or to read them from a configuration file.
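As an illustration of the first option, here is a minimal sketch that takes the settings from the environment and falls back to the defaults used in the appendix. The variable names UPTOW_HOST, UPTOW_PORT, and UPTOW_TAKER_URI and the helper load_config are hypothetical; the published script hard-codes these values.

```perl
use strict;
use warnings;

# Hypothetical environment variable names (UPTOW_HOST, UPTOW_PORT,
# UPTOW_TAKER_URI); the original script hard-codes these settings.
sub load_config {
    my ($env) = @_;    # a hash reference, normally \%ENV
    return (
        REMOTE_HOST => $env->{UPTOW_HOST}      || "localhost",
        REMOTE_PORT => $env->{UPTOW_PORT}      || 80,
        TAKER_URI   => $env->{UPTOW_TAKER_URI} || "/cgi-bin/admin/Update-w-Taker.pl",
    );
}

my %config = load_config(\%ENV);
print "Publishing to $config{REMOTE_HOST}:$config{REMOTE_PORT}$config{TAKER_URI}\n";
```

Passing the environment as a hash reference, rather than reading %ENV inside the function, makes it easy to try the function with made-up settings.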

When called as uptow mysite/dev /tmp/data.txt, the script establishes a TCP connection to a destination HTTP server ($REMOTE_HOST) and sends the following message:
PUT /cgi-bin/admin/Update-w-Taker.pl/mysite/dev/ HTTP/1.0 CRLF
Host: hostname.org:80 CRLF
User-Agent: UPTOW/1.3 CRLF
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ== CRLF
Content-Type: text/plain; filename="data.txt" CRLF
Content-Length: 1234 CRLF
CRLF
contents of the file /tmp/data.txt as it is.

The first line of the message is a request line. It is followed by request headers, in a "Name: value" format similar to that of RFC822 mail headers. The names in the headers are case-insensitive. The first empty line signifies the end of the headers. If the message includes a body — as it does in our case — the payload data is sent immediately after the empty line. The request line and the header lines are terminated by a carriage return/line feed (CRLF) character sequence: the character with decimal code 13, followed by the character with decimal code 10.
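The layout just described can be sketched in a few lines of Perl. The function name build_request and its argument conventions are made up for this illustration; the appendix builds the same kind of message by plain string concatenation.

```perl
use strict;
use warnings;

my $CRLF = "\015\012";    # decimal code 13 followed by decimal code 10

# Assemble a request: the request line, the headers, an empty line, the body.
# $headers is a reference to a list of [name, value] pairs, kept in order.
sub build_request {
    my ($method, $uri, $headers, $body) = @_;
    my $message = "$method $uri HTTP/1.0$CRLF";
    $message .= "$_->[0]: $_->[1]$CRLF" for @$headers;
    $message .= $CRLF;    # the empty line that ends the headers
    return $message . $body;
}

my $body = "hello, world\n";
print build_request(
    "PUT", "/cgi-bin/admin/Update-w-Taker.pl/mysite/dev/",
    [ [ "Host",           "hostname.org:80" ],
      [ "User-Agent",     "UPTOW/1.3" ],
      [ "Content-Type",   'text/plain; filename="data.txt"' ],
      [ "Content-Length", length($body) ] ],
    $body);
```

Note that the body is appended as it is: CRLF terminates only the request line and the header lines.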

The request line tells what operation to perform, where to put the payload data, and what version of the HTTP protocol we will speak. Some obsolete firewalls and proxies may either refuse PUT requests outright or break the connection without indicating any error. Also, some Web servers may be configured by default to disallow PUT methods (see below for the server configuration). If you encounter that situation, you can change the uptow and Update-w-Taker.pl scripts to use a POST method instead. The latter is more widely accepted. The PUT method nevertheless seems to be the most appropriate for upload.

The location to store the payload data is specified by a Uniform Resource Identifier (URI). It is a character string that looks like an absolute UNIX file path. The meaning may, however, be different, as we will see. The uptow script creates this URI by appending the desired upload location to the $TAKER_URI. The "update taker" section below explains what an HTTP server does with such a string.

It may happen that the server to which we upload data is not directly accessible. For example, the server and the uptow client may be separated by a firewall. All HTTP transactions between computers on the different sides of a firewall must therefore go through a dedicated Web proxy or a gateway. A Web browser or any other HTTP client has to be made aware of such an arrangement. Specifically, to tell the uptow script to use a proxy you have to set a $PROXY_NAME parameter. The script will then connect to that proxy and have it relay the request to the destination server. The relay request looks just like the direct upload request above. Only the first line is slightly different:

PUT http://hostname.org/cgi-bin/admin/Update-w-Taker.pl/mysite/dev/ HTTP/1.0 CRLF

That is, instead of a URI naming a resource to create, we send the full URL, including the "http://hostname.org" part. Here hostname.org is the name of a host to which we upload the file. The proxy strips away this "http://hostname.org" part when it sends the request to the destination server.
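The difference between the two forms of the request line fits in one conditional. The sketch below uses a hypothetical helper, request_line, and follows the appendix in always spelling out the port in the proxy form.

```perl
use strict;
use warnings;

# Return the request line (without the terminating CRLF).
# When $via_proxy is true, the full URL is sent instead of the bare path.
sub request_line {
    my ($method, $host, $port, $path, $via_proxy) = @_;
    my $target = $via_proxy ? "http://$host:$port$path" : $path;
    return "$method $target HTTP/1.0";
}

my $path = "/cgi-bin/admin/Update-w-Taker.pl/mysite/dev/";
print request_line("PUT", "hostname.org", 80, $path, 0), "\n";
print request_line("PUT", "hostname.org", 80, $path, 1), "\n";
```

In both cases the TCP connection itself goes to whichever host we actually talk to: the proxy if there is one, the destination server otherwise.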

The HTTP protocol defines a number of headers that should or may be used during an HTTP exchange. Here we will describe a particular subset of headers that is used in file-upload transactions.

The Host header identifies the request's target server. It is a good idea always to supply this header; moreover, it is mandatory in version 1.1 of the HTTP protocol. The User-Agent header identifies the client software (the uptow script, in our case). The server usually quotes this information in its logs.

An HTTP server may be configured to demand the identity of the user before it will consider a request from the user's agent. The Authorization header should then be present to specify an authentication scheme and the corresponding credential. In the most basic authentication scheme, the one used in the example shown, a user is identified by a symbolic ID and a password. These two strings, joined by a single colon (:) character and BASE64-encoded, constitute the credential. Every Web server is guaranteed to support the basic scheme. Yet it is hardly secure, since it transmits passwords in an easily decodable form, almost in plain text. Incidentally, the ftp and telnet protocols suffer from the same problem. HTTP/1.1 defines a more secure Digest scheme[2]. An HTTP client may also attempt a Secure Sockets Layer (SSL) connection to a Web server. SSL is a lower-level (transport) protocol; therefore, the content of an HTTP conversation is unaffected by the fact that it is transmitted over an SSL connection.
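To make the basic scheme concrete, the following sketch computes the credential for the user ID "Aladdin" and the password "open sesame", the pair that yields the Authorization value in our example, and then decodes it again to show how little protection the encoding offers. The function name basic_credential is an assumption of this sketch; the uptow script itself takes a ready-made credential from $AUTH_CREDENTIAL.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Build the value of an Authorization header for the basic scheme.
sub basic_credential {
    my ($user, $password) = @_;
    # The empty second argument suppresses the newline that
    # encode_base64 would otherwise append to its result.
    return "Basic " . encode_base64("$user:$password", "");
}

my $credential = basic_credential("Aladdin", "open sesame");
print "Authorization: $credential\n";

# Anyone who sees the header can recover the password.
my ($scheme, $encoded) = split ' ', $credential;
print "Decoded: ", decode_base64($encoded), "\n";
```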

When a request such as ours has a body, the type and the size of its data have to be identified, by Content-Type and Content-Length headers. The former should tell the media type of the data: the "MIME type," as it is often called. Content-Length is the size of the body in bytes. If the data being uploaded is ASCII text, the media type may be set to "text/plain," as above. When the payload is intended to be stored without any further processing, the "application/octet-stream" MIME type seems the most appropriate. Although the request line and the headers are in ASCII, the body of an HTTP message can carry arbitrary data.

Unfortunately, some obsolete Web proxies and gateways (notably Raptor 5.0) are not 8-bit transparent: they do not like zero bytes in a request stream. Apparently firewall programmers used a function strncpy() where memcpy() would have been more appropriate. The uptow script tries to check whether a file to send is ASCII or binary. If it's ASCII, the media type of the Content-Type header is set to "text/plain," and the file is sent as it is. Otherwise, the data is encoded into a hexadecimal stream; BASE64 encoding can be used as well. The media type of "application/x-octet-stream-b2a" identifies the encoded content. A Transfer-encoding header may seem the most fitting place to specify an encoding. Alas, Apache accepts only one value for this request header: chunked. Any other value in Transfer-encoding results in a BAD_REQUEST error. In any case, the payload encoding concerns only pushing of data via a particular obsolete proxy, which I happen to be burdened with. If your Web proxy follows the HTTP standard or you connect to a server directly, you can set the media type to "text/plain" or "application/octet-stream" and forget about encoding.
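The hexadecimal encoding amounts to one call to unpack on the sending side and one call to pack on the receiving side. The sketch below shows the round trip; the helper names encode_b2a and decode_b2a are made up here, and the decoding half is only an assumption about what a receiving script would do.

```perl
use strict;
use warnings;

# Each byte becomes two hexadecimal digits, so the encoded stream
# contains no zero bytes and is exactly twice as long as the data.
sub encode_b2a { my ($data) = @_; return unpack("H*", $data); }
sub decode_b2a { my ($hex)  = @_; return pack("H*", $hex); }

my $binary  = "\0\1\xFFabc";
my $encoded = encode_b2a($binary);
print "$encoded\n";                                    # 0001ff616263
print "Content-Length: ", length($encoded), "\n";      # twice the file size
print "round trip ok\n" if decode_b2a($encoded) eq $binary;
```

The doubling is why the appendix announces a Content-Length of $file_size + $file_size for an encoded file.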

It is not commonly recognized that a Content-Type header may carry parameters in the "name=value" format. The parameters are separated from the media type and from one another by semicolons. The value can be an arbitrary string, possibly quoted if it contains spaces and other special characters. In our example, Content-Type has one parameter: filename. It tells the base name of the file being uploaded. We could have just as well passed this information via a custom request header, for example, X-Filename: data.txt. HTTP is an extensible protocol, which explicitly allows custom headers. A server ignores any headers it does not recognize.
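A receiving script has to take such a header value apart again. Here is a simplified sketch of one way to do it; parse_content_type is a hypothetical name, and the pattern deliberately ignores the finer points of the grammar, such as escaped quotes inside a quoted value.

```perl
use strict;
use warnings;

# Split a Content-Type value into the media type and a hash of parameters.
# Returns an empty list if the value does not start with a media type.
sub parse_content_type {
    my ($value) = @_;
    my ($media_type, $rest) = $value =~ /^\s*([^;\s]+)\s*(.*)$/s
        or return;
    my %params;
    # Each parameter is ";name=value"; the value may be quoted.
    while ($rest =~ /;\s*([^=;\s]+)\s*=\s*(?:"([^"]*)"|([^;\s]*))/g) {
        $params{lc $1} = defined $2 ? $2 : $3;
    }
    return (lc $media_type, \%params);
}

my ($type, $params) = parse_content_type('text/plain; filename="data.txt"');
print "media type: $type\n";
print "filename:   $params->{filename}\n";
```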

HTTP protocol has another powerful feature that unfortunately remains relatively obscure: the body of an HTTP message may be composed of several parts. This is similar to multipart/mixed or multipart/digest MIME email messages, which may carry several pieces of information within a single entity. We can therefore upload several files in one transaction by encapsulating them as separate parts of a single request body. We can also upload a tar file and have the server extract its members. In any case, the corresponding modifications to the uptow and Update-w-Taker.pl scripts are trivial.

Strictly speaking, we do not have to use the uptow script to upload a file. For example, we can forego convenience and employ a spartan tcp-transaction tool, tcp-trans[3]. We can even enter telnet hostname.org 80 on the command line and type in the request, line by line. Pressing a "Return" key is enough to terminate a line. Although it is not the same as sending the CRLF combination, many Web servers are rather forgiving.

If a server accepts the submitted data and successfully stores it in a desired location, it sends an acknowledgment, an HTTP message:

HTTP/1.1 201 Created /w/data/mysite/dev/data.txt CRLF
Server: Apache/1.3.6 (Unix) CRLF
Date: Fri, 29 Oct 1999 00:18:48 GMT CRLF
CRLF

The first line of a server response is a status line. It tells the protocol version the server speaks, a numerical result code, and a brief description of the success or failure of the request. The numerical code is a three-digit number intended primarily for a nonhuman agent. A code within the 200 range signifies a successful completion of a request. A 3xx result code tells the agent that an additional action is necessary; a 4xx code is returned if the request is invalid or cannot be fulfilled (for example, because a user failed to authenticate itself or does not have sufficient permissions). Result codes within the 500 range indicate a serious problem on the server side; see the HTTP document[1] for more details.
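A nonhuman agent might interpret the status line along the following lines. This is only a sketch: the names parse_status_line and status_class are hypothetical, and the uptow script in the appendix merely extracts the code and treats anything from 300 upward as a failure.

```perl
use strict;
use warnings;

# Split a status line into the protocol version, the numerical code,
# and the description. Returns an empty list if the line is malformed.
sub parse_status_line {
    my ($line) = @_;
    $line =~ m{^HTTP/(\d+\.\d+)\s+(\d{3})\s*(.*?)\s*\z} or return;
    return ($1, $2, $3);
}

# Map a result code onto the classes described in the text.
sub status_class {
    my ($code) = @_;
    return "success"               if $code >= 200 && $code < 300;
    return "further action needed" if $code >= 300 && $code < 400;
    return "client error"          if $code >= 400 && $code < 500;
    return "server error"          if $code >= 500 && $code < 600;
    return "other";
}

my ($version, $code, $reason) =
    parse_status_line("HTTP/1.1 201 Created /w/data/mysite/dev/data.txt\015\012");
print "HTTP/$version, code $code (", status_class($code), "): $reason\n";
```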

The status line in a server response is followed by reply headers and an empty line. The latter signifies the end of the headers. The response body, if sent, follows right after the empty line. In case of the 201 reply, there is no body. If the server rejects an upload request or fails to satisfy it, the server sends a response as well, with an appropriate error code:

HTTP/1.1 403 Forbidden CRLF
Server: Microsoft-IIS/4.0 CRLF
Date: Tue, 02 Nov 1999 16:55:00 GMT CRLF
Content-type: text/html CRLF
CRLF
<title>THW-taker Error</title>
<h1>THW-taker Error</h1>
This server encountered an error:<p> <b> d:/temp/bb/uptow.pl is not writable, No such file or directory </b>

Update Taker — An Uploading Server
Update-w-Taker.pl is a CGI script to update a Web site remotely. It takes submitted data sent by the uptow script or a similar application and stores the data in a desired place within the $Dest_root directory tree.

When an HTTP daemon receives the request, the daemon notices that the request URI string starts with /cgi-bin/. This matches a ScriptAlias rewriting template of the server's configuration. Having performed this and possibly other substitutions and alias expansions, the server scans the components of the resulting path. For example:

/usr/local/www/cgi-bin/admin/Update-w-Taker.pl/mysite/dev/

The HTTP server notices that /usr, /usr/local, ... /usr/local/www/cgi-bin/admin are all directories, whereas /usr/local/www/cgi-bin/admin/Update-w-Taker.pl is an executable file, residing in a directory that the server knows has an ExecCGI permission. The server checks access restrictions that apply to the script or the /usr/local/www/cgi-bin/admin directory. For example, the server verifies that the client passed hostname/address filtering rules, the user authenticated itself, and the site or directory configuration allows the PUT method. Finally, the server launches the Update-w-Taker.pl script, passing the payload data from the request body to the script's standard input. The request headers and the client and server identification are passed via the process environment:

CONTENT_LENGTH=10
CONTENT_TYPE=text/plain; filename="data.txt"
DOCUMENT_ROOT=/w/data/htdocs
GATEWAY_INTERFACE=CGI/1.1
HTTP_HOST=localhost:80
HTTP_USER_AGENT=UPTOW/1.3
PATH_INFO=/mysite/dev/
PATH_TRANSLATED=/w/data/htdocs/mysite/dev/
QUERY_STRING=
REMOTE_ADDR=127.0.0.1
REMOTE_PORT=34022
REQUEST_METHOD=PUT
REQUEST_URI=/cgi-bin/admin/Update-w-Taker.pl/mysite/dev/
SCRIPT_NAME=/cgi-bin/admin/Update-w-Taker.pl
SERVER_ADMIN=oleg@hostname.org
SERVER_NAME=hostname.org
SERVER_PORT=80
SERVER_PROTOCOL=HTTP/1.0
SERVER_SOFTWARE=Apache/1.3.6 (Unix)
TZ=GMT

In particular, REQUEST_METHOD tells the method: PUT in our case. Request headers other than those with dedicated variables (such as CONTENT_TYPE and CONTENT_LENGTH) are passed as environment variables whose names start with "HTTP_", e.g., HTTP_HOST and HTTP_USER_AGENT. If we submitted a request with the header X-Filename, the CGI script would check for an environment variable HTTP_X_FILENAME. When the request method is PUT, the CONTENT_TYPE and CONTENT_LENGTH environment variables must be present to tell the data format and the message size.
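The naming rule is mechanical: the header name is converted to upper case, hyphens become underscores, and "HTTP_" is prepended unless the header has a dedicated variable. A small sketch, with cgi_variable_name being a name invented for this illustration:

```perl
use strict;
use warnings;

# Map a request header name to the CGI environment variable that carries it.
sub cgi_variable_name {
    my ($header) = @_;
    (my $name = uc $header) =~ tr/-/_/;
    # Content-Type and Content-Length have dedicated variables.
    return $name if $name eq "CONTENT_TYPE" || $name eq "CONTENT_LENGTH";
    return "HTTP_$name";
}

print "$_ => ", cgi_variable_name($_), "\n"
    for "Host", "User-Agent", "X-Filename", "Content-Type";
```

Because header names are case-insensitive, "x-filename" and "X-Filename" end up in the same variable.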

When parsing the transformed URI above, the server stopped at /usr/local/www/cgi-bin/admin/Update-w-Taker.pl. But the URI continues with /mysite/dev/. This string, if not empty, becomes the content of the environment variable PATH_INFO. The HTTP daemon treats this information as a string — the server does not make any attempt to check whether this string represents a local file, or even whether the string is a valid path string at all.

The content-type of a submitted file must be either

application/x-octet-stream-b2a; filename="data.txt"

or

text/plain; filename="data.txt"

This content is stored in a file with the given "basename" in a directory specified by the PATH_INFO parameter, after prepending the $Dest_root. Thus it is generally impossible to place the content outside the $Dest_root tree. Alas, symbolic directory links may defeat this safeguard. The target file is created if needed. The script must have permissions to write into this file or create it. The script responds with a "201 Created" message or with one of the HTTP error codes. All of the taker's activity is logged.
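The construction of the target path might look as follows. This is a sketch under stated assumptions, not the code of Update-w-Taker.pl: the function destination_path and the explicit rejection of ".." components are additions made for this illustration, and, as noted above, no check of this kind protects against symbolic links.

```perl
use strict;
use warnings;

# Combine the destination root, PATH_INFO, and the base file name.
# Returns nothing if the result could escape the destination root.
sub destination_path {
    my ($dest_root, $path_info, $base_name) = @_;
    # Refuse parent-directory components in the requested directory.
    return if grep { $_ eq ".." } split m![/\\]+!, $path_info;
    # The base name must be a plain file name, not a path.
    return if $base_name eq "" || $base_name =~ m![/\\]! || $base_name =~ /^\.\.?$/;
    my $path = join "/", $dest_root, $path_info, $base_name;
    $path =~ s!/+!/!g;    # collapse repeated slashes
    return $path;
}

my $target = destination_path("/w/data", "/mysite/dev/", "data.txt");
print "$target\n";        # /w/data/mysite/dev/data.txt
```

With $Dest_root set to /w/data, this yields the path quoted in the "201 Created" reply shown earlier.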

Note that both the uptow and Update-w-Taker.pl scripts are (deliberately) written using only the most basic facilities: the core Perl and a Socket module. Therefore it is trivial to rewrite the scripts in some other language, such as Python or Tcl.

HTTP Versus FTP as a File-uploading Protocol
HTTP is a stateless protocol requiring only a single TCP connection, and is therefore less resource-hungry than ftp.

Both ftp and HTTP can be used to upload files from within a firewall. HTTP, however, is designed to operate transparently through proxies and gateways, while ftp requires special clients (SOCKS-enabled, etc.) and possibly passive (PASV) mode.

An HTTP uploader can rely on authentication mechanisms already built into Web servers, in addition to its own access control.

Whenever a file gets uploaded, a receiving HTTP server can synchronously fire up triggers and run arbitrary hooks. This is very difficult to accomplish with ftp. Moreover, if an uploaded file is meant to be fed into an application (e.g., tar, content indexer, META-tag creator, etc.), a receiving HTTP server can launch an application and have it process data while it arrives. There is no need to save incoming data to a file and then pass it to an application. HTTP offers similar advantages over ftp as a file downloading tool.

HTTP and ftp also differ in how tightly they couple a client and a server. When an ftp client uploads a file, it has to perform a cd and possibly chmod, ren, and other operations on a remote server, in addition to the PUT operation. If an administrator of the remote site wishes to have the content put under a different name in a different location, she cannot do that unless she talks to the user making an upload and gets him to change the cd command. During an ftp session a client exercises control — albeit limited — over a server. This is not the case with an HTTP upload. A client does not perform any directory navigation or file operations on the server site. The client merely hands over the data and indicates desired file and directory names and similar meta-information. It's up to the receiving server to store, process, or even discard the content as the server thinks fit. The client has no idea of or control over the way the server processes the submitted data. That means a server administrator can change handling of the incoming content at will — and the client will never know or care.

Advanced Applications of HTTP
HTTP can be used for far more advanced tasks — for example, to support network filesystems like NFS or the Andrew File System (AFS). Moreover, HTTP can trivially implement the "Semantic File System" by Gifford et al. That filesystem builds virtual directories wherein the name of a directory corresponds to a query, and the content of the directory consists of files that match the query (represented by symbolic links in the original implementation). Indeed, HTTP provides a file-centric access to remote resources, which can be anything that a server knows how to apply GET/PUT/DELETE methods to. To a client, the exact nature of reading and writing of the resource is irrelevant — to the client, they all look like files.

A particular HTTP-based network virtual filesystem is described at <http://pobox.com/~oleg/ftp/HTTP-VFS.html>. It allows one to access, create, and modify remote files as if they were on a local filesystem and to handle RFC822 email messages as if they were local read-only directories. Each email header with the message's body constitutes a "file." An advantage of HTTPFS is that it lets one develop XML, etc., "filesystems" quickly, without any need to modify the kernel.

REFERENCES
[1] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, Hypertext Transfer Protocol -- HTTP/1.1, RFC 2616, June 1999. <http://www.w3.org/Protocols/rfc2616/rfc2616.html>

[2] J. Franks, P. Hallam-Baker, J. Hostetler, S. Lawrence, P. Leach, A. Luotonen, E. Sink, and L. Stewart, HTTP Authentication: Basic and Digest Access Authentication, RFC 2617, June 1999. <http://www.ietf.org/rfc/rfc2617.txt>

[3] tcp-transactor, a shell tool. <http://pobox.com/~oleg/ftp/Communications.html#tcp-trans>

Appendix: HTTP Uploading Tool
#!/usr/local/bin/perl -w
#
# This is a script to publish a file on a remote web site via HTTP
#
# This script is a client part of a HTTP copy facility, with Update-w-Taker.pl
# CGI script being a server part. Both scripts can be downloaded from
# http://pobox.com/~oleg/ftp/Perl
# See also http://zowie.metnet.navy.mil/~spawar/JMV-TNG/Publishing.html
# for more details. The client-server system this script is a part of
# is rather similar to FrontPage's server extensions.
#
# Synopsis:
#   uptow dest-directory local-filename
#
# This script will copy the file specified by the 'local-filename' to a
# remote site. It will be placed into a given 'dest-directory'
# on the remote site under the same (base) name. The remote site will
# typically prepend a pre-defined path to this 'dest-directory'
# (e.g., /usr/local/htdocs or /w/data) to confine file updates
# to that part of its filesystem.
#
# $Id: uptow.pl,v 2.0 1999/11/02 20:58:49 oleg Exp oleg $

   # Configuration parameters
$PROXY_NAME =""; # if empty, no proxy is used
$PROXY_PORT = 80;
$REMOTE_HOST = "localhost";
$AUTH_CREDENTIAL = "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="; # if empty, it is not used
$REMOTE_PORT = 80;
$TAKER_URI = "/cgi-bin/admin/Update-w-Taker.pl";
$USER_AGENT = "UPTOW/1.3";
$CRLF="\r\n";
#$TAKER_URI="/cgi-bin/oleg/test-cgi-my";
use integer;
use Socket;

my $buffer;             # i/o (socket) buffer...
my $transfer_chunk = 1024;

   # Main module
@ARGV == 2 or &help("Two arguments are expected");
my $dest_dir = $ARGV[0];
my $file_name = $ARGV[1];
$file_name =~ m!([^\\/]+)$! or die "Invalid filename $file_name";
my $base_name = $1;

   # Check the file to publish...
stat($file_name);
-r _ || die "The file to publish, $file_name, does not exist or is unreadable";
my $file_size = -s _;
open(FILE_CONTENT,$file_name) || die "Failed to open $file_name: $!";
binmode FILE_CONTENT;
my $encoding = $ENV{windir} || -B FILE_CONTENT; # On WinNT, -B is not implemented
$encoding && print STDERR "File $file_name appears to be binary and will be encoded\n";

print STDERR "Sending $file_name of $file_size bytes...\n";
my $resource_to_put = "$TAKER_URI/$dest_dir/";
$resource_to_put =~ s![/\\]+!/!g;  # replace double-slashes-backslashes with a single slash

   # Establish the connection with a server
$|=1;        # Set autoflush on...
my $host_to_connect = $PROXY_NAME || $REMOTE_HOST;
my $iaddr_to_connect = inet_aton $host_to_connect;
$iaddr_to_connect || die "Can't resolve the remote host or proxy name $host_to_connect: $!";
my $port_to_connect = $PROXY_NAME ? $PROXY_PORT : $REMOTE_PORT;
print STDERR "Connecting to $host_to_connect:$port_to_connect...\n";
socket(SOCK, PF_INET, SOCK_STREAM, getprotobyname('tcp')) || die "socket: $!";
connect(SOCK, sockaddr_in($port_to_connect, $iaddr_to_connect)) || die "Failed to connect: $!";
binmode SOCK;
print STDERR "Connection established!\n";

   # Making the request (first in $buffer)
$buffer = "PUT " .
   ( $PROXY_NAME ? "http://$REMOTE_HOST:$REMOTE_PORT" : "" ) .
   $resource_to_put . " HTTP/1.0" . $CRLF;
$buffer .= "Host: $REMOTE_HOST:$REMOTE_PORT" . $CRLF;
$buffer .= "User-Agent: $USER_AGENT" . $CRLF;
$AUTH_CREDENTIAL and $buffer .= "Authorization: $AUTH_CREDENTIAL" .$CRLF;
$buffer .= "Content-type: " .
   ( $encoding ? "application/x-octet-stream-b2a" : "text/plain" ) .
   '; filename="' . $base_name . '"' . $CRLF;
$buffer .= "Content-Length: " .
   ( $encoding ? $file_size + $file_size : $file_size ) . $CRLF;
$buffer .= $CRLF; # End-of-headers

syswrite(SOCK,$buffer,length($buffer)) or die "Request sending error: $!";

my $to_read = $file_size; my $res;
if( $encoding )
{
while ( $to_read > 0 &&
     ($res = read FILE_CONTENT,$buffer,
     ($transfer_chunk < $to_read ? $transfer_chunk : $to_read))) {
   # Each byte is sent as two hexadecimal digits
   syswrite(SOCK,unpack("H*",$buffer),$res+$res) or die "socket write error $!";
   $to_read -= $res;
}
} else {
while ( $to_read > 0 &&
     ($res = read FILE_CONTENT,$buffer,
     ($transfer_chunk < $to_read ? $transfer_chunk : $to_read))) {
   # The data is sent as it is
   syswrite(SOCK,$buffer,$res) or die "socket write error $!";
   $to_read -= $res;
}
}
$to_read == 0 || die "Failed to read the input file completely: $!";
close FILE_CONTENT;
print STDERR "Request sent\n";

# Read the status line, the first line of the response...
sysread(SOCK,$buffer,$transfer_chunk) or die "Error reading the status line: $!";
$buffer =~ m!^HTTP/1\.\d+\s+(\d+)\s+(.+)! or die "Invalid status line: $buffer";
my $response_code = $1;
print STDERR "Status: $response_code $2\n";

# Read the rest of the response and dump it...
print $';
while( ($res = sysread SOCK,$buffer,$transfer_chunk) > 0 )
{
print $buffer
}
close SOCK;

if( $response_code == 304 )
{
print STDERR "Not Modified\n";
exit 1;
}

if( $response_code >= 300 )
{
print STDERR "Error\n";
exit 4;
}

print STDERR "Success\n";
exit 0;

# Print help on how to use the program. Print the first argument as the title
sub help {
$_ = shift;
print STDERR "\n$_\n";

open(THIS_SCRIPT,"$0") || die "Can't open this script to print out help, due to $!";
while( <THIS_SCRIPT> ) {
/^\#!/ && next;
/^\#/ || last;
print STDERR $'
}
close THIS_SCRIPT;
exit 4
}



 

Last changed: 20 Jul. 2000 mc