8. File Contents

The most brilliant decision in all of Unix was the choice of a single character for the newline sequence.

- Mike O'Dell, only half jokingly

8.0. Introduction

Before the Unix Revolution, every kind of data source and destination was inherently different. Getting two programs merely to understand each other required heavy wizardry and the occasional sacrifice of a virgin stack of punch cards to an itinerant mainframe repairman. This computational Tower of Babel made programmers dream of quitting the field to take up a less painful hobby, like autoflagellation.

These days, such cruel and unusual programming is largely behind us. Modern operating systems work hard to provide the illusion that I/O devices, network connections, process control information, other programs, the system console, and even users' terminals are all abstract streams of bytes called files. This lets you easily write programs that don't care where their input came from or where their output goes.

Because programs read and write via byte streams of simple text, every program can communicate with every other program. It is difficult to overstate the power and elegance of this approach. No longer dependent upon troglodyte gnomes with secret tomes of JCL (or COM) incantations, users can now create custom tools from smaller ones by using simple command-line I/O redirection, pipelines, and backticks.

Treating files as unstructured byte streams necessarily governs what you can do with them. You can read and write sequential, fixed-size blocks of data at any location in the file, increasing its size if you write past the current end. Perl uses the standard C I/O library to implement reading and writing of variable-length records like lines, paragraphs, and words.

What can't you do to an unstructured file? Because you can't insert or delete bytes anywhere but at end of file, you can't change the length of, insert, or delete records. An exception is the last record, which you can delete by truncating the file to the end of the previous record. For other modifications, you need to use a temporary file or work with a copy of the file in memory. If you need to do this a lot, a database system may be a better solution than a raw file (see Chapter 14, Database Access).

The most common files are text files, and the most common operations on text files are reading and writing lines. Use <FH> (or the internal function implementing it, readline) to read lines, and use print to write them. These functions can also be used to read or write any record that has a specific record separator. Lines are simply records that end in "\n".

The <FH> operator returns undef on error or when end of the file is reached, so use it in loops like this:

while (defined ($line = <DATAFILE>)) {
    chomp $line;
    $size = length $line;
    print "$size\n";                # output size of line
}

Because this is a common operation and that's a lot to type, Perl gives it a shorthand notation. This shorthand reads lines into $_ instead of $line. Many other string operations use $_ as a default value to operate on, so this is more useful than it may appear at first:

while (<DATAFILE>) {
    chomp;
    print length, "\n";             # output size of line
}

Call <FH> in scalar context to read the next line. Call it in list context to read all remaining lines:

@lines = <DATAFILE>;

Each time <FH> reads a record from a filehandle, it increments the special variable $. (the "current input record number"). This variable is only reset when close is called explicitly, which means that it's not reset when you reopen an already opened filehandle.

Another special variable is $/, the input record separator. It is set to "\n", the default end-of-line marker. You can set it to any string you like, for instance "\0" to read null-terminated records. Read paragraphs by setting $/ to the empty string, "". This is almost like setting $/ to "\n\n", in that blank lines function as record separators, but "" treats two or more consecutive empty lines as a single record separator, whereas "\n\n" returns empty records when more than two consecutive empty lines are read. Undefine $/ to read the rest of the file as one scalar:

undef $/;
$whole_file = <FILE>;               # 'slurp' mode

The -0 option to Perl lets you set $/ from the command line:

% perl -040 -e '$word = <>; print "First word is $word\n";'

The digits after -0 are the octal value of the single character that $/ is to be set to. If you specify an illegal value (e.g., with -0777) Perl will set $/ to undef. If you specify -00, Perl will set $/ to "". The limit of a single octal value means you can't set $/ to a multibyte string, for instance, "%%\n" to read fortune files. Instead, you must use a BEGIN block:

% perl -ne 'BEGIN { $/="%%\n" } chomp; print if /Unix/i' fortune.dat

Use print to write a line or any other data. The print function writes its arguments one after another and doesn't automatically add a line or record terminator by default.

print HANDLE "One", "two", "three"; # "Onetwothree"
print "Baa baa black sheep.\n";     # Sent to default output handle

There is no comma between the filehandle and the data to print. If you put a comma in there, Perl gives the error message "No comma allowed after filehandle". The default output handle is STDOUT. Change it with the select function. (See the introduction to Chapter 7, File Access.)

All systems use the virtual "\n" to represent a line terminator, called a newline. There is no such thing as a newline character. It is an illusion that the operating system, device drivers, C libraries, and Perl all conspire to preserve. Sometimes, this changes the number of characters in the strings you read and write. The conspiracy is revealed in Recipe 8.11.

Use the read function to read a fixed-length record. It takes three arguments: a filehandle, a scalar variable, and the number of bytes to read. It returns undef if an error occurred or else the number of bytes read. To write a fixed-length record, just use print.

$rv = read(HANDLE, $buffer, 4096)
        or die "Couldn't read from HANDLE : $!\n";
# $rv is the number of bytes read,
# $buffer holds the data read

The truncate function changes the length of a file, which can be specified as a filehandle or as a filename. It returns true if the file was successfully truncated, false otherwise:

truncate(HANDLE, $length)
    or die "Couldn't truncate: $!\n";
truncate("/tmp/$$.pid", $length)
    or die "Couldn't truncate: $!\n";

Each filehandle keeps track of where it is in the file. Reads and writes occur from this point, unless you've specified the O_APPEND flag (see Recipe 7.1). Fetch the file position for a filehandle with tell, and set it with seek. Because the stdio library rewrites data to preserve the illusion that "\n" is the line terminator, you cannot portably seek to offsets calculated by counting characters. Instead, only seek to offsets returned by tell.

$pos = tell(DATAFILE);
print "I'm $pos bytes from the start of DATAFILE.\n";

The seek function takes three arguments: the filehandle, the offset (in bytes) to go to, and a numeric argument indicating how to interpret the offset. 0 indicates an offset from the start of the file (the kind of value returned by tell); 1, an offset from the current location (a negative number means move backwards in the file, a positive number means move forward); and 2, an offset from end of file.

seek(LOGFILE, 0, 2)         or die "Couldn't seek to the end: $!\n";
seek(DATAFILE, $pos, 0)     or die "Couldn't seek to $pos: $!\n";
seek(OUT, -20, 1)           or die "Couldn't seek back 20 bytes: $!\n";

So far we've been describing buffered I/O. That is, <FH>, print, read, seek, and tell are all operations that use buffers for speed. Perl also provides unbuffered I/O operations: sysread, syswrite, and sysseek, all discussed in Chapter 7.

The sysread and syswrite functions are different from their <FH> and print counterparts. They both take a filehandle to act on, a scalar variable to either read into or write out from, and the number of bytes to read or write. They can also take an optional fourth argument, the offset in the scalar variable to start reading or writing at:

$written = syswrite(DATAFILE, $mystring, length($mystring));
die "syswrite failed: $!\n" unless $written == length($mystring);
$read = sysread(INFILE, $block, 256, 5);
warn "only read $read bytes, not 256" if 256 != $read;

The syswrite call sends the contents of $mystring to DATAFILE. The sysread call reads 256 bytes from INFILE and stores them 5 characters into $block, leaving its first 5 characters intact. Both sysread and syswrite return the number of bytes transferred, which could be different than the amount of data you were attempting to transfer. Maybe the file didn't have all the data you thought it did, so you got a short read. Maybe the filesystem that the file lives on filled up. Maybe your process was interrupted part of the way through the write. Stdio takes care of finishing the transfer in cases of interruption, but if you use the sysread and syswrite calls, you must do it yourself. See Recipe 9.3 for an example of this.

The sysseek function doubles as an unbuffered replacement for both seek and tell. It takes the same arguments as seek, but it returns either the new position if successful or undef on error. To find the current position within the file:

$pos = sysseek(HANDLE, 0, 1);       # don't change position
die "Couldn't sysseek: $!\n" unless defined $pos;

These are the basic operations available to you. The art and craft of programming lies in using these basic operations to solve complex problems like finding the number of lines in a file, reversing the order of lines in a file, randomly selecting a line from a file, building an index for a file, and so on.