[Chapter 3] 3.2.155 split

3.2.155 split

split /PATTERN/, EXPR, LIMIT
split /PATTERN/, EXPR
split /PATTERN/
split

This function scans a string given by EXPR for delimiters, and splits the string into a list of substrings, returning the resulting list value in list context, or the count of substrings in scalar context. The delimiters are determined by repeated pattern matching, using the regular expression given in PATTERN, so the delimiters may be of any size, and need not be the same string on every match. (The delimiters are not ordinarily returned, but see below.) If the PATTERN doesn't match at all, split returns the original string as a single substring. If it matches once, you get two substrings, and so on.

If LIMIT is specified and is not negative, the function splits into no more than that many fields (though it may split into fewer if it runs out of delimiters). If LIMIT is negative, it is treated as if an arbitrarily large LIMIT has been specified. If LIMIT is omitted, trailing null fields are stripped from the result (which potential users of pop would do well to remember). If EXPR is omitted, the function splits the $_ string. If PATTERN is also omitted, the function splits on whitespace, /\s+/, after skipping any leading whitespace.

Strings of any length can be split:

@chars = split //, $word;
@fields = split /:/, $line;
@words = split ' ', $paragraph;
@lines = split /^/m, $buffer;

A pattern capable of matching either the null string or something longer than the null string (for instance, a pattern consisting of any single character modified by a * or ?) will split the value of EXPR into separate characters wherever it is the null string that produces the match; non-null matches will skip over occurrences of the delimiter in the usual fashion. (In other words, a pattern won't match in one spot more than once, even if it matched with a zero width.) For example:

print join ':', split / */, 'hi there';

produces the output "h:i:t:h:e:r:e". The space disappears because it matched as part of the delimiter. As a trivial case, the null pattern // simply splits into separate characters (and spaces do not disappear).

The LIMIT parameter is used to split only part of a string:

($login, $passwd, $remainder) = split /:/, $_, 3;

We encourage you to split to lists of names like this in order to make your code self-documenting. (For purposes of error checking, note that $remainder would be undefined if there were fewer than three fields.) When assigning to a list, if LIMIT is omitted, Perl supplies a LIMIT one larger than the number of variables in the list, to avoid unnecessary work. For the split above, LIMIT would have been 4 by default, and $remainder would have received only the third field, not all the rest of the fields. In time-critical applications it behooves you not to split into more fields than you really need.

We said earlier that the delimiters are not returned, but if the PATTERN contains parentheses, then the substring matched by each pair of parentheses is included in the resulting list, interspersed with the fields that are ordinarily returned. Here's a simple case:

split /([-,])/, "1-10,20";

produces the list value:

(1, '-', 10, ',', 20)

With more parentheses, a field is returned for each pair, even if some of the pairs don't match, in which case undefined values are returned in those positions. So if you say:

split /(-)|(,)/, "1-10,20";

you get the value:

(1, '-', undef, 10, undef, ',', 20)

The /PATTERN/ argument may be replaced with an expression to specify patterns that vary at run-time. (To do run-time compilation only once, use /$variable/o.) As a special case, specifying a space " " will split on whitespace just as split with no arguments does. Thus, split(" ") can be used to emulate awk's default behavior, whereas split(/ /) will give you as many null initial fields as there are leading spaces. (Other than this special case, if you supply a string instead of a regular expression, it'll be interpreted as a regular expression anyway.)

The following example splits an RFC-822 message header into a hash containing $head{Date}, $head{Subject}, and so on. It uses the trick of assigning a list of pairs to a hash, based on the fact that delimiters alternate with delimited fields. It makes use of parentheses to return part of each delimiter as part of the returned list value. Since the split pattern is guaranteed to return things in pairs by virtue of containing one set of parentheses, the hash assignment is guaranteed to receive a list consisting of key/value pairs, where each key is the name of a header field. (Unfortunately this technique loses information for multiple lines with the same key field, such as Received-By lines. Ah, well. . . .)

$header =~ s/\n\s+/ /g;      # Merge continuation lines.
%head = ('FRONTSTUFF', split /^([-\w]+):/m, $header);

The following example processes the entries in a UNIX passwd file. You could leave out the chop, in which case $shell would have a newline on the end of it.

open PASSWD, '/etc/passwd';
while (<PASSWD>) {
    chop;        # remove trailing newline
    ($login, $passwd, $uid, $gid, $gcos, $home, $shell) =
            split /:/;
    ...
}

The inverse of split is performed by join (except that join can only join with the same delimiter between all fields). To break apart a string with fixed-position fields, use unpack.