Search and Modification Operations
- m/PATTERN/gio
- /PATTERN/gio
- Searches a string for a pattern match,
and returns true (1) or false (''). If no string is
specified via the =~ or !~ operator, the $_ string is searched. (The string specified
with =~ need not be
an lvalue--it may be the result of an expression
evaluation, but remember the =~ binds rather tightly.) See also the section
on regular expressions.
If /
is the delimiter then the initial 'm' is optional. With
the 'm' you can use any pair of non-alphanumeric
characters as delimiters. This is particularly useful for
matching Unix path names that contain '/'. If the final
delimiter is followed by the optional letter 'i', the
matching is done in a case-insensitive manner. PATTERN
may contain references to scalar variables, which will be
interpolated (and the pattern recompiled) every time the
pattern search is evaluated. (Note that $) and $| may not be interpolated because they look
like end-of-string tests.) If you want such a pattern to
be compiled only once, add an "o" after the
trailing delimiter. This avoids expensive run-time
recompilations, and is useful when the value you are
interpolating won't change over the life of the script.
If the PATTERN evaluates to a null string, the most
recent successful regular expression is used instead.
If used in a context that requires
an array value, a pattern match returns an array
consisting of the subexpressions matched by the
parentheses in the pattern, i.e. ($1, $2, $3...). It does
NOT actually set $1, $2, etc. in this case, nor does it
set $+, $`, $& or $'. If the
match fails, a null array is returned. If the match
succeeds, but there were no parentheses, an array value
of (1) is returned.
Examples:
open(tty, '/dev/tty');
<tty> =~ /^y/i && do foo(); # do foo if desired
if (/Version: *([0-9.]*)/) { $version = $1; }
next if m#^/usr/spool/uucp#;
# poor man's grep
$arg = shift;
while (<>) {
print if /$arg/o; # compile only once
}
if (($F1, $F2, $Etc) = ($foo =~ /^(\S+)\s+(\S+)\s*(.*)/))
This last example splits $foo into
the first two words and the remainder of the line, and
assigns those three fields to $F1, $F2 and $Etc. The
conditional is true if any variables were assigned, i.e.
if the pattern matched.
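The return values described above can be sketched as follows. This is a minimal illustration that assumes a current Perl 5 interpreter; the variable names ($path, @plain, @none and so on) and the sample strings are invented for the example.

```perl
use strict;
use warnings;

my $path = '/usr/spool/uucp/LOGFILE';

# With a leading 'm', any non-alphanumeric delimiter works,
# so the '/' characters in the path need no escaping.
my $is_uucp = ($path =~ m#^/usr/spool/uucp#) ? 1 : 0;

# List context with parentheses: the match returns the captured substrings.
my ($first, $second, $rest) =
    ('alpha beta gamma delta' =~ /^(\S+)\s+(\S+)\s*(.*)/);

# List context without parentheses: a successful match returns the list (1).
my @plain = ('Version: 4.036' =~ /version/i);

# A failed match in list context returns the empty list.
my @none = ('Version: 4.036' =~ /release/i);

print "$is_uucp [$first] [$second] [$rest] @plain ", scalar(@none), "\n";
```

Because a failed match yields an empty list, the list assignment itself can serve as the condition of an if, as in the last example above.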
The "g" modifier
specifies global pattern matching--that is, matching as
many times as possible within the string. How it behaves
depends on the context. In an array context, it returns a
list of all the substrings matched by all the parentheses
in the regular expression. If there are no parentheses,
it returns a list of all the matched strings, as if there
were parentheses around the whole pattern. In a scalar
context, it iterates through the string, returning TRUE
each time it matches, and FALSE when it eventually runs
out of matches. (In other words, it remembers where it
left off last time and restarts the search at that point.)
It presumes that you have not modified the string since
the last match. Modifying the string between matches may
result in undefined behavior. (You can actually get away
with in-place modifications via substr() that do not change the length of the
entire string. In general, however, you should be using s///g
for such modifications.) Examples:
# array context
($one,$five,$fifteen) = (`uptime` =~ /(\d+\.\d+)/g);
# scalar context
$/ = ""; $* = 1;
while ($paragraph = <>) {
while ($paragraph =~ /[a-z]['")]*[.!?]+['")]*\s/g) {
$sentences++;
}
}
print "$sentences\n";
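The two contexts can be compared side by side on a fixed string. This is a small sketch assuming a current Perl 5 interpreter; the sample text stands in for real `uptime` output and is made up for the illustration.

```perl
use strict;
use warnings;

my $text = 'load average: 0.42, 1.05, 2.50';

# List context: every substring matched by the parentheses, in order.
my @loads = ($text =~ /(\d+\.\d+)/g);

# No parentheses: behaves as if the whole pattern were parenthesized.
my @words = ('one two three' =~ /\w+/g);

# Scalar context: each evaluation resumes where the previous match ended,
# and the loop stops when no further match is found.
my $count = 0;
$count++ while $text =~ /\d+\.\d+/g;

print "@loads | @words | $count\n";
```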
- ?PATTERN?
- This is just like the /pattern/
search, except that it matches only once between calls to
the reset operator. This is a useful optimization
when you only want to see the first occurrence of
something in each file of a set of files, for instance.
Only ?? patterns local to the current package are reset.
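A rough sketch of the match-once behavior, assuming a current Perl 5 interpreter: recent releases require the explicit `m` prefix (`m?PATTERN?`), so that form is used here. The sample lines and the @first_hits array are invented for the example.

```perl
use strict;
use warnings;

my @first_hits;
for my $pass (1, 2) {
    for my $line ('foo 1', 'foo 2', 'foo 3') {
        # Succeeds only the first time it matches after each reset.
        push @first_hits, "pass $pass: $line" if $line =~ m?foo?;
    }
    reset;    # no argument: re-arm the ?? patterns of the current package
}

print "$_\n" for @first_hits;
```

Each pass records only its first matching line. Without the call to reset, the second pass would record nothing.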
- s/PATTERN/REPLACEMENT/gieo
- Searches a string for a pattern, and
if found, replaces that pattern with the replacement text
and returns the number of substitutions made. Otherwise
it returns false (0). The "g" is optional, and
if present, indicates that all occurrences of the pattern
are to be replaced. The "i" is also optional,
and if present, indicates that matching is to be done in
a case-insensitive manner. The "e" is likewise
optional, and if present, indicates that the replacement
string is to be evaluated as an expression rather than
just as a double-quoted string. Any non-alphanumeric
delimiter may replace the slashes; if single quotes are
used, no interpretation is done on the replacement string
(the e modifier overrides this, however); if backquotes
are used, the replacement string is a command to execute
whose output will be used as the actual replacement text.
If the PATTERN is delimited by bracketing quotes, the
REPLACEMENT has its own pair of quotes, which may or may
not be bracketing quotes, e.g. s(foo)(bar) or
s<foo>/bar/. If no string is specified via the =~ or !~ operator,
the $_ string is searched and modified. (The
string specified with =~ must be a
scalar variable, an array element, or an assignment to
one of those, i.e. an lvalue.) If the pattern contains a
$ that looks like a variable rather than an end-of-string
test, the variable will be interpolated into the pattern
at run-time. If you only want the pattern compiled once
the first time the variable is interpolated, add an
"o" at the end. If the PATTERN evaluates to a
null string, the most recent successful regular
expression is used instead. See also the section on
regular expressions. Examples:
s/\bgreen\b/mauve/g; # don't change wintergreen
$path =~ s|/usr/bin|/usr/local/bin|;
s/Login: $foo/Login: $bar/; # run-time pattern
($foo = $bar) =~ s/bar/foo/;
$_ = 'abc123xyz';
s/\d+/$&*2/e; # yields 'abc246xyz'
s/\d+/sprintf("%5d",$&)/e; # yields 'abc  246xyz'
s/\w/$& x 2/eg; # yields 'aabbcc  224466xxyyzz'
s/([^ ]*) *([^ ]*)/$2 $1/; # reverse 1st two fields
(Note the use of $ instead of \ in
the last example. See section on regular expressions.)
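The return value and the effect of the delimiters can be checked with a short script. This is a sketch assuming a current Perl 5 interpreter; all variable names and sample strings are hypothetical.

```perl
use strict;
use warnings;

my $text = 'green wintergreen green';

# The return value is the number of substitutions made.
my $changed = ($text =~ s/\bgreen\b/mauve/g);

# Copy first, then modify the copy; bracketing delimiters come in pairs.
(my $copy = $text) =~ s{mauve}{teal};

# With "e" the replacement is evaluated as an expression.
my $price = 'cost: 21';
$price =~ s/(\d+)/$1 * 2/e;

# Single-quote delimiters: the replacement is taken literally.
my $literal = 'name';
$literal =~ s'name'$name';

print "$changed | $text | $copy | $price | $literal\n";
```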
- study(SCALAR)
- study SCALAR
- study
- Takes extra time to study SCALAR ($_ if unspecified) in anticipation of doing
many pattern matches on the string before it is next
modified. This may or may not save time, depending on the
nature and number of patterns you are searching on, and
on the distribution of character frequencies in the
string to be searched--you probably want to compare
runtimes with and without it to see which runs faster.
Those loops which scan for many short constant strings (including
the constant parts of more complex patterns) will benefit
most. You may have only one study active at a time--if
you study a different scalar the first is "unstudied".
(The way study works is this: a linked list of every
character in the string to be searched is made, so we
know, for example, where all the 'k' characters are. From
each search string, the rarest character is selected,
based on some static frequency tables constructed from
some C programs and English text. Only those places that
contain this "rarest" character are examined.)
For example, here is a loop which inserts
index-producing entries before any line containing a
certain pattern:
while (<>) {
study;
print ".IX foo\n" if /\bfoo\b/;
print ".IX bar\n" if /\bbar\b/;
print ".IX blurfl\n" if /\bblurfl\b/;
...
print;
}
In searching for /\bfoo\b/, only
those locations in $_ that
contain 'f' will be looked at, because 'f' is rarer than
'o'. In general, this is a big win except in pathological
cases. The only question is whether it saves you more
time than it took to build the linked list in the first
place.
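The idea behind study can be modeled in a few lines of ordinary Perl. This is only a toy sketch, not how perl implements it: it uses arrays of offsets instead of a linked list, and it picks the rarest character from the actual counts in the text rather than from static frequency tables. The subroutine names char_positions and find_with_index are invented for the illustration.

```perl
use strict;
use warnings;

# Map each character of $text to the list of offsets where it occurs.
sub char_positions {
    my ($text) = @_;
    my %pos;
    for my $i (0 .. length($text) - 1) {
        push @{ $pos{ substr($text, $i, 1) } }, $i;
    }
    return \%pos;
}

# Return every offset of $needle in $text, examining only the places
# that contain the needle's rarest character.
sub find_with_index {
    my ($text, $pos, $needle) = @_;
    return () if $needle eq '';

    my ($best_off, $best_count);
    for my $k (0 .. length($needle) - 1) {
        my $list  = $pos->{ substr($needle, $k, 1) };
        my $count = $list ? scalar @$list : 0;
        return () if $count == 0;    # a needed character never occurs
        ($best_off, $best_count) = ($k, $count)
            if !defined $best_count || $count < $best_count;
    }

    my @found;
    for my $p (@{ $pos->{ substr($needle, $best_off, 1) } }) {
        my $start = $p - $best_off;
        next if $start < 0;
        push @found, $start
            if substr($text, $start, length $needle) eq $needle;
    }
    return @found;
}

my $text = 'the quick brown fox kicks a rock';
my $pos  = char_positions($text);
my @at   = find_with_index($text, $pos, 'kick');
print "@at\n";
```

For the needle 'kick', the rarest character in this text is 'i', so only two candidate positions are compared instead of every offset. Building the table is the up-front cost that the search savings have to repay.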
Note that if you have to look for
strings that you don't know till runtime, you can build
an entire loop as a string and eval that to avoid
recompiling all your patterns all the time. Together with
undefining $/ to input
entire files as one record, this can be very fast, often
faster than specialized programs like fgrep. The
following scans a list of files (@files) for a list of
words (@words), and prints out the names of those files
that contain a match:
$search = 'while (<>) { study;';
foreach $word (@words) {
$search .= "++\$seen{\$ARGV} if /\\b$word\\b/;\n";
}
$search .= "}";
@ARGV = @files;
undef $/;
eval $search; # this screams
$/ = "\n"; # put back to normal input delim
foreach $file (sort keys(%seen)) {
print $file, "\n";
}
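The same build-then-eval technique can be tried without any files by generating an anonymous subroutine that searches a string. This is a minimal sketch assuming a current Perl 5 interpreter; the names $matcher, @hits and @found are hypothetical, and the words are assumed to be plain words containing no quote characters.

```perl
use strict;
use warnings;

my @words = qw(foo bar blurfl);    # assumed: plain words, no quotes

# Assemble the source of an anonymous sub with one literal pattern per word.
my $code = "sub {\n    my (\$text) = \@_;\n    my \@hits;\n";
for my $word (@words) {
    my $pattern = quotemeta $word;
    $code .= "    push \@hits, '$word' if \$text =~ /\\b$pattern\\b/;\n";
}
$code .= "    return \@hits;\n}\n";

# Compile the generated source once; later calls reuse the compiled patterns.
my $matcher = eval $code or die "could not compile generated code: $@";

my @found = $matcher->('a bar walked into a foobar');
print "@found\n";
```

Because each word ends up as a constant in the generated source, no pattern contains a variable that would force recompilation at run time.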
- tr/SEARCHLIST/REPLACEMENTLIST/cds
- y/SEARCHLIST/REPLACEMENTLIST/cds
- Translates all occurrences of the
characters found in the search list into the
corresponding characters in the replacement list. It
returns the number of characters replaced or deleted. If
no string is specified via the =~ or !~ operator,
the $_ string is translated. (The string specified
with =~ must be a
scalar variable, an array element, or an assignment to
one of those, i.e. an lvalue.) For sed devotees, y
is provided as a synonym for tr. If the SEARCHLIST
is delimited by bracketing quotes, the REPLACEMENTLIST
has its own pair of quotes, which may or may not be
bracketing quotes, e.g. tr[A-Z][a-z] or tr(+-*/)/ABCD/.
If the c modifier is specified, the
SEARCHLIST character set is complemented. If the d
modifier is specified, any characters specified by
SEARCHLIST that are not found in REPLACEMENTLIST are
deleted. (Note that this is slightly more flexible than
the behavior of some tr programs, which delete
anything they find in the SEARCHLIST, period.) If the s
modifier is specified, sequences of characters that were
translated to the same character are squashed down to 1
instance of the character.
If the d modifier was used, the
REPLACEMENTLIST is always interpreted exactly as
specified. Otherwise, if the REPLACEMENTLIST is shorter
than the SEARCHLIST, the final character is replicated
till it is long enough. If the REPLACEMENTLIST is null,
the SEARCHLIST is replicated. This latter is useful for
counting characters in a class, or for squashing
character sequences in a class.
Examples:
$ARGV[1] =~ y/A-Z/a-z/; # canonicalize to lower case
$cnt = tr/*/*/; # count the stars in $_
$cnt = tr/0-9//; # count the digits in $_
tr/a-zA-Z//s; # bookkeeper -> bokeper
($HOST = $host) =~ tr/a-z/A-Z/;
y/a-zA-Z/ /cs; # change non-alphas to single space
tr/\200-\377/\0-\177/; # delete 8th bit
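The modifiers and the list-length rules can be exercised together on copies of fixed strings. This is a small sketch assuming a current Perl 5 interpreter; every variable name and sample string is made up for the illustration.

```perl
use strict;
use warnings;

# Copy, then translate the copy, as in the ($HOST = $host) example.
(my $upper = 'example.org') =~ tr/a-z/A-Z/;

# Empty REPLACEMENTLIST: nothing changes, but the matches are counted.
my $phone  = '555-0142';
my $digits = ($phone =~ tr/0-9//);

# "s" squashes runs of the same translated character.
(my $squashed = 'bookkeeper') =~ tr/a-zA-Z//s;

# "c" complements the SEARCHLIST; with "s", each run of
# non-letters collapses to a single space.
(my $spaced = 'hello, world!') =~ tr/a-zA-Z/ /cs;

# "d" deletes SEARCHLIST characters with no counterpart in REPLACEMENTLIST.
(my $deleted = 'a-b-c') =~ tr/ab-/AB/d;

# Short REPLACEMENTLIST without "d": its final character is repeated.
(my $padded = 'abcabc') =~ tr/a-c/xy/;

print join(' | ', $upper, $digits, $squashed, $spaced, $deleted, $padded), "\n";
```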