08 Part 7: Changing Text with the Substitution Operator

LinuxChix Perl Course Part 7: Changing Text with the s/// Operator

1) Introduction
2) An Example
3) Data Extraction with s///
4) Pitfalls
5) Exercises
6) Answer to Previous Exercise
7) Acknowledgements
8) Licensing


----------------------------------------

1) Introduction

The s/// operator is my (personal) favorite operator in Perl. It
provides a very powerful means of changing text based on regular
expressions.

----------------------------------------

2) An Example

Try running this program:

#!/usr/bin/perl -w
use strict;

my $message = 'You need to RTFM.';

$message =~ s/RTFM/Read The Fine Manual/;
print "$message\n";

As you can see, the s/// operator substitutes one string for another.

Although the above example doesn't demonstrate it very well, s///
takes a regular expression. So the following substitution would also
work:

$message =~ s/[A-Z]{4}/Read The Fine Manual/;
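
Because the left side is a pattern rather than a fixed string, it can
match more than you intended. The sketch below is an illustration
only (the second message is made up): [A-Z]{4} replaces the first run
of four capital letters, whatever they happen to spell.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical messages, chosen to show how broad the pattern is.
my @messages = ( 'You need to RTFM.', 'Please RSVP by Friday.' );
my @changed;

foreach my $message (@messages) {
    my $copy = $message;
    # Replaces the first run of four capitals, whatever they spell.
    $copy =~ s/[A-Z]{4}/Read The Fine Manual/;
    push @changed, $copy;
}

print "$_\n" foreach @changed;
# You need to Read The Fine Manual.
# Please Read The Fine Manual by Friday.
```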

By the way, s/// can take almost any punctuation character as its
delimiter, just like its cousin m//. Bracketing characters such as
< and > are used in pairs:

$x =~ s/foo/bar/;
$x =~ s!foo!bar!;
$x =~ s#foo#bar#;
$x =~ s<foo><bar>;
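
A different delimiter is most useful when the pattern itself contains
slashes. Here is a small sketch of that; the path and the variable
names are invented for the example.

```perl
#!/usr/bin/perl -w
use strict;

my $path = '/usr/local/bin/perl';

# (my $copy = $path) =~ s/// copies the string, then changes the copy.

# With "/" as the delimiter, every slash needs a backslash.
(my $escaped = $path) =~ s/\/usr\/local/\/opt/;

# With "!" as the delimiter, the slashes can be written as-is.
(my $plain = $path) =~ s!/usr/local!/opt!;

print "$escaped\n";   # /opt/bin/perl
print "$plain\n";     # /opt/bin/perl
```

Both substitutions do exactly the same thing; the second is simply
easier to read.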

----------------------------------------

3) Data Extraction with s///

Like m//, s/// can use parentheses to extract data. The data can then
be inserted into the substitution expression, like this:

# Change reference from GIF to PNG.
$html =~ s/<img src="(.*?)\.gif"/<img src="$1.png"/;

The "$1" on the right refers to the "(.*?)" on the left.
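
With more than one pair of parentheses, the captures are numbered $1,
$2, $3 and so on, counting opening parentheses from left to right.
The following sketch runs the GIF example on a made-up tag and then
rearranges a made-up date using three captures.

```perl
#!/usr/bin/perl -w
use strict;

# Made-up input strings for illustration.
my $html = '<img src="logo.gif" alt="Logo">';
my $note = 'Released on 2005-03-17.';

# One capture: $1 holds whatever (.*?) matched, here "logo".
$html =~ s/<img src="(.*?)\.gif"/<img src="$1.png"/;

# Three captures: $1 = year, $2 = month, $3 = day.
# "!" is the delimiter so the replacement can contain slashes.
$note =~ s!(\d{4})-(\d{2})-(\d{2})!$3/$2/$1!;

print "$html\n";   # <img src="logo.png" alt="Logo">
print "$note\n";   # Released on 17/03/2005.
```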

----------------------------------------

4) Pitfalls

There are three common sources of errors when using s///. Let's
illustrate them with an example.

#!/usr/bin/perl -w
use strict;

my $html = '<p>Just <!-- xxx --> some <!-- xxx --> text</p>';
$html =~ s/<!--.*-->//; # Strip HTML comments - maybe.
print "$html\n";

The first problem is greediness, just as we saw with m//. The s///
will match from the beginning of the first comment to the end of the
second comment, and strip out the word "some" in the middle. Instead,
let's try this:

$html =~ s/<!--.*?-->//; # Notice the "?".

That turns off greediness: the match will be as small as possible.

But there's another problem: only one comment will be removed. If we
want to remove both comments, we could execute the s/// line twice,
but a better way is to use the "g" option ("g" for "global
substitution"):

$html =~ s/<!--.*?-->//g; # Notice the "g" on the end.

Now every match will be substituted, no matter how many there are.

As for the final pitfall, we'll have to change the example a little:

#!/usr/bin/perl -w
use strict;

my $html = "<p>A <!-- \n multi-line \n --> string in Perl</p>";
$html =~ s/<!--.*?-->//g; # Strip HTML comments - maybe.
print "$html\n";

Each "\n" is a newline. It's silly to hard-code them like that, but
if we had read the HTML code from a file there would probably be a
lot of newlines scattered around and we'd have to deal with them.

If you try executing this program, you'll see that the substitution
doesn't occur. That's because by default the "." pattern matches any
character EXCEPT a newline. To get "." to match newlines as well, use
the "s" option (for "treat as Single line"):

$html =~ s/<!--.*?-->//gs; # Note the "s" after the "g".

In this case, we combined the "g" and "s" options. That's fine, and
the order doesn't matter.

So to conclude, there are three common pitfalls with s///:
a) greediness: use "?" to turn it off
b) number of replacements: use "g" to replace all occurrences
c) newlines: use "s" to match across newlines
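
The three fixes can be compared side by side. The sketch below is not
one of the programs above; its input string is invented so that it
has two comments on the first line and a third that spans a newline.

```perl
#!/usr/bin/perl -w
use strict;

# Two comments on the first line, plus one that spans a newline.
my $html = "A<!-- x -->B<!-- y -->C<!-- z\nz -->D";

# Greedy, one substitution: swallows "B" along with two comments.
(my $greedy = $html) =~ s/<!--.*-->//;     # "AC<!-- z\nz -->D"

# Non-greedy, one substitution: only the first comment goes.
(my $once = $html) =~ s/<!--.*?-->//;      # "AB<!-- y -->C<!-- z\nz -->D"

# Non-greedy, global: the comment containing the newline survives.
(my $global = $html) =~ s/<!--.*?-->//g;   # "ABC<!-- z\nz -->D"

# Non-greedy, global, "." matches newlines: every comment is removed.
(my $all = $html) =~ s/<!--.*?-->//gs;     # "ABCD"

print "$all\n";
```

Writing (my $copy = $html) =~ s/// copies the string first and then
changes the copy, so $html stays intact for the next attempt.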

----------------------------------------

5) Exercises

a) Harry Potter fans know that Professor Lockhart succeeded Professor
Quirrell as Dark Arts teacher. What they don't know is that Headmaster
Dumbledore got a Perl hacker to write a quick script to change the
professor's name on Hogwarts' web site. It read from standard input
and sent the result to standard output. What did the script look
like?

b) Harvard cardiologist Thomas Michel writes in his "Guide to
Politically Correct Cardiology" about the importance of using
inoffensive medical terms. For example, he suggests saying
"metabolically different" instead of the highly offensive "dead"[*].
Write a Perl program that reads a medical diagnosis (or any other
input) and uses "s///" to change the word "dead" to "metabolically
different". How are you going to avoid false matches such as
"deaden"?

[*] Yes, the paper is real and the scientist is real. Of course, he
wasn't taking himself too seriously when he wrote the paper.

c) American English is slightly different from UK English in several
respects, one of which is spelling[*]. For example, many words ending
in "ise" in England end in "ize" in the United States, e.g.,
"organise" becomes "organize". Write a Perl program that "translates"
such words from American to UK English, i.e., changes words ending in
"ize" to end in "ise". For extra credit, take into account variations
like "organizes" and "organizing".

[*] Blame it on Noah Webster. The writer of the first dictionary in
America, Webster deliberately chose to spell words differently for
reasons both practical (making the language easier to learn for
immigrants) and patriotic (declaring linguistic independence).

d) Consider the following program:

#!/usr/bin/perl -w
use strict;

my $verse = "
99 bottles of beer on the wall.
99 bottles of beer.
Take one down, pass it around,
98 bottles of beer on the wall.\n\n";

print $verse;
$verse =~ s/(\d+)/ $1 - 1 /ge;
print $verse;

Notice the "e" option in the "s///" statement. What does it do?

(In case you're wondering, I don't drink beer. The words come from an
old and boring song which counts from 99 to 1.)

----------------------------------------

6) Answer to Previous Exercise

Here is a program that outputs your home directory and shell.

#!/usr/bin/perl -w
use strict;

my $uid = $<; # $< is the real user ID of this process.

open PASSWD, "< /etc/passwd" or die "Couldn't open /etc/passwd: $!";

while ( defined( my $line = <PASSWD> ) ) {
    if ( $line =~ m/^[^:]+:[^:]*:$uid:.*:(.*?):(.*?)$/ ) {
        print "Home dir = $1\n";
        print "Shell = $2\n";
    }
}

close PASSWD;

Note that the regular expression demonstrates two ways to prevent
greediness. On the left side of the regular expression, I use /[^:]/
to ensure that the match can't include a colon, which separates the
fields. On the right side I use /.*?/, which causes the match to be
as short as possible.
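
To see how the two techniques differ, here is a small sketch that
pulls the first field out of a passwd-style line in three ways. The
line itself and the variable names are made up for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# A made-up line in /etc/passwd format, for illustration only.
my $line = 'alice:x:1000:1000:Alice Example:/home/alice:/bin/bash';

# In list context, m// returns the text captured by the parentheses.
my ($greedy)  = $line =~ m/^(.*):/;      # runs on to the LAST colon
my ($negated) = $line =~ m/^([^:]*):/;   # can never cross a colon
my ($minimal) = $line =~ m/^(.*?):/;     # stops at the first colon

print "greedy  = $greedy\n";    # alice:x:1000:1000:Alice Example:/home/alice
print "negated = $negated\n";   # alice
print "minimal = $minimal\n";   # alice

# .*? will still cross a colon if the rest of the pattern demands it.
my ($stretched) = $line =~ m/^(.*?):\d+:/;
print "stretched = $stretched\n";   # alice:x
```

The last match shows that the two are not interchangeable: /.*?/ only
prefers a short match, whereas /[^:]*/ in the same position would
make the match fail rather than reach into the next field.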

----------------------------------------

7) Acknowledgements

A big thank you to Jacinta Richardson for suggestions and
proofreading. More advanced Perl users might want to check out the
free material from Perl Training Australia
<http://www.perltraining.com.au/>, which she is a part of.

Other contributors include Meryll Larkin.

----------------------------------------

8) Licensing

This course (i.e., all parts of it) is copyright 2003-2005 by Dan
Richter and Alice Wood, and is released under the same license as
Perl itself (Artistic License or GPL, your choice). This is the
license of choice to make it easy for other people to integrate your
Perl code/documentation into their own projects. It is not generally
used in projects unrelated to Perl.