09 Part 8: The Substitution Operator

LinuxChix Perl Course Part 8: The "s///" Operator

1) Introduction
2) The "s///" Operator
3) Greediness
4) Exercises
5) Answer to Previous Exercise
6) Past Information
7) Credits
8) Licensing

-----------------------------------

1) Introduction

We're getting to the best part now. The "s///" operator is one of the most
powerful operators in Perl. You'll find yourself using it again and again.

Let me also mention that I'll be on vacation for the next two weeks, so you
won't hear from me again until September 12. But I encourage you to send your
exercises to the mailing list and to give suggestions on other people's
exercises.

-----------------------------------

2) The "s///" Operator

The "s///" operator performs a regular expression substitution. For example,
try the following program:

#!/usr/bin/perl -w
use strict;

my $text = 'food';
$text =~ s/foo/bear/;
print "$text\n";

The command "s/X/Y/" replaces the first match of the regular expression X
with the text Y. (We'll see how to replace all matches in a moment.) If X
does not match anywhere in the string, the string is left unchanged and no
error or warning is raised.
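
Here is a small sketch of what happens with and without a match. It relies on
the fact that "s///" returns the number of substitutions it made, or a false
value if it made none. The variable names $changed and $count are only for
illustration.

```perl
#!/usr/bin/perl -w
use strict;

my $text = 'food';

# No match: $text is left alone and s/// returns a false value.
my $changed = ( $text =~ s/xyz/bear/ ) ? 1 : 0;
print "changed=$changed text=$text\n";    # changed=0 text=food

# Match: s/// returns the number of substitutions (1 without "g").
my $count = ( $text =~ s/foo/bear/ );
print "count=$count text=$text\n";        # count=1 text=beard
```

Testing the return value is handy when you want to report whether anything
was actually replaced.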

A common use for "s///" is to remove text entirely (by specifying that the
text is to be replaced by nothing):

$witness_data =~ s/Name: \w+//; # Anonymise data by removing name.

As you have probably guessed, "s///" can use any delimiter, just like "m//"
and "tr///":

$text =~ s/foo/bar/;
$text =~ s!foo!bar!;
$text =~ s<foo><bar>;
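
One place where a different delimiter pays off is text that is full of
slashes. The following sketch (the path and the variable names are made up
for the example) does the same substitution twice, once with "/" and once
with braces:

```perl
#!/usr/bin/perl -w
use strict;

my $path = '/usr/local/bin/perl';

# With "/" as the delimiter, every slash in the pattern must be escaped.
( my $with_slashes = $path ) =~ s/\/usr\/local/\/opt/;

# With braces as the delimiter, the slashes can stay as they are.
( my $with_braces = $path ) =~ s{/usr/local}{/opt};

print "$with_slashes\n";    # /opt/bin/perl
print "$with_braces\n";     # /opt/bin/perl
```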

The "s///" operator can also use parentheses just like "m//":

# Munge e-mail address. Note the "escaped at": @ is a special character.
$address =~ s/(.+)\@([^@]+)/$1 at $2/;
# Not a very logical way to do it; just for demonstration.

Note: you may see "\1" used instead of "$1" in the replacement. Both work, but
"$1" is the preferred method, and Perl warns about "\1" there when warnings
are switched on.
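
To try the address example yourself, here is a complete program. The address
is a made-up one on the reserved example.org domain.

```perl
#!/usr/bin/perl -w
use strict;

# Single quotes, so that the "@" is not treated as the start of an array.
my $address = 'alice@example.org';

# $1 is everything before the "@"; $2 is everything after it.
$address =~ s/(.+)\@([^@]+)/$1 at $2/;

print "$address\n";    # alice at example.org
```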

The "s///" operator accepts all of the options we discussed for "m//", and
they have the same meaning:

  i   Do case-insensitive pattern matching.
  m   Treat string as multiple lines (let /^/ and /$/ match at the start
      and end of each line).
  s   Treat string as single line (let /./ match "\n").
  g   Match globally, i.e., change all occurrences.

We will also discuss an additional option:

  e   Evaluate the replacement as Perl code.

Examples:

# Convert "foo", "Foo" or "FOO" to "bar".
$text =~ s/foo/bar/i;

# Reply to an e-mail. /^/ and /$/ mean beginning and end of line.
$headers =~ s/^Subject: (.*)$/Subject: Re: $1/m;

The "g" option still means "global", but with "s///" it means "replace all
instances" rather than just "match all instances":

# Convert all C++ comments to C comments.
$C_code =~ s#//(.*)$#/*$1*/#mg;

Note that we used the "$1" variable even though there are multiple matches; $1
is re-evaluated for each match.
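
Here is that substitution run on two lines of made-up C++ code, as a quick
check that $1 really does hold a different comment each time. The $count
variable is only there to show how many replacements were made.

```perl
#!/usr/bin/perl -w
use strict;

my $C_code = "int x = 1; // counter\nint y = 2; // limit\n";

# With "g", s/// returns the total number of replacements.
my $count = ( $C_code =~ s#//(.*)$#/*$1*/#mg );

print $C_code;
# int x = 1; /* counter*/
# int y = 2; /* limit*/
print "Replaced $count comments.\n";    # Replaced 2 comments.
```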

You won't use the "e" option very often, but it comes in very handy when you
do need it. For example:

# Convert Japanese yen to euros.
$price_list =~ s/([0-9.]+) yen/ ($1 * $conversion_factor) . ' euros' /ge;
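
Here is a runnable variation on the same idea. The conversion factor is a
made-up number, not a real exchange rate, and I've used sprintf (my own
addition) to round the result to two decimal places. Any Perl expression can
go in the replacement when "e" is used.

```perl
#!/usr/bin/perl -w
use strict;

my $conversion_factor = 0.01;    # Hypothetical rate, for illustration only.
my $price_list = "Tea: 500 yen\nCake: 1200 yen\n";

# The replacement is Perl code; its result becomes the new text.
$price_list =~
    s/([0-9.]+) yen/ sprintf('%.2f euros', $1 * $conversion_factor) /ge;

print $price_list;
# Tea: 5.00 euros
# Cake: 12.00 euros
```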

-----------------------------------

3) Greediness

Regular expressions in Perl are "greedy" by default, meaning that * and + will
match as many characters as possible. For example, in the following regular
expression:

my($answer) = ( $foo =~ /The answer is: (.+)/ );

the /.+/ will match the whole answer, even though it would theoretically be
correct to only match the first character of the answer, or the first few
characters.

Usually you want to be greedy, but not always. If you don't want to be greedy,
add a question mark after the * or +. This doesn't change WHETHER there's a
match; it only changes the CONTENT of the match.
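
A quick sketch of the difference, using the regular expression from above on
a made-up string. Both versions match; only the captured text differs.

```perl
#!/usr/bin/perl -w
use strict;

my $foo = 'The answer is: 42 or so';

# Greedy: (.+) takes as many characters as it can.
my ($greedy) = ( $foo =~ /The answer is: (.+)/ );

# Non-greedy: (.+?) takes as few as it can, i.e., one character.
my ($lazy) = ( $foo =~ /The answer is: (.+?)/ );

print "greedy: '$greedy'\n";    # greedy: '42 or so'
print "lazy:   '$lazy'\n";      # lazy:   '4'
```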

For example, if you want to strip HTML tags:

$html =~ s/<.*>//sg; # Oops: greediness got us here.

That doesn't work because Perl sees "<html>...</html>" as one big /<.*>/ (it
starts with "<" and ends with ">", right?). So it performs one replacement on
the entire HTML document, replacing the whole document with nothing! We need
to turn off the greediness:

$html =~ s/<.*?>//sg; # Much better.

This causes the /.*/ to be as short as possible while still making a match.

Of course, we could do this instead:

$html =~ s/<[^>]*>//sg; # Also works.
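
To compare the three versions side by side, here they are applied to copies
of the same small, made-up HTML fragment:

```perl
#!/usr/bin/perl -w
use strict;

my $html = "<html><b>Hello</b>,\n<i>world</i>!</html>";

# Copy $html, then modify the copy, so each version starts fresh.
( my $greedy  = $html ) =~ s/<.*>//sg;       # Removes everything.
( my $lazy    = $html ) =~ s/<.*?>//sg;      # Removes each tag.
( my $negated = $html ) =~ s/<[^>]*>//sg;    # Removes each tag.

print "greedy:  '$greedy'\n";     # greedy:  ''
print "lazy:    '$lazy'\n";       # 'Hello,' and 'world!' on two lines
print "negated: '$negated'\n";    # Same as lazy.
```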

But it's not always that easy. Consider this code to strip C-style comments
from your C code:

# Remove /* ... */ comments from your C code.
$C_code =~ s</\*(.*?)\*/><>sg;
# (Asterisks are escaped. Parentheses are provided only for clarity.
# The <> is the delimiter.)

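Once you get your head around that regular expression, you'll see that no
simple greedy regular expression would do: /.*/ would swallow everything
between the first "/*" and the last "*/", and /[^*]*/ would fail on any
comment that contains an asterisk.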
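
If you'd like to convince yourself, here is a sketch that tries the
non-greedy version, a greedy version, and a negated character class on some
made-up C code. The variable names are only for illustration.

```perl
#!/usr/bin/perl -w
use strict;

my $C_code = "int a; /* first */ int b; /* second,\n   two lines */ int c;\n";

# Non-greedy: each comment is removed separately.
( my $lazy = $C_code ) =~ s</\*(.*?)\*/><>sg;

# Greedy: everything from the first "/*" to the last "*/" is removed,
# including "int b;".
( my $greedy = $C_code ) =~ s</\*(.*)\*/><>sg;

# Negated character class: fails when the comment contains an asterisk.
my $tricky = 'x = 1; /* a * b */ y = 2;';
( my $negated = $tricky ) =~ s</\*[^*]*\*/><>sg;

print $lazy;             # int a;  int b;  int c;
print $greedy;           # int a;  int c;
print "$negated\n";      # Unchanged: x = 1; /* a * b */ y = 2;
```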

-----------------------------------

4) Exercises

a) Harvard cardiologist Thomas Michel writes in his "Guide to Politically
Correct Cardiology" about the importance of using inoffensive medical terms.
For example, he suggests saying "metabolically different" instead of the
highly offensive "dead"[*]. Write a Perl program that reads a medical
diagnosis (or any other input) and uses "s///" to change the word "dead" to
"metabolically different". Don't forget to use the "g" option to replace ALL
matches, and to use /\b/ (word break) to avoid false matches such as "deaden"
or "deadly".

[*] Yes, the paper is real and the scientist is real. Of course, he wasn't
taking himself too seriously when he wrote the paper.

b) American English is slightly different from British English in several
respects, one of which is spelling[*]. For example, many words ending in "ise"
in England end in "ize" in the United States, e.g., "organise" becomes
"organize". Write a Perl program that "translates" such words from American to
British English, i.e., changes words ending in "ize" to end in "ise". For
extra credit, take into account variations like "organizes" and "organizing".

[*] Blame it on Noah Webster. The writer of the first dictionary in America,
Webster deliberately chose to spell words differently for reasons both
practical (making the language easier to learn for immigrants) and patriotic
(declaring linguistic independence).

-----------------------------------

5) Answer to Previous Exercise

The previous exercise was to write a program to help you rhyme, where two
words are defined to "rhyme" if they have the last three letters in common.
Here is an example of such a program.

#!/usr/bin/perl -w
use strict;

my $possibilities = "Terence, this is stupid stuff:
You eat your victuals fast enough;
There can't be much amiss, 'tis clear,
to see the rate you drink your beer.
But oh, good Lord, the verse you make,
It gives a chap the belly-ache.";

# The beginning of a poem by A. E. Housman

while ( defined( my $line = <STDIN> ) ) {
    if ( my($last_letters) = ( $line =~ /(\w?\w?\w)\W*$/m ) ) {
        if ( my($match) = ( $possibilities =~ /^(.*$last_letters\W*)$/mi ) ) {
            print "That rhymes with: $match\n";
        }
        else {
            print "I can't think of anything that rhymes with that.\n";
        }
    }
    else {
        print "Sorry: I couldn't even begin to rhyme that one.\n";
    }
}

The first regular expression:

if ( my($last_letters) = ( $line =~ /(\w?\w?\w)\W*$/m ) ) {

means up to three "word characters" (and at least one), possibly followed by
non-word characters, followed by the end of the line (or the end of the
string). I used \w for brevity rather than [A-Za-z] even though \w includes
digits and the underscore character: there's no way for the computer to guess
that "3" rhymes with "see" anyway.

The question marks are there to anticipate the possibility that the last word
might be less than three letters long. (Our definition of "rhyming" didn't
actually take into account this case, but I considered it anyway.)
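
Here is a small sketch that feeds a few made-up lines to that regular
expression, including short words and a line with no word characters at all.
The @results array is only there to collect the output.

```perl
#!/usr/bin/perl -w
use strict;

my @results;
for my $line ( "stuff:\n", "oh!\n", "I\n", "...\n" ) {
    if ( my($last_letters) = ( $line =~ /(\w?\w?\w)\W*$/m ) ) {
        push @results, $last_letters;
    }
    else {
        # No word characters at all, so nothing to rhyme with.
        push @results, '(no match)';
    }
}

print join( ', ', @results ), "\n";    # uff, oh, I, (no match)
```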

The last-three-letters algorithm has few false positives but a lot of false
negatives (e.g., it wouldn't recognise that any of the lines in the above poem
by Housman rhyme). To try to improve accuracy, I wrote a regular expression
that interprets the rhyming part of the word as follows:
a) if the word ends in a "y", the rhyming part is the "y" and any vowels
immediately before it: "flY", "plAY".
b) if the word ends in an "e" not directly following a vowel, the rhyming part
is defined as in (c), except the silent "e" is treated as a consonant:
"possIBLE", "remAKE".
c) otherwise, the rhyming part is the part containing the last group of one or
more vowels followed by any consonants: "shACK", "sEA", "sEE".

The regular expression is as follows:

my $vowel = '[aeiou]';
my $consonant = '[b-df-hj-np-tv-z]';
my $consonant_or_e_not_y = '[b-hj-np-tvwxz]';

my $match_ending_letter = '[aiouy]';
my $match_continuing_letter = $consonant;

while ( defined( my $line = <STDIN> ) ) {
    if ( my($last_letters) = ( $line =~
        /($vowel*$consonant*$consonant_or_e_not_y|$vowel*$match_ending_letter)\W*$/im
    ) ) {
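
To see what this pattern picks out, here is a sketch that applies it to the
example words from rules (a) to (c). The rhyming_part() subroutine is a
hypothetical helper I've made up for this demonstration; it just wraps the
regular expression above.

```perl
#!/usr/bin/perl -w
use strict;

my $vowel                = '[aeiou]';
my $consonant            = '[b-df-hj-np-tv-z]';
my $consonant_or_e_not_y = '[b-hj-np-tvwxz]';
my $match_ending_letter  = '[aiouy]';

# Hypothetical helper: returns the "rhyming part" of the last word in $line,
# or undef if the pattern does not match.
sub rhyming_part {
    my ($line) = @_;
    my ($last_letters) = ( $line =~
        /($vowel*$consonant*$consonant_or_e_not_y|$vowel*$match_ending_letter)\W*$/im
    );
    return $last_letters;
}

for my $word (qw(fly play possible remake shack sea see)) {
    printf "%-8s -> %s\n", $word, rhyming_part("$word\n");
}
# fly      -> y
# play     -> ay
# possible -> ible
# remake   -> ake
# shack    -> ack
# sea      -> ea
# see      -> ee
```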

Though impressive, in practice it's not any more accurate than the
last-three-letters algorithm, and it's probably less accurate. Oh, well.

Finding rhymes is truly not for the fainthearted programmer. For example, the
words "food" and "good" don't rhyme even though their spelling suggests that
they would, and the word "read" can be pronounced either like "reed" or "red",
depending on its function in the sentence.

But if you ever do get really good at teaching poetry to computers, perhaps
you can join Damian Conway, "the mad scientist of Perl", who wrote a Perl
module to automatically generate a haiku before displaying an error message.
As he puts it:

When a program dies
what you need is a moment
of serenity.

-----------------------------------

6) Past Information

Part 1: Getting Started
http://linuxchix.org/pipermail/courses/2003-March/001147.html

Part 2: Scalar Data
http://linuxchix.org/pipermail/courses/2003-March/001153.html

Part 3: User Input
http://linuxchix.org/pipermail/courses/2003-April/001170.html

Part 4: Control Structures
http://linuxchix.org/pipermail/courses/2003-April/001184.html

Part 4.5, a review with a little new information at the end:
http://linuxchix.org/pipermail/courses/2003-July/001297.html

Part 5: The "tr///" Operator
http://linuxchix.org/pipermail/courses/2003-July/001302.html

Part 6: The "m//" Operator
http://linuxchix.org/pipermail/courses/2003-August/001305.html

Part 7: More About "m//"
http://linuxchix.org/pipermail/courses/2003-August/001322.html

-----------------------------------

7) Credits

Works cited: "man perlop"

Thanks to Jacinta Richardson for fact checking.

-----------------------------------

8) Licensing

This course (i.e., all parts of it) is copyright 2003 by Alice Wood and Dan
Richter, and is released under the same license as Perl itself (Artistic
License or GPL, your choice). This is the license of choice to make it easy
for other people to integrate your Perl code/documentation into their own
projects. It is not generally used in projects unrelated to Perl.