07 Part 6: Data Extraction with the Match Operator

LinuxChix Perl Course Part 6: Data Extraction with m//

1) Introduction
2) Match Extraction
3) Greedy Matching
4) Exercise
5) Answers to Previous Exercises
6) Acknowledgements
7) Licensing


----------------------------------------

1) Introduction

The m// operator is actually much more powerful than we saw in the
last lesson. In this lesson we're going to see how to use it to
extract information.

----------------------------------------

2) Match Extraction

The m// operator allows you to extract text from a string. Simply
enclose the relevant part of the pattern in parentheses. The content
of the first set of parentheses will be stored in the variable $1,
the second in $2, etc.

Example:
# Match digits-colon-digits-colon-digits
if ( $time =~ /(\d+):(\d+):(\d+)/ ) {
    $hour   = $1;  # First set of parentheses
    $minute = $2;  # Second set of parentheses
    $second = $3;  # Third set of parentheses
}
else {
    print "Time wasn't valid.\n";
}

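Alternatively, when used in list context, m// returns a list of the
text captured by each set of parentheses (or an empty list if the
pattern doesn't match), so we can compress the above example into one
line: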

if ( ($hour, $minute, $second) = ( $time =~ /(\d+):(\d+):(\d+)/ ) )
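
To see the one-line form at work, here is a minimal sketch of a
complete program. The helper name parse_time and the two sample
strings are made up for illustration; they are not part of the course
material.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: returns (hour, minute, second), or an empty
# list if the string contains no digits:digits:digits sequence.
sub parse_time {
    my ($time) = @_;

    # In list context, m// returns the captured substrings.
    my @parts = ( $time =~ /(\d+):(\d+):(\d+)/ );
    return @parts;
}

foreach my $time ( '12:34:56', 'half past two' ) {
    # The list assignment counts as true only if something was returned.
    if ( my ($hour, $minute, $second) = parse_time($time) ) {
        print "$time -> hour=$hour minute=$minute second=$second\n";
    }
    else {
        print "$time -> Time wasn't valid.\n";
    }
}
```

The first string should take the "if" branch and the second the "else"
branch, because a failed match hands back an empty list.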

When extracting text, we don't technically need to put the m// in an
"if" statement, but it's highly recommended. You should always take
into account the possibility that the pattern doesn't match.
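
One reason the check matters: when a match fails, Perl does not clear
$1, so it can still hold text from an earlier successful match. The
following is a small sketch of that pitfall; the sample strings and
the variable names $matched and $stale are invented for the
demonstration.

```perl
#!/usr/bin/perl -w
use strict;

my $matched;

$matched = ( '12:34:56'  =~ /(\d+):/ );   # Succeeds; $1 is now "12".
$matched = ( 'no digits' =~ /(\d+):/ );   # Fails; $1 is left unchanged.

# Unsafe: $1 still holds "12" from the earlier, successful match.
my $stale = $1;
print "After the failed match, \$1 is still '$stale'\n";

# Safe: only read $1 when the match succeeded.
if ( 'no digits' =~ /(\d+):/ ) {
    print "Hour is $1\n";
}
else {
    print "No match, so \$1 is not used.\n";
}
```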

----------------------------------------

3) Greedy Matching

Examine the following program. What will it output?

#!/usr/bin/perl -w
use strict;

my $html = '<p>foo</p> <p>bar</p> <p>baz</p>';

if ( $html =~ m:<p>(.*)</p>: ) {
    print "$1\n";
}

It doesn't output "foo", "bar" or "baz". Instead, it outputs
"foo</p> <p>bar</p> <p>baz". This is because the asterisk quantifier
is "greedy": it always matches as much text as it can while still
letting the rest of the pattern match.

If you don't want a greedy match, add a question mark after the
asterisk:

if ( $html =~ m:<p>(.*?)</p>: ) # Output is "foo".

Ditto for the plus sign:

if ( $html =~ m:<p>(.+?)</p>: )

Greediness never affects WHETHER there is a match; it only affects
exactly what text is matched.
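
The difference can be checked side by side with a short sketch like
the one below. The variable names ($greedy, $lazy, $broken and so on)
and the '<p>foo' test string are assumptions made for this
illustration.

```perl
#!/usr/bin/perl -w
use strict;

my $html = '<p>foo</p> <p>bar</p> <p>baz</p>';

# Same pattern, with and without the question mark.
my ($greedy) = ( $html =~ m:<p>(.*)</p>: );
my ($lazy)   = ( $html =~ m:<p>(.*?)</p>: );

print "Greedy:     $greedy\n";   # foo</p> <p>bar</p> <p>baz
print "Non-greedy: $lazy\n";     # foo

# A string with no closing tag: neither version can match.
my $broken = '<p>foo';

my $greedy_ok = ( $broken =~ m:<p>(.*)</p>: )  ? 'yes' : 'no';
my $lazy_ok   = ( $broken =~ m:<p>(.*?)</p>: ) ? 'yes' : 'no';

print "Greedy matches broken string?     $greedy_ok\n";   # no
print "Non-greedy matches broken string? $lazy_ok\n";     # no
```

Both versions succeed on the first string and both fail on the second;
only the captured text differs.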

----------------------------------------

4) Exercise

Modify your program from last week's exercise, the one that reads
/etc/passwd, so that it outputs only your home directory and your
shell (not the rest of the line).

----------------------------------------

5) Answers to Previous Exercises

a) The following program reads /etc/passwd and outputs the line
corresponding to my account name:

#!/usr/bin/perl -w
use strict;

my $username = 'dan';

open PASSWD, "< /etc/passwd" or die "Couldn't open /etc/passwd: $!";

while ( defined( my $line = <PASSWD> ) ) {
    if ( $line =~ m/^$username:/ ) {
        print $line;
    }
}

close PASSWD;

To avoid hard-coding the account name, we could use the UID instead:

#!/usr/bin/perl -w
use strict;

my $uid = $<; # $< is the real user ID (UID) of this process.

open PASSWD, "< /etc/passwd" or die "Couldn't open /etc/passwd: $!";

while ( defined( my $line = <PASSWD> ) ) {
    # Skip the name and password fields, then compare the UID field.
    if ( $line =~ m/^[^:]+:[^:]*:$uid:/ ) {
        print $line;
    }
}

close PASSWD;
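
To see why the UID pattern is written that way, here is a small sketch
that runs it against a few made-up lines instead of the real file. The
account names, the UID value 100 and the @matches array are all
assumptions for the demonstration.

```perl
#!/usr/bin/perl -w
use strict;

# Made-up sample lines in /etc/passwd format (name:password:UID:GID:...).
my @lines = (
    "alice:x:100:100:Alice:/home/alice:/bin/bash\n",
    "bob:x:1000:1000:Bob:/home/bob:/bin/sh\n",
    "carol:x:2000:100:Carol:/home/carol:/bin/sh\n",
);

my $uid = 100;    # Hypothetical UID; the program above uses $< instead.

my @matches;
foreach my $line (@lines) {
    # [^:]+ is the account name and [^:]* the password field, so
    # $uid is compared against the third field only.
    # The trailing colon stops 100 from matching 1000.
    if ( $line =~ m/^[^:]+:[^:]*:$uid:/ ) {
        push @matches, $line;
    }
}

print @matches;   # Only the "alice" line.
```

The "bob" line is rejected because of the trailing colon, and the
"carol" line because its 100 is in the GID field, not the UID field.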

b) This program reads a C or C++ source file and outputs all #include
statements:

#!/usr/bin/perl -w
use strict;

open IN, "< foo.c" or die "Couldn't open foo.c: $!";

while ( defined( my $line = <IN> ) ) {
    if ( $line =~ m/^\s*#include/ ) {
        print $line;
    }
}

close IN;

----------------------------------------

6) Acknowledgements

A big thank you to Jacinta Richardson for suggestions and
proofreading. More advanced Perl users might want to check out the
free material from Perl Training Australia
<http://www.perltraining.com.au/>, which she is a part of.

Other contributors include Meryll Larkin.

----------------------------------------

7) Licensing

This course (i.e., all parts of it) is copyright 2003-2005 by Dan
Richter and Alice Wood, and is released under the same license as
Perl itself (Artistic License or GPL, your choice). This is the
license of choice to make it easy for other people to integrate your
Perl code/documentation into their own projects. It is not generally
used in projects unrelated to Perl.