Example: Writing a spam filter in C

A while ago someone mentioned being more interested in how to use C to
do everyday practical applications, like spam filtering, than in using
it to do things like generate Fibonacci numbers. I offered that I have
a little C program I use in spam filtering that I could post in stages.
Didn't get much response so I haven't done anything about it before now.
But I just got two spams this morning that slipped through my existing
filter, and I'm pissed and need to update my filter program anyway,
so why not do it here and see if anyone is interested?

First, let me lay out the basic idea behind what I'm trying to do.
I filter my mail with procmail. If you haven't already set up Procmail
and want to try it, I highly recommend the Infinite Ink Procmail
Quickstart, http://www.ii.com/internet/robots/procmail/qs/

I use a set of procmail rules based on the package "spast" (Simple
Procmail Anti-Spam Template), which unfortunately doesn't seem to be
on the net any more. Here's what a basic spast rule looks like.
First, the file sets up a bunch of definitions like this:

SUBJECT=`$FORMAIL -x Subject:`
SUBJECT_REJECTS=$HOME/Procmail/subjectRejects
FGREP=/usr/bin/fgrep

and then it uses those variables in a bunch of rules like this:

:0:
* ? (echo "$SUBJECT" | $FGREP -i -f $SUBJECT_REJECTS)
$SPAM_FILE

SUBJECT_REJECTS is a file ($HOME/Procmail/subjectRejects) where I list
phrases (one per line) like "viagra" and "make money" and "hot babes".
Any mail that comes in with one of those phrases in its subject line
gets filtered to my spam folder, which I check periodically so I don't
miss anything.

None of that has anything to do with C; it's just background. This
approach works fine as far as it goes, but there are a lot of things
that are hard to do with grep, but easy to do with a C program.
One thing I found early on was that I got a lot of spam in Asian
languages, which didn't necessarily use the right charset in the
mail headers, so I couldn't rely on filtering on charset (though I
do that too). These messages just look like gibberish in my inbox
-- masses of consonants and punctuation marks. To my eye they're
obviously nothing I'm interested in, but how to get procmail to
recognize that?

So I wrote a program called isenglish.c to try to detect these messages.
(Note: at this point it actually can't tell English from Spanish from
French; mostly it just filters out Asian languages that ought to have
been in other charsets but aren't. So the name is slightly misleading.)

The procmail rule looks like this:

ISENGLISH=$HOME/bin/isenglish
:0:
* ! ? (echo "$SUBJECT" | $ISENGLISH)
$SPAM_FILE

Okay, now we finally get into the C part of the lesson.
I'm just going to do a subset of the program at first; I'll flesh it
out in a later lesson. If you want to skip ahead and see the program
as I'm actually using it right now (it has some debugging and other
hooks in it), it's at http://shallowsky.com/software/isenglish.c

The executable is going to be called "isenglish" and it lives in a
subdirectory called bin of my home directory. It reads characters from
standard input, and since procmail will check the exit code of the
program, it's expected to exit with status 0 if it thinks the input is
English, or nonzero if it thinks it's not.
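
To see the exit status the way a caller such as procmail would, here
is a minimal sketch. It assumes a POSIX system; ExitStatusOf() is a
made-up helper name, not part of isenglish or procmail.

```c
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: run a shell command and return its exit status,
 * or -1 if it could not be run or did not exit normally.
 */
int
ExitStatusOf(const char *command)
{
    int status = system(command);   /* runs the command via the shell */

    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);     /* 0 conventionally means success */
}
```

Once the program is built, something like
ExitStatusOf("echo 'make money fast' | $HOME/bin/isenglish")
should give 0 when the filter thinks the text is English and nonzero
otherwise.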

I'm going to have a routine called GetScore() which reads characters
and keeps a running score (high score means it's likely to be English,
low means likely not).

For this first lesson, let's look at these things:
- the mix of punctuation to letters
- the average word length

I'll annotate as I go, with C comments:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

/* Returns a score from 0 to 100 indicating confidence that it's English */
int
GetScore(void)
{
    /*
     * I'll be doing statistics on the types of characters read.
     * So I need a bunch of integer variables to store the
     * running totals:
     */
    int total = 0;          /* total number of characters we've seen */
    int punct = 0;          /* number of punctuation characters */
    int unprint = 0;        /* number of unprintable characters */
    int alpha = 0;          /* number of letters */
    int num = 0;            /* number of digits */
    int words = 0;          /* number of words seen so far */
    int thiswordlength = 0;
    int avwordlength = 0;   /* sum of word lengths until we divide below */
    int totalNonSpace = 0;
    int score;
    int c;  /* int, not char: getchar() returns an int, which keeps
             * EOF distinct from the byte 0xFF */

    /* Loop over characters from standard input */
    while ((c = getchar()) != EOF)
    {
        /*
         * Word length turns out to be a useful measure.
         * I'm defining words to end at a space or unprintable character.
         * It's also useful to see how many unprintables there are.
         */
        if (isspace(c) || !isprint(c))
        {
            if (thiswordlength > 0)
            {
                ++words;
                avwordlength += thiswordlength;
                thiswordlength = 0;
            }
            /* Tabs and newlines are whitespace, so don't count them here */
            if (!isspace(c) && !isprint(c))
            {
                ++unprint;
                ++totalNonSpace;
            }
        }
        else
        {
            /* If it's not a space, then consider it as part of a word */
            ++thiswordlength;

            if (isdigit(c))
                ++num;
            else if (isalpha(c))
                ++alpha;
            else
                ++punct;

            ++totalNonSpace;
        }
        ++total;
    }

    /* We're done with the loop over input characters.
     * Now we can start calculating statistics.
     */

    /* If we didn't end with a space,
     * we haven't counted the last word yet, so count it now:
     */
    if (thiswordlength > 0)
    {
        ++words;
        avwordlength += thiswordlength;
    }

    /* Nothing but whitespace (or nothing at all): there's nothing to
     * score, and we must not divide by zero below.
     */
    if (totalNonSpace == 0)
        return 0;

    /* Compare alphanum chars to punct chars */
    score = (alpha + num) * 100 / totalNonSpace;
    printf("percentage of alpha + num: %d\n", score);

    /* Check word lengths */
    if (words > 0)
        avwordlength /= words;
    printf("av word length: %d\n", avwordlength);

    /* Unscientific: just root out extreme cases */
    if (avwordlength < 3 && words > 5)
        score /= 2;
    if (avwordlength > 10)
        score /= 2;

    return score;
}

#define THRESHHOLD 55

int
main(void)
{
    int score = GetScore();

    if (score < THRESHHOLD)
    {
        printf("Score is %d: Not English.\n", score);
        exit(1);
    }
    printf("Score is %d: Probably English.\n", score);
    exit(0);
}

--------------
That's the program. If you call it isenglish.c, you can compile it
like this:
cc -o isenglish isenglish.c

Now you can run it: just type "./isenglish" (or plain "isenglish" if
it's somewhere in your path), then start typing stuff into it,
cut-and-paste from mail messages, cut-and-paste from the program
itself. Hit ^D when you're done typing, and the program prints out
whether it thinks the input is English or not.
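
If typing into the program gets tedious, a variant that scores a
string in memory is handy for experimenting. This is only a sketch:
ScoreString() is a hypothetical name, it assumes the default "C"
locale, and it follows the same two measures as GetScore() (share of
letters and digits, then average word length), though details may
differ from the listing.

```c
#include <ctype.h>

/*
 * Hypothetical variant of GetScore(): score a string in memory
 * instead of standard input. Returns 0 to 100.
 */
int
ScoreString(const char *text)
{
    int alnum = 0;      /* letters and digits */
    int nonspace = 0;   /* everything except whitespace */
    int words = 0;
    int wordchars = 0;  /* total characters inside words */
    int thisword = 0;
    int score, avlen;
    const unsigned char *p;

    /* unsigned char keeps bytes above 127 safe for <ctype.h> */
    for (p = (const unsigned char *)text; *p != '\0'; ++p)
    {
        if (isspace(*p) || !isprint(*p))
        {
            /* A space or unprintable byte ends the current word */
            if (thisword > 0)
            {
                ++words;
                wordchars += thisword;
                thisword = 0;
            }
            if (!isspace(*p))
                ++nonspace;     /* unprintables count against the score */
        }
        else
        {
            ++thisword;
            ++nonspace;
            if (isalnum(*p))
                ++alnum;
        }
    }
    if (thisword > 0)   /* the text didn't end with a space */
    {
        ++words;
        wordchars += thisword;
    }
    if (nonspace == 0)
        return 0;       /* nothing to score */

    score = alnum * 100 / nonspace;
    avlen = (words > 0) ? wordchars / words : 0;

    /* Same unscientific adjustments for extreme word lengths */
    if (avlen < 3 && words > 5)
        score /= 2;
    if (avlen > 10)
        score /= 2;
    return score;
}
```

With this sketch, ScoreString("make money fast") comes out at 100,
ScoreString("!!! $$$ ???") at 0, and ScoreString("a b c d e f") at 50,
because six one-letter words trip the short-word penalty.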

This is just a simple program at this point, but even something simple
like this can make a useful procmail filter, to filter out those
Asian-language spam messages.
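
One detail matters a lot for those messages: their bytes are mostly
above 127, and on machines where plain char is signed those values
are negative, which the <ctype.h> functions are not defined to
handle. A minimal sketch of counting them safely follows, assuming
the default "C" locale; PercentUnprintable() is a hypothetical helper
name.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: what percentage of the bytes in buf are
 * neither printable nor whitespace?
 */
int
PercentUnprintable(const char *buf, size_t len)
{
    size_t i;
    size_t unprint = 0;

    if (len == 0)
        return 0;   /* avoid dividing by zero on empty input */

    for (i = 0; i < len; ++i)
    {
        /* Convert to unsigned char before calling isprint/isspace */
        unsigned char c = (unsigned char)buf[i];

        if (!isprint(c) && !isspace(c))
            ++unprint;
    }
    return (int)(unprint * 100 / len);
}
```

Plain ASCII text should score 0 here, while a subject line made up
mostly of high-bit bytes scores close to 100.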

Is this useful? Please let me know. For the next lesson, if anyone
is interested, I'd like to add checks for things like "is this message
addressed to more than N people?"

...Akkana