(Bayesian) x (spam filters)

Joe Spinden
The following article from the NY Times crosses two topics of interest here,
albeit briefly.

SP@M SHEN@NIG@NS!!

That Gibberish in Your In-Box May Be Good News

By GEORGE JOHNSON

Published: January 25, 2004

If you could sit back with Zen-like detachment and observe the dross piling
up in your electronic mailbox, the spam wars might come to seem like a
fascinating electronic game. Like creatures running through a maze with
constantly shifting walls, spammers dart and weave to sneak their
solicitations past ever wilier junk mail filters. They are organisms, or
maybe genomes, grinding out one random mutation after another, desperately
trying to elude the Grim Reaper.

 

Viagra becomes "vi@gra" or "v-i-@-g-r-a." Then, as the filters adapt,
"v1@gr@" and even "\/l@gr@." Currently, the Internet is swarming with
mutants like this: "Cheap Val?(u)m, Viagr@, X(a)n@x, Som@ Di3t Pills Many
M3ds RIZfURqgHr77B," the final string of gibberish hanging like an appendage
of junk DNA.

 

Taking a different approach, a come-on for barnyard pornography devolves
into "faurm galz bing e rottic." Another pitch promises to reveal "Seakrets
of ((eks-eks-eks)) stars."

 

Dispiriting as it is to start the morning with a hundred of these
orthographic monsters crouching in your in-box, there is reason to take
heart. Measured in bits and bytes, the sheer volume of spam may not have
diminished. But advanced filtering software, which learns to recognize the
mercurial traits of junk e-mail, is having an effect. The spammers' messages
are becoming harder and harder to decipher. Sense is inevitably degenerating
into nonsense, like a pileup of random mutations in an endangered species
gasping its last breaths.

 

Earlier this month, when Internet experts met in Cambridge, Mass., for the
2004 Spam Conference (available as a Web broadcast at spamconference.org),
they showed just how far the science of spam fighting has come. For all the
recent talk of suing spammers and compiling a national do-not-spam list,
most speakers were putting their hopes in technological, not legal,
solutions. The federal government's new junk e-mail law, the Can Spam Act,
barely rated a mention.

 

Terry Sullivan, a spam researcher with a doctorate in information science,
described how he used a "handy 10-dimensional high-fidelity model of
historical spam space" to analyze how junk e-mail changes over time. Long
stretches of stability are suddenly interrupted by brief bursts of
innovation, a pattern he compared to what some evolutionary biologists call
punctuated equilibrium. The encouraging news is that there is enough
stability - an enduring core of "spamminess" - for the invaders to be
quickly identified and destroyed.

 

Another presentation, called "Cockroaches Hate the Light," considered how to
authenticate senders so that spammers can't easily fake their identities.
Other speakers proposed eco-electronic solutions like digital postage stamps
that would put a price on sending e-mail - trivial for an individual user
but making hit-or-miss barrages prohibitively expensive.

 

Like epidemiologists discussing how to predict and control a biological
outbreak, conferencegoers compared the merits of various filtering
techniques. Which is better: first-order Bayesian, token grab bag, sparse
binary polynomial hash or markovian weighting? The meaning of the terms may
be opaque to outsiders, but the underlying message comes through: the
spammers are up against some increasingly advanced cybernetic artillery.

 

Many experts believe that solving the spam problem will require a
combination of approaches. But laws take forever to pass and amend.
Technological fixes like sender authentication and electronic stamps would
also take time to carry out, but filtering is already here - and it is
reducing the spammers' messages to feeble signals swamped by a roar of
alphanumeric noise.

 

The turning point came in August 2002 when a computer scientist, Paul
Graham, issued a manifesto called "A Plan for Spam," describing how to
filter e-mail using a statistical method discovered in the 18th century by
the English theologian and mathematician Thomas Bayes. Bayesian e-mail
filters had been studied for years, but with Mr. Graham's paper the idea
went mainstream.

 

Presented with thousands of examples of good and bad e-mail, a Bayesian
filter compiles a list ranking each word according to how likely it is to
appear in junk e-mail. Rising to the top of the roster are high scorers like
Valium, Xanax, mortgage, porn and Viagra. Settling toward the bottom are
words like deciduous, cashmere and intensify. Hovering in the middle are the
vast number of neutral words that can swing either way.
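
[Poster's illustration, not part of the article.] The word-ranking step described above might be sketched roughly as follows. This is a simplified guess at the idea, not the scheme from "A Plan for Spam" or any particular filter: the function name, the whitespace tokenizer, the add-one smoothing, and the four tiny training messages are all assumptions made for the example.

```python
from collections import Counter

def word_spam_scores(spam_msgs, ham_msgs):
    """Score each word by how strongly it leans toward junk e-mail (0 to 1)."""
    # Count each word at most once per message.
    spam_counts = Counter(w for m in spam_msgs for w in set(m.lower().split()))
    ham_counts = Counter(w for m in ham_msgs for w in set(m.lower().split()))
    scores = {}
    for w in set(spam_counts) | set(ham_counts):
        # Add-one smoothing (an assumption) keeps scores away from exactly 0 or 1.
        s = (spam_counts[w] + 1) / (len(spam_msgs) + 2)
        h = (ham_counts[w] + 1) / (len(ham_msgs) + 2)
        scores[w] = s / (s + h)
    return scores

# Hypothetical training data, far smaller than the thousands a real filter sees.
spam = ["cheap viagra and valium", "mortgage rates viagra"]
ham = ["the deciduous trees intensify", "cashmere sweater and the trees"]

scores = word_spam_scores(spam, ham)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], scores["and"], scores["trees"])
```

With this toy data "viagra" lands at the top of the list, "and" sits in the neutral middle, and "trees" falls toward the bottom.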

 

When a new piece of e-mail arrives, the filtering program counts up the
words and computes an overall ranking. If the number exceeds a certain
threshold, the message is rejected as spam.

 

A message from a friend saying that she is so worried about refinancing her
mortgage that she took a Valium will pique the filter's interest. But most
of the text will probably consist of words with neutral or very low
rankings, dragging down the score and allowing the e-mail to go through.
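
[Poster's illustration, not part of the article.] The scoring-and-threshold step, and the example of the friend's message, could look something like the sketch below. It combines per-word scores with the usual naive Bayes product rule. The score table, the 0.9 threshold, and the function names are hypothetical values chosen for the example, not numbers taken from any real filter.

```python
import math

def spam_probability(message, scores, default=0.5):
    """Combine per-word scores into one spam probability.

    Every score must lie strictly between 0 and 1. Words missing from
    `scores` are treated as neutral (`default`).
    """
    words = set(message.lower().split())
    # Work in log space so long messages do not underflow.
    log_spam = sum(math.log(scores.get(w, default)) for w in words)
    log_ham = sum(math.log(1.0 - scores.get(w, default)) for w in words)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

def is_spam(message, scores, threshold=0.9):
    """Reject the message when its combined probability exceeds the threshold."""
    return spam_probability(message, scores) > threshold

# Hypothetical per-word scores; a real filter would learn these from mail.
example_scores = {
    "viagra": 0.99, "valium": 0.95, "mortgage": 0.90,
    "friend": 0.10, "worried": 0.20,
}

pitch = "cheap viagra valium mortgage"
note = "my friend is so worried about refinancing her mortgage that she took a valium"

print(is_spam(pitch, example_scores))  # high-scoring words dominate
print(is_spam(note, example_scores))   # low-scoring words drag the total down
```

Under these made-up numbers the sales pitch clears the threshold easily, while the friend's note comes out at roughly 0.83 and is let through.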

 

If a spam promising "l0w m0rtg@ge rates" slips by, the filter is informed by
the user that it has made an error. The mutation is then moved higher on the
list, as well as future mutations of the mutation, until the spammer is
reduced to sending gobbledygook. A recent e-mail message making the rounds
promised "Leacatharsisrn to make a fortcongestiveune on eBay!" (A Web link
inside led to a site with information on a money-making auction scheme.)
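
[Poster's illustration, not part of the article.] The correction loop, in which the user flags a missed spam and the misspelled word moves up the list, might be sketched like this. The class name, its methods, and the smoothing are assumptions for the example. It only learns the exact tokens it is shown, so it does not by itself anticipate "future mutations of the mutation."

```python
from collections import Counter

class FeedbackFilter:
    """Keeps running word counts that the user's corrections update."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, message, is_spam):
        """Record one message the user has labeled as spam or not spam."""
        words = set(message.lower().split())
        if is_spam:
            self.spam_counts.update(words)
            self.n_spam += 1
        else:
            self.ham_counts.update(words)
            self.n_ham += 1

    def score(self, word):
        """Smoothed estimate of how strongly a word leans toward spam."""
        s = (self.spam_counts[word] + 1) / (self.n_spam + 2)
        h = (self.ham_counts[word] + 1) / (self.n_ham + 2)
        return s / (s + h)

f = FeedbackFilter()
f.learn("cheap viagra", True)
f.learn("lunch on friday", False)

before = f.score("m0rtg@ge")         # never seen, so neutral
f.learn("l0w m0rtg@ge rates", True)  # the user reports the missed spam
after = f.score("m0rtg@ge")          # the mutation now ranks higher

print(before, after)
```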

 

Increasingly the subject lines convey no meaning at all: "begonia breadfruit
extempore defocus purveyor." For the spammer, the hope, slim as it seems, is
that a few curious souls will open and read the e-mail, which begins, "I
finally was able to lsoe the wieght" and goes on to offer a product
"Guanarteed to work or your menoy back!" Read out loud, the message sounds a
little like HAL the computer in "2001: A Space Odyssey" sinking into aphasia
as its synapses are severed one by one.

 

In what may be their final death throes, some spammers have begun sending
messages consisting of a single image or a one-line sales pitch -
"picospams" - with a link to a Web site. Often appended at the end, in an
attempt to flummox the filters, is a scrap of Dadaist poetry - "feverish
squirt feat transconductance terrify broken trite fascist axis stultify floc
bookshelves." Sometimes this "word salad," as it has come to be called, is
rendered in invisible ink - white letters on a white background - or hidden
inside an embedded formatting command.

 

No matter. The filters learn to adapt. If the spammers want to stay in
business, ultimately they must convey at least a hint of meaning. After all,
you cannot send a completely random message - or one that is blank - and
expect many people to click the link.
