Using rules for filtering spam

The lists of white and black rules are one of the most important mechanisms for filtering mail messages. Rules let you identify a message as belonging to a certain class by analyzing the information in its headers or its body.

Each rule consists of one or more conditions. Each condition contains a mask for finding a substring in a certain part of the message, plus logical operators. Masks are regular expressions: specially formatted strings that let you search text for sequences of characters with great flexibility. The full syntax of regular expressions is large and takes time to learn, but in most cases you do not need to understand every detail. For typical masks, a general idea of the basic constructions of the regular expression language is enough. The full description of regular expression syntax is available in the corresponding literature or, for instance, here

It is possible to mark conditions as "Strong" and "Not". By default, a rule is applied only if all its conditions are met, but if a "Strong" condition is met, the rule is applied independently of other conditions. "Not" means a logical negation, i.e. the condition is met if no string matching the mask is found.
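The evaluation logic described above can be sketched in Python. This is only an illustration, not the plug-in's actual implementation: the `Condition` class, the `rule_applies` function, and the dictionary-based message are hypothetical names introduced here, and the sketch assumes case-sensitive matching with Python's `re` module, which may differ from the engine the filter uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Condition:
    part: str             # hypothetical key for the message part, e.g. "Subject"
    mask: str             # regular expression to search for
    negate: bool = False  # the "Not" flag
    strong: bool = False  # the "Strong" flag

    def is_met(self, message: dict) -> bool:
        found = re.search(self.mask, message.get(self.part, "")) is not None
        # "Not" inverts the result: met when nothing matches the mask.
        return not found if self.negate else found


def rule_applies(conditions, message) -> bool:
    results = [(c, c.is_met(message)) for c in conditions]
    # A met "Strong" condition applies the rule regardless of the others.
    if any(c.strong and met for c, met in results):
        return True
    # Otherwise every condition must be met.
    return bool(results) and all(met for _, met in results)


rule = [
    Condition("Subject", r"offer", strong=True),
    Condition("Body", r"unsubscribe"),
]
# The Body condition fails, but the Strong condition on Subject is met.
print(rule_applies(rule, {"Subject": "special offer", "Body": "hello"}))
```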

Let us take several examples of creating rules for filtering messages. The text contains the descriptions of rule conditions in the form that is used in the list of conditions of a certain rule. For example, a construction like "Header{Subject} =~ some text" means that the condition for filtering the subject header by the mask "some text" has been created for the rule.

1) It is necessary to filter out as spam all messages containing various variants of the word viagra in their subjects.

To do this, you can create a new black rule with the following condition:

Header{Subject} =~ v.{0,2}i.{0,2}a.{0,2}g.{0,2}r.{0,2}a

This mask matches words that contain the letters of the word viagra in the same order with 0 to 2 arbitrary characters between each pair of letters. For example, v11i22agra and viaaggra both match. The character '.' in the mask stands for any single character, and the construction "{0,2}" right after the dot means that from 0 to 2 such characters may appear at that position.
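To get a feel for what this mask accepts, you can try it outside the filter. The sketch below uses Python's `re` module, which handles these constructions the same way for this pattern; it assumes case-sensitive matching, and the sample subjects are made up for illustration.

```python
import re

mask = re.compile(r"v.{0,2}i.{0,2}a.{0,2}g.{0,2}r.{0,2}a")

subjects = [
    "v11i22agra",    # two extra characters after 'v' and after 'i'
    "viaaggra",      # one extra character in two places
    "v-i-a-g-r-a",   # one extra character in every gap
    "vi___agra",     # three characters in one gap: too many
    "via graphics",  # an innocent phrase that happens to fit the mask
]

for subject in subjects:
    verdict = "match" if mask.search(subject) else "no match"
    print(f"{subject!r}: {verdict}")
```

Note that the last sample matches because "via gra" fits the mask. Loose masks like this one can catch normal messages, so it is worth checking them against real mail before enabling the rule.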

2) It is necessary to filter out as spam all messages whose subject contains both an exclamation mark and a question mark.

Create a black rule with the following condition:

Header{Subject} =~ \?.*!|!.*\?

The character '?' is a special character, so you should insert '\' before it in the mask to make the plug-in treat it as a regular character. The character '*' after the dot means that the preceding element (here, any character) may occur 0 or more times. The mask consists of two alternatives separated by the character '|'. The first alternative searches for strings where the question mark comes first and the exclamation mark comes second. Conversely, the second alternative searches for strings where the exclamation mark comes first. Both cases match the mask, since the character '|' means the logical operation OR.
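The behavior of the two alternatives can be checked with a short Python sketch. It assumes that Python's `re` module treats this mask the same way the plug-in does; the subjects are invented examples.

```python
import re

mask = re.compile(r"\?.*!|!.*\?")

subjects = [
    "Really? Act now!",    # '?' first, then '!': first alternative
    "Act now! Why wait?",  # '!' first, then '?': second alternative
    "Any questions?",      # only '?'
    "Hello!",              # only '!'
]

for subject in subjects:
    verdict = "match" if mask.search(subject) else "no match"
    print(f"{subject!r}: {verdict}")
```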

3) It is necessary to filter out as spam the variants of the word viagra where one of the characters "1lj" may take the place of the character "i".

Header{Subject} =~ v[1lj]agra

Characters that may take the place of 'i' are enumerated in square brackets. It means that all these variants match: v1agra, vlagra, vjagra.
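A quick check of the character class, again as a Python sketch under the assumption that the plug-in's engine treats classes the same way. The `widened` mask is an illustrative variation, not part of the original rule.

```python
import re

mask = re.compile(r"v[1lj]agra")

words = ["v1agra", "vlagra", "vjagra", "viagra", "v1lagra"]
matches = {word: mask.search(word) is not None for word in words}
print(matches)

# A class stands for exactly one character, so "v1lagra" does not match,
# and neither does the original spelling "viagra". Adding 'i' to the
# class covers the original spelling too.
widened = re.compile(r"v[i1lj]agra")
print(widened.search("viagra") is not None)
```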

4) Messages containing a certain string in their text should never be filtered out.

To do this, create the following white rule:

Body =~ key_string

The filter will search for the string "key_string" in the body of each message and treat every message that contains it as non-spam.

5) Deleting messages where the sender's address is empty (a black rule):

not Header{From} =~ \S+
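The mask \S+ matches any run of non-whitespace characters, so the negated condition is met only when the From header is empty or contains nothing but whitespace. A minimal Python sketch of that check (the function name `sender_is_empty` is made up for this example):

```python
import re


def sender_is_empty(from_header: str) -> bool:
    # "not ... =~ \S+": met when no non-whitespace character is found.
    return re.search(r"\S+", from_header) is None


for value in ["", "   ", "someone@example.com"]:
    print(repr(value), sender_is_empty(value))
```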

6) Deleting messages sent to a certain mailbox, but having no email address of this mailbox in their To and CC fields (a black rule with three conditions):

Header{Received} =~ mailbox@domain\.com
not Header{To} =~ mailbox@domain\.com
not Header{CC} =~ mailbox@domain\.com 

Here mailbox@domain.com stands for your e-mail address; the dot is written as '\.' in the mask so that it matches a literal dot. The first condition checks the Received header to determine whether the message was sent to the specified address. The other two conditions check whether this address is absent from the To and CC fields. The rule is applied when all conditions are met, i.e. the message was sent to the address, but the address appears neither in the To field nor in the CC field. If you have several mailboxes, you can create a separate rule like this for each of them.
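The combined effect of the three conditions can be sketched as follows. This is an illustration only: the header dictionary, the sample Received value, and the helper names are assumptions made for the example, and a real message may carry several Received headers.

```python
import re

ADDRESS_MASK = r"mailbox@domain\.com"  # placeholder address used in the text


def found(mask: str, text: str) -> bool:
    return re.search(mask, text) is not None


def sent_without_naming_recipient(headers: dict) -> bool:
    # All three conditions must be met: the address appears in Received,
    # and appears in neither To nor CC.
    return (
        found(ADDRESS_MASK, headers.get("Received", ""))
        and not found(ADDRESS_MASK, headers.get("To", ""))
        and not found(ADDRESS_MASK, headers.get("CC", ""))
    )


received = "from mx.example.net by mail.domain.com for <mailbox@domain.com>"
hidden = {"Received": received, "To": "undisclosed-recipients:;"}
direct = {"Received": received, "To": "mailbox@domain.com"}

print(sent_without_naming_recipient(hidden))
print(sent_without_naming_recipient(direct))
```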

7) Blocking messages in certain encoding (a black rule):

Header{Content-Type} =~ iso-2022-jp

This rule will block messages in the Japanese encoding. To block other encodings, you should specify the necessary name in the condition.

8) Blocking messages with several addresses in the To field that begin with the same characters (a black rule):

Header{To} =~ \b<?([\w\-.]{2})[^@, ]*@.*(?:\b<?\1[^@,]*@.*){3}

This rule blocks messages with four or more addresses in the To field that share the same first two characters. The number {2} in the condition sets how many characters are compared at the beginning of the addresses. The number {3} at the end of the expression is the minimum number of such addresses minus one: the first address is matched by the opening part of the expression, and {3} requires three more addresses that start with the same characters. For example, to require at least five addresses, replace {3} at the end of the expression with {4}.
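The relationship between the two numbers and the final mask can be made explicit with a small helper. This is a sketch: `same_prefix_mask` and its parameter names are hypothetical, and the test addresses are invented. With the default arguments it produces exactly the mask shown above.

```python
import re


def same_prefix_mask(prefix_len: int = 2, min_addresses: int = 4) -> re.Pattern:
    # The first address is matched by the opening part of the mask,
    # so the repeated group needs min_addresses - 1 repetitions.
    return re.compile(
        r"\b<?([\w\-.]{%d})[^@, ]*@.*(?:\b<?\1[^@,]*@.*){%d}"
        % (prefix_len, min_addresses - 1)
    )


four = "anna1@a.com, anna2@b.com, anna3@c.com, anna4@d.com"
three = "anna1@a.com, anna2@b.com, anna3@c.com"

print(same_prefix_mask().pattern)
print(same_prefix_mask().search(four) is not None)
print(same_prefix_mask().search(three) is not None)
print(same_prefix_mask(min_addresses=3).search(three) is not None)
```

The backreference \1 is what ties the addresses together: it must repeat exactly the characters captured by the first pair of parentheses.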

9) If you want some rule to be applied only to a certain mailbox, you should add one more condition to it:

Header{Received} =~ mailbox@domain\.com

Here mailbox@domain.com stands for the e-mail address of the mailbox the rule is to be applied to.

You can use the RegExpCheck tool to test patterns before applying them in rules.

You can download a set of black rules for filtering out spam from here:

Import the rules from the file in the dialog box where you edit black rules. Note that black rules mark messages as spam and, as a result, the filter will delete the corresponding messages from the server unless they match one of the white rules or other white conditions. Before you use the filter, check that none of your existing normal messages match these black rules. If any do, you can disable the offending rules by clearing their checkboxes in the corresponding list.

Main rules and metacharacters in regular expressions.

1. Any character represents itself if it is not a metacharacter. If you need to disable a metacharacter, put '\' before it.

2. A string of characters represents a string of these characters.

3. A set of possible characters (a class) is enclosed in square brackets '[]'. It means that any one of the characters in the brackets can take this place. If the first character in the brackets is '^', it means that none of the specified characters can take this place in the expression. You can use the character '-' inside a class to specify a range of characters. For example, a-z means any lowercase Latin letter, 0-9 means any digit, etc.

4. All characters, including special ones, can be specified with the help of '\', as in the C language.

5. Alternative strings are separated with the character '|'. Note that it is a regular character inside square brackets.

6. It is possible to specify "submasks" inside expressions by enclosing them in parentheses and to refer to the text they matched as '\number'. The first parenthesized submask is referred to as '\1'.
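The six rules above can each be illustrated with a one-line check. The following Python sketch lists a mask, a sample text, and whether a match is expected; the samples are invented, and the behavior shown is that of Python's `re` module, which agrees with the rules above for these constructions.

```python
import re

checks = [
    (r"a\.b", "a.b", True),         # rule 1: '\.' is a literal dot
    (r"a\.b", "axb", False),        # ... so it no longer matches any character
    (r"spam", "no spam here", True),  # rule 2: a string matches itself
    (r"gr[ae]y", "grey", True),     # rule 3: one character from the class
    (r"[^0-9]", "123", False),      # rule 3: '^' negates the class
    (r"[a-z][0-9]", "x7", True),    # rule 3: ranges inside a class
    (r"\x41\t", "A\t", True),       # rule 4: C-style escapes
    (r"cat|dog", "hotdog", True),   # rule 5: alternatives
    (r"[a|b]", "|", True),          # rule 5: '|' is literal inside brackets
    (r"(ab)\1", "abab", True),      # rule 6: '\1' repeats the first submask
    (r"(ab)\1", "abba", False),
]

results = [re.search(mask, text) is not None for mask, text, _ in checks]
for (mask, text, expected), actual in zip(checks, results):
    print(f"{mask!r} on {text!r}: {actual} (expected {expected})")
```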

\ - consider the following metacharacter as a regular character.
^ - line start
. - any character except '\n' (line end)
$ - line end
| - alternative (or)
() - grouping
[] - character class
Quantifiers (written after the element they apply to):
* - repeated 0 times or more
+ - repeated 1 time or more
? - 1 or 0 times
{n} - n times exactly
{n,} - at least n times
{n,m} - not less than n times, not more than m times
In all other cases, braces are treated as regular characters. Thus, '*' is equivalent to {0,}, '+' to {1,}, and '?' to {0,1}. n and m may not be larger than 65536.
\t - tab character
\n - new line
\r - carriage return
\f - form feed
\v - vertical tabulation
\a - alarm
\e - escape
\033 - octal character representation
\x1A - hexadecimal character representation
\c[ - control character
\l - lower case for the following character
\u - upper case for the following character
\L - all characters in lower case till \E
\U - all characters in upper case till \E
\E - case change limit
\Q - treat metacharacters as regular characters till \E
\w - alphanumerical or '_' character
\W - any character that is neither alphanumerical nor '_'
\s - one whitespace character
\S - one non-whitespace character
\d - one digit
\D - one non-digit
\b - word boundary
\B - not a word boundary
\A - start of the string
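A few of the entries above can be tried out directly. The sketch below uses Python's `re` module as a stand-in for the filter's engine, which is an assumption. Python does not implement the Perl-style escapes \l, \u, \L, \U, \E, \Q, \e and \c, so they are left out here.

```python
import re

# '*', '+' and '?' behave like their brace equivalents.
samples = ["ac", "abc", "abbc", "abbbc"]
equivalents = {r"ab*c": r"ab{0,}c", r"ab+c": r"ab{1,}c", r"ab?c": r"ab{0,1}c"}
same = all(
    bool(re.fullmatch(short, s)) == bool(re.fullmatch(explicit, s))
    for short, explicit in equivalents.items()
    for s in samples
)
print("shorthand and brace forms agree:", same)

# Braces that do not form a quantifier are ordinary characters.
braces_literal = re.search(r"x{y}", "x{y}") is not None
print("literal braces match:", braces_literal)

# Character shorthands and the word boundary.
digits = re.findall(r"\d+", "order 66, item 7")
whole_words = re.findall(r"\bcat\b", "cat concat cats")
fields = re.split(r"\s+", "a \t b\nc")
print(digits, whole_words, fields)
```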