antispamsniper.com

dembrey · Joined: 01 Aug 2006 Posts: 9 Location: Dalton, UK

A high proportion of my spam consists mainly of Russian, Chinese or Japanese characters. Since I do not receive any meaningful emails that contain these characters, would it be possible to set up a filter that deletes these messages based on frequently occurring characters in these (or other non-Roman) languages, configured either by the user or by the developer of the software?

vetaltm · Author Joined: 05 Feb 2006 Posts: 754

- If you never receive legitimate emails in non-Roman encodings, the content classifier should recognize most such messages and give them a high spam ratio. Just train the plug-in on several messages per language, and subsequent emails containing words from the unwanted languages will be classified as spam with high probability.

- It is also possible to use black rules to classify messages by the charset code in the Content-Type header. Here is a rule that recognizes, by their headers, most messages in Russian, Japanese and Chinese:
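As an illustration of the idea only, and not the plug-in's own rule syntax, the header check can be sketched in Python. The function name `content_type_is_unwanted` is hypothetical, and the charset list is the one from the Header{Content-Type} condition quoted later in this thread.

```python
import re
from email import message_from_string

# Charset names copied from the Header{Content-Type} condition in this thread.
UNWANTED_CHARSETS = re.compile(
    r"windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5",
    re.IGNORECASE,
)


def content_type_is_unwanted(raw_message: str) -> bool:
    """Return True if the top-level Content-Type header names an unwanted charset.

    Hypothetical helper for illustration; it inspects only the message's
    own Content-Type header, not the headers of any MIME parts.
    """
    msg = message_from_string(raw_message)
    content_type = msg.get("Content-Type", "")
    return UNWANTED_CHARSETS.search(content_type) is not None


if __name__ == "__main__":
    sample = 'From: someone@example.com\nContent-Type: text/plain; charset="koi8-r"\n\nbody'
    print(content_type_is_unwanted(sample))
```

A check like this sees only the top-level header. In a multipart message the charset usually appears in the headers of the individual parts, so a header rule alone may miss it.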

dembrey · Joined: 01 Aug 2006 Posts: 9 Location: Dalton, UK

I installed the first two filters that you suggested (but not the third, as I might receive wanted messages from Japan or Russia written in Roman text). This morning I still have new messages in Japanese, Chinese and Cyrillic scripts, as shown below:

From: サクラなし <shigerutakeuchi@mail.goo.ne.jp>
Subject: 素人が集まる超優良サイト

From: Борислава Дербененко <burke@escortcorp.com>
Subject: RE[9]: Снятие с учета в ГАИ - БЕСПЛАТНО

Subject: 地元のオバサンを抱きたいですか？レベ X-Spam-Level: 8/5

As you can see, the last message has "X-Spam-Level: 8/5" appended to its subject. Does this mean a filter is working? The spam blocker in my AV program (F-Secure) is turned off.

I was wondering whether I had installed the filters correctly using the black rules dialogue. I set up a rule called NonRomanChars, and the condition I am testing is:

Header{Content-Type} =~ windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5

Maybe it should be Header{From}? Please advise.

Incidentally, I also set up another filter called Body HTML NonRoman, which corresponds to the other parameters you supplied, i.e. the filter condition is:

Body =~ charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5

Again, can you confirm that this is correct?

Thank you for your assistance.
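A note on the Body condition above: in common regex dialects the `|` operator has the lowest precedence, so the `charset="?` prefix applies only to `windows-1251`, and the remaining charset names match anywhere in the body. Whether the plug-in's pattern syntax behaves this way is an assumption; the sketch below uses Python's `re` module purely to show the difference, and the names `UNGROUPED` and `GROUPED` are illustrative.

```python
import re

# The condition exactly as written in the post: the prefix binds only
# to the first alternative.
UNGROUPED = re.compile(
    r'charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5',
    re.IGNORECASE,
)

# Same charset names, with the alternatives grouped so the prefix
# applies to every one of them.
GROUPED = re.compile(
    r'charset="?(?:windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5)',
    re.IGNORECASE,
)

part_header = 'Content-Type: text/html; charset="koi8-r"'
plain_text = "Please convert the file from Big5 to UTF-8."

# Both patterns match a real charset declaration.
print(bool(UNGROUPED.search(part_header)), bool(GROUPED.search(part_header)))

# Only the ungrouped pattern matches a bare mention of a charset name.
print(bool(UNGROUPED.search(plain_text)), bool(GROUPED.search(plain_text)))
```

If the plug-in's dialect supports grouping, the grouped form would avoid flagging a message that merely mentions one of these names in its text.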

vetaltm · Author Joined: 05 Feb 2006 Posts: 754