antispamsniper.com Forum Index antispamsniper.com
The reliable anti-spam protection
 

Non-Roman characters in messages

 
dembrey



Joined: 01 Aug 2006
Posts: 9
Location: Dalton, UK

Posted: Tue Sep 25, 2007 1:08 pm    Post subject: Non-Roman characters in messages

Quite a high proportion of my spam consists mainly of Russian, Chinese or Japanese characters. Since I never receive legitimate emails containing these characters, would it be possible to set up a filter that deletes such messages based on characters that occur frequently in these (or other non-Roman) languages, configured either by the user or by the developer of the software?
vetaltm
Author


Joined: 05 Feb 2006
Posts: 751

Posted: Tue Sep 25, 2007 7:08 pm    Post subject:

- If you never receive legitimate emails in non-Roman encodings, the content classifier should assign most such messages a high spam ratio. Just train the plug-in on several messages per language; subsequent emails containing words from the unwanted languages will then be classified as spam with high probability.

- It is possible to use black rules to classify messages by the charset code in the Content-Type header. Here is a rule that recognizes, by their headers, most messages in Russian, Japanese and Chinese:
Code:
Header{Content-Type} =~ windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5


The following rule detects HTML messages with the unwanted charset codes in HTML meta tags (in the message body):
Code:
Body =~ charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5

You can look at the sources of missed emails (select a message and press F9 in TheBat) and add further charset codes to these rules, delimited with "|".

- This rule blocks messages from certain countries by the name of the corresponding top-level domain:
Code:
Header{Received} =~ \S+\.jp\s*|\S+\.cn\s*|\S+\.ru\s*
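
The three expressions above can be tried out on sample text before they are entered as rules. The following is only a rough sketch in Python: the helper rule_matches is hypothetical, and it assumes that the plug-in's "=~" test behaves like an unanchored, case-insensitive regular-expression search, which the thread does not confirm. The sample headers use made-up example hosts.

```python
import re

# Expressions copied from the three rules above.
CHARSET_HEADER_RULE = r"windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5"
CHARSET_BODY_RULE = r'charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5'
RECEIVED_TLD_RULE = r"\S+\.jp\s*|\S+\.cn\s*|\S+\.ru\s*"


def rule_matches(expression: str, text: str) -> bool:
    """Hypothetical stand-in for the plug-in's `=~` test.

    Assumption: an unanchored, case-insensitive regular-expression search.
    """
    return re.search(expression, text, re.IGNORECASE) is not None


if __name__ == "__main__":
    samples = [
        (CHARSET_HEADER_RULE, "text/plain; charset=koi8-r"),
        (CHARSET_HEADER_RULE, "text/plain; charset=utf-8"),
        (CHARSET_BODY_RULE,
         '<meta http-equiv="Content-Type" content="text/html; charset=big5">'),
        (RECEIVED_TLD_RULE, "from mail.example.jp (mail.example.jp [192.0.2.1])"),
        (RECEIVED_TLD_RULE, "from mail.example.com (mail.example.com [192.0.2.2])"),
    ]
    # Print whether each sample text would trigger its rule.
    for expression, text in samples:
        print(rule_matches(expression, text), "<-", text)
```

Under these assumptions, a header that declares charset=utf-8 is not matched by the first expression, whatever script the message text uses, because utf-8 is not among the listed codes.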
dembrey



Joined: 01 Aug 2006
Posts: 9
Location: Dalton, UK

Posted: Wed Sep 26, 2007 12:13 pm    Post subject:

I installed the first two filters that you suggested (but not the third, as I might receive wanted messages from Japan or Russia in Roman text). This morning I still have new messages in Japanese, Chinese and Cyrillic scripts, as shown below:

From: サクラなし <shigerutakeuchi@mail.goo.ne.jp>
Subject: 素人が集まる超優良サイト

From: Борислава Дербененко <burke@escortcorp.com>
Subject: RE[9]: Снятие с учета в ГАИ - БЕСПЛАТНО

Subject: 地元のオバサンを抱きたいですか?レベ X-Spam-Level: 8/5

As you can see, the last message has "X-Spam-Level: 8/5" appended. Is this a filter at work? The spam blocker in my AV program (F-Secure) is turned off.

I was wondering if I had installed the filters correctly using the black rules dialogue. I set up a rule called NonRomanChars and the condition I am testing on is:

Header{Content-Type} =~ windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5

Maybe it should be Header{From}? Please advise.

Incidentally I also set up another filter called Body HTML NonRoman which corresponds to the other parameters you supplied, i.e. the filter condition is:

Body =~ charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5

Again can you confirm that this is correct?

Thank you for your assistance.
vetaltm
Author


Joined: 05 Feb 2006
Posts: 751

Posted: Wed Sep 26, 2007 4:55 pm    Post subject:

dembrey wrote:

Subject: 地元のオバサンを抱きたいですか?レベ X-Spam-Level: 8/5
As you can see, the last message has "X-Spam-Level: 8/5" appended. Is this a filter at work? The spam blocker in my AV program (F-Secure) is turned off.

The plug-in doesn't change the message subjects. Perhaps this substring was added by some server-level filter.

dembrey wrote:

I was wondering if I had installed the filters correctly using the black rules dialogue. I set up a rule called NonRomanChars and the condition I am testing on is:

Header{Content-Type} =~ windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5

Maybe it should be Header{From}? Please advise.

Make sure that the rule condition is added properly:
- Press Add in the Rule dialog
- Enter windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5 in the "Apply expression" field
- Select the "To header" item
- Enter Content-Type in the header name field

The "Content-Type" message header contains a charset code, which the email client uses to display the message text. You can see it in the sources of missed messages by pressing F9 in TheBat. Usually the Content-Type header looks like this:
Code:
Content-Type: text/plain; charset=koi8-r

The charset code follows the "charset=" substring. If a missed message contains a charset code that is not included in the rule condition, add it manually to the end of the expression, separated from the other codes with the "|" symbol.
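
The manual step of reading the charset from a missed message and appending it to the expression can also be sketched with Python's standard email package. This is an illustration only, not part of the plug-in: the names RULE_CODES, declared_charsets and extend_expression are hypothetical, and the raw message source is assumed to have been saved as text from the F9 view.

```python
from email import message_from_string

# Charset codes currently listed in the rule expression.
RULE_CODES = ["windows-1251", "koi8-r", "shift_jis", "iso-2022-jp", "gb2312", "big5"]


def declared_charsets(raw_message: str) -> set:
    """Return the lower-cased charset codes declared by any part of the message."""
    message = message_from_string(raw_message)
    charsets = set()
    for part in message.walk():
        charset = part.get_content_charset()  # None when no charset is declared
        if charset:
            charsets.add(charset)
    return charsets


def extend_expression(codes: list, raw_message: str) -> str:
    """Append charsets the rule does not list yet and join everything with '|'."""
    missing = sorted(declared_charsets(raw_message) - set(codes))
    return "|".join(codes + missing)


if __name__ == "__main__":
    # Made-up message source declaring a Cyrillic charset that the rule lacks.
    raw = "Content-Type: text/plain; charset=ISO-8859-5\nSubject: test\n\nbody\n"
    print(extend_expression(RULE_CODES, raw))
```

Each suggested code still deserves a look before it goes into the rule. A code such as utf-8 is shared by many languages, so adding it would probably match wanted mail as well.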

dembrey wrote:

Incidentally I also set up another filter called Body HTML NonRoman which corresponds to the other parameters you supplied, i.e. the filter condition is:

Body =~ charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5

Again can you confirm that this is correct?

Yes, but here is a more robust condition that covers more cases, since it also allows whitespace around the "=" sign:
Code:
Body =~ charset\s*=\s*"?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5
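
The difference between the two Body conditions can be checked on sample text. The sketch below is in Python and assumes the plug-in follows the usual Perl-style regular-expression semantics with an unanchored, case-insensitive search; the thread does not confirm this. Under those semantics "|" binds more loosely than everything else, so the charset prefix belongs only to the first alternative and the remaining codes match anywhere in the body. GROUPED is a hypothetical variant, not taken from the thread, that applies the prefix to every code.

```python
import re

# First Body condition from the thread (no whitespace allowed around "=").
ORIGINAL = r'charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5'

# Revised Body condition from the post above.
REVISED = r'charset\s*=\s*"?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5'

# Hypothetical grouped variant: the charset prefix applies to every code.
GROUPED = r'charset\s*=\s*"?(windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5)'


def matches(expression: str, body: str) -> bool:
    """Unanchored, case-insensitive search; an assumed model of the Body test."""
    return re.search(expression, body, re.IGNORECASE) is not None


if __name__ == "__main__":
    spaced_meta = '<meta content="text/html; charset = windows-1251">'
    plain_text = "Our fonts support Big5 and KOI8-R encodings."
    for name, expression in (("ORIGINAL", ORIGINAL),
                             ("REVISED", REVISED),
                             ("GROUPED", GROUPED)):
        print(name, matches(expression, spaced_meta), matches(expression, plain_text))
```

If the assumption holds, the revised condition catches the meta tag with spaces around "=", which the first version misses. Only the grouped variant ignores ordinary text that merely mentions one of the codes.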

All times are GMT