Author |
Message |
dembrey
Joined: 01 Aug 2006 Posts: 9 Location: Dalton, UK
|
Posted: Tue Sep 25, 2007 1:08 pm Post subject: Non-Roman characters in messages |
|
|
Quite a high proportion of my spam consists mainly of Russian, Chinese or Japanese characters. Since I do not receive any legitimate emails containing these characters, would it be possible to set up a filter that deletes such messages based on characters that occur frequently in these (or other non-Roman) languages, configured either by the user or by the developer of the software? |
|
vetaltm Author
Joined: 05 Feb 2006 Posts: 751
|
Posted: Tue Sep 25, 2007 7:08 pm Post subject: |
|
|
- If you never receive legitimate emails in non-Roman encodings, the content classifier should recognize most of them and assign a high spam ratio. Just train the plug-in on several messages per language, and subsequent emails containing words from the unwanted languages will be classified as spam with high probability.
- It is possible to use black rules to classify messages by the charset code in the Content-Type header. Here is a rule that recognizes, by their headers, most messages in Russian, Japanese and Chinese:
Code: | Header{Content-Type} =~ windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5 |
The following rule detects HTML messages that declare one of the unwanted charset codes in an HTML meta tag (in the message body):
Code: | Body =~ charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5 |
You can look at the source of missed emails (select a message and press F9 in TheBat) and add any additional charset codes to these rules, separated by "|".
- This rule blocks messages from certain countries by the name of the corresponding top-level domain:
Code: | Header{Received} =~ \S+\.jp\s*|\S+\.cn\s*|\S+\.ru\s* |
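To see how the header rules above behave, here is a minimal sketch that applies the same expressions with Python's `re` module. It assumes the plug-in searches anywhere in the header value and ignores case, which this thread does not confirm. The function names are hypothetical and only serve the illustration.

```python
import re

# Charset alternation copied from the Content-Type rule above.
# Assumption: the plug-in matches case-insensitively.
CHARSET_RULE = re.compile(
    r"windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5",
    re.IGNORECASE,
)

# Top-level-domain alternation copied from the Received rule above.
TLD_RULE = re.compile(r"\S+\.jp\s*|\S+\.cn\s*|\S+\.ru\s*", re.IGNORECASE)


def content_type_is_unwanted(content_type: str) -> bool:
    """Return True if the Content-Type value names one of the listed charsets."""
    return CHARSET_RULE.search(content_type) is not None


def received_is_unwanted(received: str) -> bool:
    """Return True if the Received value contains a host matching the TLD rule."""
    return TLD_RULE.search(received) is not None


if __name__ == "__main__":
    print(content_type_is_unwanted("text/plain; charset=koi8-r"))
    print(content_type_is_unwanted("text/plain; charset=iso-8859-1"))
    print(received_is_unwanted("from mail.goo.ne.jp by mx.example.org"))
```

In this sketch the Received expression is not anchored at the end of the host name, so a value such as `from mail.rutgers.edu` also matches through `\.ru`. If the plug-in's regular expressions behave the same way, that is one more reason to use the third rule with care.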
|
|
dembrey
Joined: 01 Aug 2006 Posts: 9 Location: Dalton, UK
|
Posted: Wed Sep 26, 2007 12:13 pm Post subject: |
|
|
I installed the first two filters that you suggested (but not the third, as I might receive wanted messages from Japan or Russia written in Roman text). This morning I still have new messages in Japanese, Chinese and Cyrillic scripts, as shown below:
From: サクラなし <shigerutakeuchi@mail.goo.ne.jp>
Subject: 素人が集まる超優良サイト
From: Борислава Дербененко <burke@escortcorp.com>
Subject: RE[9]: Снятие с учета в ГАИ - БЕСПЛАТНО
Subject: 地元のオバサンを抱きたいですか?レベ X-Spam-Level: 8/5
As you can see, the last message has "X-Spam-Level: 8/5" appended to its subject. Does this mean a filter is working? The spam blocker in my AV program (F-Secure) is turned off.
I was wondering if I had installed the filters correctly using the black rules dialogue. I set up a rule called NonRomanChars and the condition I am testing on is:
Header{Content-Type} =~ windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5
Maybe it should be Header{From}? Please advise.
Incidentally I also set up another filter called Body HTML NonRoman which corresponds to the other parameters you supplied, i.e. the filter condition is:
Body =~ charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5
Again can you confirm that this is correct?
Thank you for your assistance. |
|
vetaltm Author
Joined: 05 Feb 2006 Posts: 751
|
Posted: Wed Sep 26, 2007 4:55 pm Post subject: |
|
|
dembrey wrote: |
Subject: 地元のオバサンを抱きたいですか?レベ X-Spam-Level: 8/5
As you can see, the last message has "X-Spam-Level: 8/5" appended to its subject. Does this mean a filter is working? The spam blocker in my AV program (F-Secure) is turned off.
|
The plug-in doesn't change the message subjects. Perhaps this substring was added by some server-level filter.
dembrey wrote: |
I was wondering if I had installed the filters correctly using the black rules dialogue. I set up a rule called NonRomanChars and the condition I am testing on is:
Header{Content-Type} =~ windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5
Maybe it should be Header{From}? Please advise.
|
Make sure that the rule condition is added properly:
- Press Add in the Rule dialog
- Enter windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5 in the "Apply expression" field
- Select the "To header" item
- Enter Content-Type in the header name field
The "Content-Type" message header contains a charset code, which the email client uses to display the message text. You can see it in the source of missed messages by pressing F9 in TheBat. The Content-Type header usually looks like this:
Code: | Content-Type: text/plain; charset=koi8-r |
The charset code follows the "charset=" substring. If a missed message contains a charset code that is not included in the rule condition, add it manually to the end of the expression, separated from the other codes by the "|" symbol.
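For anyone who wants to check which charset codes their missed messages actually declare, the following is a small sketch using Python's standard `email` package. It is not part of the plug-in. The sample message, the `BLOCKED` set and the function names are made up for illustration.

```python
from email import message_from_string

# Hypothetical raw message source, similar to what F9 shows in the mail client.
RAW = (
    "From: sender@example.com\n"
    "Subject: test\n"
    "Content-Type: text/plain; charset=koi8-r\n"
    "\n"
    "body\n"
)

# The charset codes listed in the rule condition.
BLOCKED = {"windows-1251", "koi8-r", "shift_jis", "iso-2022-jp", "gb2312", "big5"}


def declared_charset(raw_message: str):
    """Return the lowercased charset from the Content-Type header, or None."""
    msg = message_from_string(raw_message)
    return msg.get_content_charset()


def is_blocked(raw_message: str) -> bool:
    """Return True if the declared charset is one of the listed codes."""
    charset = declared_charset(raw_message)
    return charset is not None and charset in BLOCKED


if __name__ == "__main__":
    print(declared_charset(RAW), is_blocked(RAW))
```

A message that slips through with a charset not in the set is a candidate for adding that code to the rule. A message with no charset in its Content-Type header cannot be caught by this rule at all.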
dembrey wrote: |
Incidentally I also set up another filter called Body HTML NonRoman which corresponds to the other parameters you supplied, i.e. the filter condition is:
Body =~ charset="?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5
Again can you confirm that this is correct?
|
Yes, but here is a more robust condition that covers more cases:
Code: | Body =~ charset\s*=\s*"?windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5 |
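One detail about this expression: in most regular expression dialects "|" has the lowest precedence, so the `charset\s*=\s*"?` prefix belongs only to the first alternative (windows-1251), and the remaining codes match wherever they appear in the body. The sketch below shows this with Python's `re` module. It assumes the plug-in's regular expressions follow the same precedence and support parentheses for grouping, neither of which is confirmed in this thread. The grouped variant is a possible tightening, not a rule taken from the plug-in documentation.

```python
import re

CODES = "windows-1251|koi8-r|shift_jis|iso-2022-jp|gb2312|big5"

# The condition exactly as posted: the prefix binds only to the first code.
AS_POSTED = re.compile(r'charset\s*=\s*"?' + CODES, re.IGNORECASE)

# Hypothetical variant: parentheses make the prefix apply to every code.
GROUPED = re.compile(r'charset\s*=\s*"?(' + CODES + r")", re.IGNORECASE)

meta_tag = '<meta http-equiv="Content-Type" content="text/html; charset = big5">'
plain_mention = "Our server converts koi8-r text to UTF-8."

if __name__ == "__main__":
    # Both patterns match a real charset declaration in a meta tag.
    print(bool(AS_POSTED.search(meta_tag)), bool(GROUPED.search(meta_tag)))
    # Only the posted pattern matches a bare mention of a charset name.
    print(bool(AS_POSTED.search(plain_mention)), bool(GROUPED.search(plain_mention)))
```

Whether the looser matching is intended is not stated here. For a mailbox that never receives legitimate mail mentioning these charset names, the difference rarely matters in practice.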
|
|