Author |
Message |
Sacles
Joined: 09 Nov 2007 Posts: 51 Location: Belgium (near Liège)
|
Posted: Sun Nov 11, 2007 8:36 am Post subject: To train ASS |
|
|
Hello,
I can download more than 2000 spam messages (*.eml).
To train ASS, is it useful to import them into The Bat! and then classify them as spam?
Address: http://foxmail.free.fr/dl/spamagogo/
I found this address on the French-language Foxmail site. |
|
|
|
Sacles
Joined: 09 Nov 2007 Posts: 51 Location: Belgium (near Liège)
|
Posted: Tue Nov 13, 2007 5:22 pm Post subject: |
|
|
No reply? |
|
|
|
vetaltm Author
Joined: 05 Feb 2006 Posts: 751
|
Posted: Wed Nov 14, 2007 4:16 am Post subject: Re: To train ASS |
|
|
Sacles wrote: | I can download more than 2000 spam messages (*.eml).
To train ASS, is it useful to import them into The Bat! and then classify them as spam?
Address: http://foxmail.free.fr/dl/spamagogo/
|
Yes, it makes sense to train the plug-in on additional spam messages. But please consider the following:
- Make sure that a message is truly spam before training the plug-in on it. Some phishing and "social engineering" spam messages contain a lot of non-spam text, which may impair the overall classification quality.
- The plug-in must be trained on both ham and spam messages. The algorithm does its best to avoid "overtraining", but it is not good when the database contains too many spam messages and too few ham messages.
- The best classification quality is reached by training the plug-in on its mistakes, i.e. on your own messages that it classified with the wrong spam ratio. |
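To illustrate the "train on its mistakes" advice, here is a minimal Python sketch. It is not the plug-in's actual algorithm: `TinyBayes`, `train_on_errors` and the 0.5 threshold are hypothetical names and values chosen for illustration, and the scoring is a simple naive-Bayes-style toy.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy token-based classifier (an assumption, not the plug-in's real method)."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.tokens[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def spam_ratio(self, text):
        """Return a score in [0, 1]; 0.5 means the classifier has no opinion."""
        log_odds = 0.0
        for tok in set(text.lower().split()):
            # Add-one smoothing so unseen tokens do not produce zero probabilities.
            p_spam = (self.tokens["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def train_on_errors(clf, labelled, threshold=0.5):
    """Train only on messages the classifier currently gets wrong.

    `labelled` is an iterable of (text, "spam" | "ham") pairs that a human
    has verified. Returns the number of messages that needed correcting.
    """
    corrected = 0
    for text, label in labelled:
        predicted = "spam" if clf.spam_ratio(text) >= threshold else "ham"
        if predicted != label:
            clf.train(text, label)
            corrected += 1
    return corrected


clf = TinyBayes()
clf.train("cheap pills", "spam")      # seed with one message of each class
clf.train("meeting agenda", "ham")
train_on_errors(clf, [("pills agenda meeting", "spam")])
```

Running `train_on_errors` again over the same verified messages adds nothing once they are classified correctly, so bulk-importing thousands of similar spam messages gives less benefit than correcting the few the filter misjudges.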
|
|
|
Sacles
Joined: 09 Nov 2007 Posts: 51 Location: Belgium (near Liège)
|
Posted: Wed Nov 14, 2007 4:34 am Post subject: |
|
|
Hello,
Thank you for this advice.
------------
Is there not a risk that the learning file becomes too large?
Can I maintain this file (without erasing it completely)?
For example, Spamihilator can compact the learning filter. |
|
|
|
vetaltm Author
Joined: 05 Feb 2006 Posts: 751
|
Posted: Wed Nov 14, 2007 5:32 am Post subject: |
|
|
Sacles wrote: |
Is there not a risk that the learning file becomes too large?
Can we not maintain it (without erasing it completely)?
For example, Spamihilator can compact the learning filter. |
Here are some additional points about training to make things clearer:
- The plug-in classifies messages before adding them to the classification database. Ham messages with a low spam ratio and spam messages with a high spam ratio are not added to the main classification database. The plug-in stores these correctly classified messages as "hints" in a separate database.
- Messages from the "hints" database can be added to the main classification database when the plug-in needs to improve the filtering quality after learning some new messages.
- The "hints" database is deleted periodically, whereas the main classification database contains only the minimum subset of learned messages required to provide the best filtering quality. As a result, the database files will not become "heavy" unless it is absolutely necessary.
- When I wrote "overtraining" above, I meant the balance between ham and spam messages in the plug-in databases, not the overall quantity of messages from both classes. The plug-in's decisions are based on the spam and ham samples in its database. If the database contains mostly ham or mostly spam messages, the plug-in cannot distinguish messages of the two classes with high confidence. So it is important to train the plug-in on messages from both classes so that it "knows" more about the differences between spam and ham. |
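The routing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `TrainingStore`, the cut-offs 0.2 and 0.8, and promoting hints by class are all hypothetical, and the plug-in's real storage format and thresholds are not documented in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingStore:
    low: float = 0.2    # assumed cut-off: ham scored below this is already handled well
    high: float = 0.8   # assumed cut-off: spam scored above this is already handled well
    main: list = field(default_factory=list)    # main classification database
    hints: list = field(default_factory=list)   # separate "hints" database

    def learn(self, text, label, spam_ratio):
        """Route a message using the spam ratio it received BEFORE training."""
        confident_ham = label == "ham" and spam_ratio < self.low
        confident_spam = label == "spam" and spam_ratio > self.high
        if confident_ham or confident_spam:
            # Already classified correctly: keep it only as a hint.
            self.hints.append((text, label))
            return "hints"
        # Misclassified or uncertain: this is what the main database needs.
        self.main.append((text, label))
        return "main"

    def promote_hints(self, label, limit):
        """Move up to `limit` hints of one class into the main database.

        Selecting by class is an assumption made here to show rebalancing.
        """
        picked = [h for h in self.hints if h[1] == label][:limit]
        for h in picked:
            self.hints.remove(h)
            self.main.append(h)
        return len(picked)

    def purge_hints(self):
        """Periodic clean-up: the hints database is simply discarded."""
        self.hints.clear()

    def balance(self):
        """Return (spam_count, ham_count) for the main database."""
        spam = sum(1 for _, lbl in self.main if lbl == "spam")
        return spam, len(self.main) - spam


store = TrainingStore()
store.learn("cheap pills", "spam", 0.95)       # confident and correct
store.learn("invoice attached", "spam", 0.10)  # misclassified
```

In this sketch only the second message reaches the main database, which is why importing a large batch of obvious spam would not make the main file much bigger.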
|
|
|
|