Operating System - Linux
12-08-2005 04:21 AM
Practical SA-Learn input methodologies
This isn't really a Linux question, but I hope there are enough SpamAssassin experts here to help me.
Recently some of my users have been complaining about spam, and I'd like to give them an automated way to train the Bayesian filter. Since the Bayesian filter depends on having copious quantities of red (spam) and green (non-spam) messages, I'd like to get as many as possible into the learning process with as little effort from the users as possible.
I have a cron job to learn messages and the accounts set up, but I'm running into a bit of an issue with the practical application.
The user base is (in descending order of count) Outlook/POP3, Outlook/IMAP, Thunderbird/POP3, Thunderbird/IMAP, and at least one Pegasus/POP3 oddball.
Given this, I've come up with two possible input methodologies.
A) Red and green email addresses. The user forwards spam to RedList@domain.com and good messages to GreenList@domain.com. Problems: it misses the many green messages that users don't bother to forward, and the headers get mangled by the forwarding process.
B) An IMAP folder for spam, processed nightly. The user moves a spam message to the spam folder (preserving the headers), where it is learned and deleted. The rest of the IMAP store is learned as green. Problems: it is only available to IMAP users, and a message may be learned as green first and as red later if the user doesn't move it before the cron job runs.
Which brings me around to my questions: Which of these methods causes less statistical damage to the Bayesian databases? Are there any mitigating steps I can employ to offset these problems? Are there any better methods? Am I missing something major?
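One possible mitigation for the timing problem in method B is a grace period: only mail that has sat unmoved in the inbox for several days is learned as green. The sketch below is a rough illustration, not a tested production script. It assumes the IMAP store is in Maildir format (one file per message in a `cur` directory) and that file modification time approximates delivery time; `GRACE_DAYS` and the function names are made up for this example. The only sa-learn options it uses are `--spam` and `--ham`.

```python
import os
import subprocess

GRACE_DAYS = 7  # assumption: time users get to file spam before mail counts as ham


def ham_candidates(cur_dir, now, grace_days=GRACE_DAYS):
    """Return message files in a Maildir 'cur' directory older than the grace period."""
    cutoff = now - grace_days * 86400
    paths = []
    for name in sorted(os.listdir(cur_dir)):
        path = os.path.join(cur_dir, name)
        # Assumption: mtime is close enough to the delivery time.
        if os.path.isfile(path) and os.path.getmtime(path) <= cutoff:
            paths.append(path)
    return paths


def sa_learn_command(kind, paths):
    """Build an sa-learn command line; kind is 'spam' or 'ham'."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind] + list(paths)


def learn(kind, paths):
    """Run sa-learn on the given files; does nothing when the list is empty."""
    if paths:
        subprocess.run(sa_learn_command(kind, paths), check=True)
```

A nightly job could call `learn("spam", ...)` on every file in the spam folder and then `learn("ham", ham_candidates(inbox_cur, time.time()))` on the inbox. The sa-learn documentation also describes it as tracking which messages it has already learned and relearning a message when it is later submitted as the other type, which should limit the damage from late moves; that is worth confirming against the installed version.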
There have been Innumerable people who have helped me. Of course, I've managed to piss most of them off.
2 REPLIES
12-08-2005 06:04 AM
Re: Practical SA-Learn input methodologies
Shalom Thomas,
Let me suggest a slightly different method.
Don't forward the messages; that harms the headers.
Move them to a spam folder, as in item B.
You should be able to come up with a move methodology that works for POP mail as well.
I have a methodology in place that is similar to yours.
My users or I move the message to a folder called spam, using a webmail reader like SquirrelMail (IMAP, I know).
That folder is read periodically by a cron job that can provide clean headers to a spam filter, and by a custom script that can go through the headers and make block entries for the /etc/mail/access file.
We may even wish to swap methodologies to improve the spam filtering on our respective servers.
SEP
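The header-to-access-file step described above might look roughly like the following. This is not SEP's script, only a guess at the general shape. It assumes the block entry is keyed on the From address and uses the `REJECT` action in a tab-separated sendmail access line. `access_entry` and `access_entries` are names invented for this example.

```python
import email
from email.utils import parseaddr


def access_entry(raw_message, action="REJECT"):
    """Turn one raw message into an access-file line, or None if no usable From."""
    msg = email.message_from_string(raw_message)
    _, addr = parseaddr(msg.get("From", ""))
    if "@" not in addr:
        return None
    # Key and action separated by a tab, one entry per line.
    return addr.lower() + "\t" + action


def access_entries(raw_messages, action="REJECT"):
    """Collect unique, sorted access-file lines for a batch of messages."""
    entries = set()
    for raw in raw_messages:
        line = access_entry(raw, action)
        if line is not None:
            entries.add(line)
    return sorted(entries)
```

Spam From addresses are frequently forged, so entries like these are best reviewed before being appended to /etc/mail/access. The access map also has to be rebuilt afterwards, typically with `makemap hash /etc/mail/access < /etc/mail/access`.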
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
12-14-2005 02:22 AM
Re: Practical SA-Learn input methodologies
Thanks SEP, I think I'm going to go with the IMAP folder method, and just have the POP clients forward their spam and not-spam as attachments.
I will post the cron script once I have it debugged.
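Forwarding as an attachment keeps the original message, headers included, inside a message/rfc822 part of the forward. Pending the real cron script, here is a minimal sketch of how the originals could be pulled back out with Python's standard `email` package before being handed to sa-learn. `extract_forwarded` is a name made up for this example, and it assumes the mail client really does attach the original as message/rfc822.

```python
import email


def _collect(part, found):
    """Gather attached messages without descending into the attachments themselves."""
    if part.get_content_type() == "message/rfc822":
        # The payload of a message/rfc822 part is a one-element list
        # holding the original message.
        found.append(part.get_payload(0).as_bytes())
        return
    if part.is_multipart():
        for sub in part.get_payload():
            _collect(sub, found)


def extract_forwarded(raw_bytes):
    """Return each original message attached to a forward, as raw bytes."""
    wrapper = email.message_from_bytes(raw_bytes)
    found = []
    _collect(wrapper, found)
    return found
```

Each returned item can be written to its own file and passed to `sa-learn --spam` or `sa-learn --ham`, depending on which address received the forward. The forwarding wrapper itself is dropped, so its headers never reach the Bayesian database.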
There have been Innumerable people who have helped me. Of course, I've managed to piss most of them off.
The opinions expressed above are the personal opinions of the authors, not of Hewlett Packard Enterprise. By using this site, you accept the Terms of Use and Rules of Participation.
© Copyright 2025 Hewlett Packard Enterprise Development LP