Simulating Bayes spam filtering with the IKAnalyzer Chinese word segmenter (source code attached)

zhangcong170

Algorithms lucene Apache Eclipse .net

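I recently found some spare time to build a small program that simulates mail filtering with the Bayes algorithm, mostly as an exercise for myself. It took a little over a week from start to finish.
For background on how the Bayes algorithm is applied to anti-spam filtering, see: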

http://www.5dmail.net/html/2006-5-18/2006518234548.htm

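The demo given at the link above has one fairly serious flaw: it counts individual characters, not words. That approach does not work for Chinese, where you have to count words rather than single characters. Extracting familiar words from a large body of text is easy to describe and quite hard to do. I was stuck on this for days, until I noticed that the JavaEye home page carried an introduction to the Chinese word segmenter IKAnalyzer. As the saying goes, "you wear out iron shoes searching, then find it with no effort at all." I promptly borrowed IKAnalyzer, tried it out, and it turned out to work very well.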

For more about the IKAnalyzer Chinese word segmenter, see:

http://www.iteye.com/wiki/interview/1899-ik-analyzer

Here is a short piece of code that shows what the segmenter can do:

package ik.com.cn.test;

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.wltea.analyzer.lucene.IKAnalyzer;

/*
 * A small word-segmentation demo.
 * This segmenter really is quite capable.
 * The required jars are in the lib folder of the attached Eclipse project.
 */
public class SecondTest {
	public static void main(String[] args) throws IOException {
		Analyzer analyzer = new IKAnalyzer();
		String text = "中华人民共和国";
		StringReader reader = new StringReader(text);
		// The first argument of tokenStream is the field name;
		// this demo simply passes the text itself.
		TokenStream ts = analyzer.tokenStream(text, reader);
		// Read tokens one by one until the stream is exhausted (next returns null).
		Token t = ts.next(new Token());
		while (t != null) {
			System.out.println(t.term());
			t = ts.next(new Token());
		}
	}
}

Output:

中华人民共和国
中华人民
中华
华人
人民共和国
人民
共和国
共和
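
The Bayes class shown later takes two word-count tables as input, so the segmenter's tokens have to be tallied first. The sketch below shows one way that tally might look. The class name WordCounter and the method countWords are hypothetical and are not taken from the attached project, and the token list is hard-coded so the example runs without the IKAnalyzer jar.

```java
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;

// Hypothetical helper, not part of the attached project.
final class WordCounter {

    // Counts how many times each token occurs in the given list.
    public static Hashtable<String, Integer> countWords(List<String> tokens) {
        Hashtable<String, Integer> counts = new Hashtable<String, Integer>();
        for (String token : tokens) {
            Integer old = counts.get(token);
            counts.put(token, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for segmenter output (assumption: tokens are already extracted).
        List<String> tokens = Arrays.asList("中华", "人民", "中华", "共和国");
        Hashtable<String, Integer> counts = countWords(tokens);
        System.out.println(counts);
    }
}
```

Running this once over the tokens of all normal mail and once over the tokens of all spam would give the two tables that the Bayes constructor expects.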

Below is the core class, Bayes, which implements the algorithm itself.
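
As a reading aid, the quantity the class stores for each word can be written out as a formula. The notation is mine, not the post's: g_w and b_w are the counts of word w in the good and bad tables, and N_g and N_b are the totals of all counts in each table.

```latex
P(\text{spam} \mid w) = \frac{b_w / N_b}{g_w / N_g + b_w / N_b},
\qquad N_g = \sum_{v} g_v, \quad N_b = \sum_{v} b_v
```

This is Bayes' rule with equal prior probabilities for spam and normal mail, so the priors cancel. For example, a word seen 9 times among 100 spam tokens and once among 100 normal-mail tokens gets 0.09 / (0.01 + 0.09) = 0.9.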

package com.cn.bayes;

import java.util.*;

public class Bayes {
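	// word -> number of occurrences in normal (good) mail
	private Hashtable<String, Integer> hashtable_good;
	// word -> number of occurrences in spam (bad) mail
	private Hashtable<String, Integer> hashtabel_bad;
	// word -> relative frequency within good mail
	private Hashtable<String, Double> hashtable_good_p;
	// word -> relative frequency within bad mail
	private Hashtable<String, Double> hashtable_bad_p;
	// word -> bad frequency / (good frequency + bad frequency)
	private Hashtable<String, Double> hashtable_probability;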

	public Hashtable<String, Double> getHashtable_probability() {
		return hashtable_probability;
	}

	public void setHashtable_probability(
			Hashtable<String, Double> hashtableProbability) {
		hashtable_probability = hashtableProbability;
	}

	public Bayes(Hashtable<String, Integer> hashtable_good,
			Hashtable<String, Integer> hashtabel_bad) {
		this.hashtable_good = hashtable_good;
		this.hashtabel_bad = hashtabel_bad;
		this.hashtable_good_p = this.getGoodOrBadPercent(hashtable_good);
		this.hashtable_bad_p = this.getGoodOrBadPercent(hashtabel_bad);

		Set<String> set_allkeys = this.combineHasetableByKeys(this
				.getHashtable_good(), this.getHashtabel_bad());
		this.hashtable_probability = this
				.calcHashtable_probability(set_allkeys);
	}

	/*
	 * Converts raw word counts into relative frequencies (count / total),
	 * producing hashtable_good_p or hashtable_bad_p.
	 */
	private Hashtable<String, Double> getGoodOrBadPercent(
			Hashtable<String, Integer> hashtable_goodOrBad) {
		Hashtable<String, Double> percent = new Hashtable<String, Double>();
		// Sum all counts to get the total number of word occurrences.
		int total = 0;
		for (Integer count : hashtable_goodOrBad.values()) {
			total += count;
		}
		// Divide each count by the total, using floating-point division.
		for (Map.Entry<String, Integer> entry : hashtable_goodOrBad.entrySet()) {
			percent.put(entry.getKey(), entry.getValue() / (double) total);
		}
		return percent;
	}

	/*
	 * Collects the keys of both hash tables into a single Set. A Set does not
	 * allow duplicate elements, so each word appears only once.
	 *
	 * Note: this is easy to extend to merge the keys of more than two tables.
	 */
	private Set<String> combineHasetableByKeys(
			Hashtable<String, Integer> hashtable_good,
			Hashtable<String, Integer> hashtabel_bad) {
		Set<String> allkeysSet = new HashSet<String>();
		allkeysSet.addAll(hashtable_good.keySet());
		allkeysSet.addAll(hashtabel_bad.keySet());
		return allkeysSet;
	}

	/*
	 * For every key in the Set, computes the probability associated with that
	 * word and stores the result in a hash table.
	 */
	private Hashtable<String, Double> calcHashtable_probability(
			Set<String> set_allkeys) {
		Hashtable<String, Double> hashtable_probability = new Hashtable<String, Double>();
		for (String key : set_allkeys) {
			Double good_p_value = this.hashtable_good_p.get(key);
			Double bad_p_value = this.hashtable_bad_p.get(key);
			// A word missing from one table has frequency 0 there.
			if (null == good_p_value) {
				good_p_value = 0.0;
			}
			if (null == bad_p_value) {
				bad_p_value = 0.0;
			}
			double result = good_p_value + bad_p_value;
			// Hashtable rejects null values (put throws NullPointerException),
			// so a word with zero frequency in both tables is skipped.
			if (result != 0.0) {
				hashtable_probability.put(key, bad_p_value / result);
			}
		}
		return hashtable_probability;
	}

	public Hashtable<String, Integer> getHashtable_good() {
		return hashtable_good;
	}

	public Hashtable<String, Integer> getHashtabel_bad() {
		return hashtabel_bad;
	}
}
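
The class above stops at per-word probabilities, and the post does not show how a whole mail is scored. One common way to do it, sketched below, is the naive Bayes combination P = (p1 p2 ... pn) / (p1 p2 ... pn + (1 - p1)(1 - p2) ... (1 - pn)). This is an assumption on my part and may differ from what the attached project does. The class SpamScore, the method score, the default for unknown words, and the clamping bounds are all hypothetical illustration values.

```java
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;

// Hypothetical sketch, not part of the attached project.
final class SpamScore {
    // Assumed value for words that have no usable entry in the table.
    static final double UNKNOWN_WORD_P = 0.4;
    // Assumed bounds that keep a single word from forcing the score to 0 or 1.
    static final double MIN_P = 0.01;
    static final double MAX_P = 0.99;

    // Combines per-word spam probabilities into one score for the whole mail.
    public static double score(List<String> words, Hashtable<String, Double> wordP) {
        double logSpam = 0.0; // log of the product of p_i
        double logHam = 0.0;  // log of the product of (1 - p_i)
        for (String w : words) {
            Double p = wordP.get(w);
            double pi = (p == null || p.isNaN()) ? UNKNOWN_WORD_P : p;
            pi = Math.max(MIN_P, Math.min(MAX_P, pi));
            logSpam += Math.log(pi);
            logHam += Math.log(1.0 - pi);
        }
        // prodP / (prodP + prodQ), rewritten with logs to avoid underflow.
        return 1.0 / (1.0 + Math.exp(logHam - logSpam));
    }

    public static void main(String[] args) {
        // Made-up probabilities standing in for getHashtable_probability().
        Hashtable<String, Double> wordP = new Hashtable<String, Double>();
        wordP.put("发票", 0.9);
        wordP.put("会议", 0.2);
        System.out.println(score(Arrays.asList("发票", "会议"), wordP));
    }
}
```

A mail would then be flagged as spam when the score exceeds a chosen threshold. Summing logarithms instead of multiplying the probabilities directly keeps long mails from underflowing to zero.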

The complete code is in the attachment. Comments and suggestions are welcome.

BayesTest.rar (2.8 MB)

2009-09-06 13:08
Comments (2)

#2 great656747 2012-05-08

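I'm working through this now and finding it a bit hard going, but there really isn't much material on this topic, so the post is very helpful.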

#1 ftp51423121 2010-02-08

I don't really understand this??
