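When your classification model has hundreds or thousands of features, as is the case for text classification, it's a good bet that many (if not most) of those features are low information. These features are common across all classes, so they contribute little to the classification process. Individually they are harmless, but in aggregate, low information features can decrease performance.
Eliminating low information features gives your model clarity by removing noisy data. It can save you from overfitting and the curse of dimensionality. When you use only the higher information features, you can increase performance while also decreasing the size of the model, which means less memory usage along with faster training and classification. Removing features may seem intuitively wrong, but wait until you see the results.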
High Information Feature Selection
Using the same evaluate_classifier method as in the previous post on classifying with bigrams, I got the following results with the 10000 most informative words:
evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
    magnificent = True    pos : neg = 15.0 : 1.0
    outstanding = True    pos : neg = 13.6 : 1.0
      insulting = True    neg : pos = 13.0 : 1.0
     vulnerable = True    pos : neg = 12.3 : 1.0
      ludicrous = True    neg : pos = 11.8 : 1.0
         avoids = True    pos : neg = 11.7 : 1.0
    uninvolving = True    neg : pos = 11.7 : 1.0
     astounding = True    pos : neg = 10.3 : 1.0
    fascination = True    pos : neg = 10.3 : 1.0
        idiotic = True    neg : pos =  9.8 : 1.0
Compare this with the sentiment classification from the first post, which used all words as features:
evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
    magnificent = True    pos : neg = 15.0 : 1.0
    outstanding = True    pos : neg = 13.6 : 1.0
      insulting = True    neg : pos = 13.0 : 1.0
     vulnerable = True    pos : neg = 12.3 : 1.0
      ludicrous = True    neg : pos = 11.8 : 1.0
         avoids = True    pos : neg = 11.7 : 1.0
    uninvolving = True    neg : pos = 11.7 : 1.0
     astounding = True    pos : neg = 10.3 : 1.0
    fascination = True    pos : neg = 10.3 : 1.0
        idiotic = True    neg : pos =  9.8 : 1.0
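Using only the best 10000 words, accuracy is more than 20 percentage points higher (0.728 to 0.93), pos precision rises by almost 24 points, and neg recall improves by more than 40 points. These are huge increases, with no reduction in pos recall and even a slight increase in neg precision. Here's the full code I used to get these results, followed by an explanation.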
import collections
import itertools

import nltk.classify.util
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist


def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # Train on the first 3/4 of each class, test on the remaining 1/4.
    negcutoff = len(negfeats) * 3 // 4
    poscutoff = len(posfeats) * 3 // 4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    # refsets holds the true label of each test item, testsets the predicted one.
    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
    print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
    print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
    print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
    print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
    classifier.show_most_informative_features()


def word_feats(words):
    return dict([(word, True) for word in words])


print('evaluating single word features')
evaluate_classifier(word_feats)

# Count each lowercased word overall and per label.
word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
    word_fd[word.lower()] += 1
    label_word_fd['pos'][word.lower()] += 1

for word in movie_reviews.words(categories=['neg']):
    word_fd[word.lower()] += 1
    label_word_fd['neg'][word.lower()] += 1

# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

# Score each word by summing its chi-square score for each label.
word_scores = {}

for word, freq in word_fd.items():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                                           (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                                           (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score

# Keep the 10000 highest scoring words.
best = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)[:10000]
bestwords = set([w for w, s in best])


def best_word_feats(words):
    return dict([(word, True) for word in words if word in bestwords])


print('evaluating best word features')
evaluate_classifier(best_word_feats)


def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    d = dict([(bigram, True) for bigram in bigrams])
    d.update(best_word_feats(words))
    return d


print('evaluating best words + bigram chi_sq word features')
evaluate_classifier(best_bigram_word_feats)
Calculating Information Gain
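To find the highest information features, we need to calculate information gain for each word. Information gain for classification is a measure of how common a feature is in one particular class compared to how common it is in all other classes. A word that occurs mostly in positive movie reviews and rarely in negative ones is high information. For example, the presence of "magnificent" in a movie review is a strong indicator that the review is positive, which makes "magnificent" a high information word. In the code above, each word is scored with BigramAssocMeasures.chi_sq once per label, the two scores are summed, and the 10000 highest scoring words are kept. Notice that the most informative features above did not change. That makes sense, because the point is to use only the most informative features and ignore the rest.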
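To make the scoring step concrete, here is a minimal sketch of the same idea in plain Python. It assumes the standard Pearson chi-square for a 2x2 contingency table, with arguments named after the n_ii, n_ix, n_xi and n_xx comments in the code above; it is my own reimplementation so the example runs without NLTK, not NLTK's code, and the two tiny "reviews" are invented for illustration.

```python
from collections import Counter


def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Pearson chi-square for a 2x2 table (a sketch, not NLTK's implementation).

    n_ii: count of the word within the label
    n_ix: count of the word across all labels
    n_xi: count of all words within the label
    n_xx: count of all words across all labels
    """
    n_io = n_ix - n_ii                 # this word, other label
    n_oi = n_xi - n_ii                 # other words, this label
    n_oo = n_xx - n_ii - n_io - n_oi   # other words, other label
    denom = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    if denom == 0:
        return 0.0
    return n_xx * (n_ii * n_oo - n_io * n_oi) ** 2 / denom


# Hypothetical toy data, invented for this example only.
pos_words = "magnificent cast and a magnificent story the film is good".split()
neg_words = "idiotic plot and a dull story the film is bad".split()

pos_counts = Counter(pos_words)
neg_counts = Counter(neg_words)
all_counts = pos_counts + neg_counts
pos_total = len(pos_words)
neg_total = len(neg_words)
total = pos_total + neg_total

# Sum the per-label scores, as the article's code does for each word.
word_scores = {}
for word, freq in all_counts.items():
    pos_score = chi_sq(pos_counts[word], freq, pos_total, total)
    neg_score = chi_sq(neg_counts[word], freq, neg_total, total)
    word_scores[word] = pos_score + neg_score

ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
for word, score in ranked[:3]:
    print(word, round(score, 3))
```

Words that appear equally often in both toy reviews, such as "and", score zero, while "magnificent", which appears only on the positive side, ranks first. Real corpora are far larger, so treat the numbers here as illustrative only.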
Significant Bigrams
The code above also evaluates the inclusion of 200 significant bigram collocations. Here are the results:
evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
           magnificent = True    pos : neg = 15.0 : 1.0
           outstanding = True    pos : neg = 13.6 : 1.0
             insulting = True    neg : pos = 13.0 : 1.0
            vulnerable = True    pos : neg = 12.3 : 1.0
     ('matt', 'damon') = True    pos : neg = 12.3 : 1.0
        ('give', 'us') = True    neg : pos = 12.3 : 1.0
             ludicrous = True    neg : pos = 11.8 : 1.0
           uninvolving = True    neg : pos = 11.7 : 1.0
                avoids = True    pos : neg = 11.7 : 1.0
   ('absolutely', 'no') = True    neg : pos = 10.6 : 1.0
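This shows that bigrams don't matter much when you use only high information words: accuracy is 0.92 with bigrams versus 0.93 without. In this case, the best way to judge the difference between including bigrams or not is to look at precision and recall. With bigrams, you get more uniform performance across the two classes. Without bigrams, precision and recall are less balanced. But the differences may depend on your particular data, so don't assume these observations always hold.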
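The balance between the classes is easier to see with raw counts. The sketch below assumes 250 test reviews per class (the 3/4 split applied to 1000 reviews per label in movie_reviews), and the "correct" counts are ones I inferred by multiplying the reported recall by 250; they are not taken from the classifier's output. The helper names are hypothetical.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for one label from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def summarize(pos_correct, neg_correct, per_class=250):
    """Compute accuracy, precision and recall for a two-class test set.

    per_class=250 is an assumption about the test split, not a reported value.
    """
    pos_missed = per_class - pos_correct   # pos reviews labelled neg
    neg_missed = per_class - neg_correct   # neg reviews labelled pos
    pos_precision, pos_recall = precision_recall(pos_correct, neg_missed, pos_missed)
    neg_precision, neg_recall = precision_recall(neg_correct, pos_missed, neg_missed)
    return {
        "accuracy": (pos_correct + neg_correct) / (2 * per_class),
        "pos precision": pos_precision,
        "pos recall": pos_recall,
        "neg precision": neg_precision,
        "neg recall": neg_recall,
    }


# Inferred counts: 0.98 * 250 = 245 and 0.88 * 250 = 220 without bigrams,
# 0.928 * 250 = 232 and 0.912 * 250 = 228 with bigrams.
best_words = summarize(pos_correct=245, neg_correct=220)
with_bigrams = summarize(pos_correct=232, neg_correct=228)

for name, result in [("best words", best_words), ("best words + bigrams", with_bigrams)]:
    print(name)
    for metric, value in result.items():
        print("  %s: %.6f" % (metric, value))
```

Under these assumptions, the best-words run misses 5 positive and 30 negative reviews, while the bigram run misses 18 and 22. The total number of errors is nearly the same, but the bigram run spreads them more evenly across the classes.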
Improving Feature Selection
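The big lesson here is that improving feature selection will improve your classifier. Reducing dimensionality is one of the best things you can do to improve classifier performance. It's fine to throw away data that adds no value, and it's especially advisable when that data is actually making your model worse.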
Source: http://lib.csdn.net/article/machinelearning/36340