Python: Trouble understanding chi-squared feature selection
I've been having trouble understanding chi-squared feature selection. I have two classes, positive and negative, each containing different terms and term counts. I need to perform chi-squared feature selection to extract the most representative terms for each class. The problem is that I end up with exactly the same terms for both the positive and the negative class. Here is my Python code for selecting features:
#!/usr/bin/python

# import the necessary libraries
import math


class ChiFeatureSelector:
    def __init__(self, extCorpus, lookupCorpus):
        # store the extraction corpus and lookup corpus
        self.extCorpus = extCorpus
        self.lookupCorpus = lookupCorpus

    def select(self, outPath):
        # dictionary of chi-squared scores
        scores = {}

        # loop over the words in the extraction corpus
        for w in self.extCorpus.getTerms():
            # build the chi-squared table
            n11 = float(self.extCorpus.getTermCount(w))
            n10 = float(self.lookupCorpus.getTermCount(w))
            n01 = float(self.extCorpus.getTotalDocs() - n11)
            n00 = float(self.lookupCorpus.getTotalDocs() - n10)

            # perform the chi-squared calculation and store
            # the score in the dictionary
            a = n11 + n10 + n01 + n00
            b = ((n11 * n00) - (n10 * n01)) ** 2
            c = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
            chi = (a * b) / c
            scores[w] = chi

        # sort the scores in descending order
        scores = sorted([(v, k) for (k, v) in scores.items()], reverse=True)

        # print the ten highest-scoring terms
        i = 0
        for (v, k) in scores:
            print(str(k) + " : " + str(v))
            i += 1
            if i == 10:
                break
This is how I use the class (some code omitted for brevity, and yes, I have checked that the two corpora do not contain exactly the same data):
# perform positive ngram feature selection
print("positive:\n")
f = ChiFeatureSelector(posCorpus, negCorpus)
f.select(posOutputPath)

# perform negative ngram feature selection
print("\nnegative:\n")
f = ChiFeatureSelector(negCorpus, posCorpus)
f.select(negOutputPath)
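I suspect the error comes from how I build the term/document table, but I'm not sure. Maybe I'm misunderstanding something. Can someone point me in the right direction?

In the two-class case, the chi-squared ranking is the same when the two datasets are swapped: the top-ranked terms are the features that differ most between the two classes.

Can you add some sample data from extCorpus and lookupCorpus (sorry, negCorpus and posCorpus)? Just enough to see the structure...

+1. Feature selection won't give you "strongly positive" and "strongly negative" features; it gives you strongly discriminative features. By the way, the same holds in the multi-class case.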
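The symmetry described in the answer can be checked with a small sketch. It reuses the 2x2 table layout from the question (n11, n10, n01, n00), but the function names `chi_squared` and `overrepresented_in` and the counts are hypothetical, chosen only for illustration. The second helper is not part of the original answer; it shows one possible way to tell which class a discriminative term leans toward, using the sign of `n11 * n00 - n10 * n01`.

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared score for a 2x2 term/class table.

    n11: documents in the first class that contain the term
    n10: documents in the second class that contain the term
    n01: documents in the first class without the term
    n00: documents in the second class without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    # Guard against an empty row or column in the table.
    return numerator / denominator if denominator else 0.0


def overrepresented_in(n11, n10, n01, n00):
    """Return which class the term is relatively more frequent in."""
    diff = n11 * n00 - n10 * n01
    if diff > 0:
        return "first"
    if diff < 0:
        return "second"
    return "neither"


# Hypothetical counts: the term occurs in 30 of 100 positive documents
# and in 5 of 100 negative documents.
pos_vs_neg = (30, 5, 70, 95)
# Swapping the corpora swaps the columns of the table.
neg_vs_pos = (5, 30, 95, 70)

# The score is identical in both directions, so both rankings agree.
print(chi_squared(*pos_vs_neg) == chi_squared(*neg_vs_pos))
# The sign of the cross-product difference still records the direction.
print(overrepresented_in(*pos_vs_neg), overrepresented_in(*neg_vs_pos))
```

Because the difference is squared and the denominator is a product of all four marginals, swapping the corpora cannot change the score. Filtering the ranked terms by direction, as sketched above, would be one way to obtain separate lists per class.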