Interpreting the sum of TF-IDF scores of words across documents in Python

First, let's extract the TF-IDF scores for each term in each document:
from gensim import corpora, models, similarities

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Tokenize, lowercase, and drop stopwords
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Build the id->token dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit the TF-IDF model and transform the corpus
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
Printing it out:

for doc in corpus_tfidf:
    print(doc)
[out]:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
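The scores above follow gensim's default weighting: raw term count times idf = log2(N/df), L2-normalized within each document. As a sanity check, here is a minimal pure-Python sketch (no gensim required; the helper name `tfidf_weights` is mine, not the library's) that reproduces the first document's scores under that scheme:

```python
import math

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[w for w in d.lower().split() if w not in stoplist] for d in documents]

# Document frequency of every token
df = {}
for text in texts:
    for w in set(text):
        df[w] = df.get(w, 0) + 1
N = len(texts)

def tfidf_weights(text):
    # raw count * log2(N / df), then L2-normalize (gensim's defaults)
    counts = {}
    for w in text:
        counts[w] = counts.get(w, 0) + 1
    weights = {w: c * math.log2(N / df[w]) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

scores = tfidf_weights(texts[0])
print(round(scores['machine'], 6))  # 0.430102, matching gensim's output above
print(round(scores['human'], 6))    # 0.29442; 'human' also appears in doc 3
```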
If we want to find the "saliency" or "importance" of the words within this corpus, can we simply sum the tf-idf scores across all documents and divide by the number of documents?
>>> from collections import Counter
>>> tfidf_saliency = Counter()
>>> for doc in corpus_tfidf:
...     for word, score in doc:
...         tfidf_saliency[word] += score / len(corpus_tfidf)
...
>>> tfidf_saliency
Counter({7: 0.12182694202050007, 8: 0.11121194156107769, 26: 0.10886469856464989, 29: 0.10093919463036093, 9: 0.09022272408985754, 14: 0.08705221175200946, 25: 0.08482488519466996, 6: 0.08143359568202602, 10: 0.07480097322359022, 12: 0.07480097322359022, 4: 0.07411881371164887, 13: 0.07117278898823597, 5: 0.07104525967490458, 27: 0.07027283689263066, 28: 0.07027283689263066, 11: 0.060487023243988705, 15: 0.055997035904387725, 16: 0.055997035904387725, 21: 0.05389680556362955, 22: 0.05389680556362955, 23: 0.05389680556362955, 24: 0.05389680556362955, 17: 0.048785635947490406, 18: 0.048785635947490406, 19: 0.048785635947490406, 20: 0.048785635947490406, 0: 0.04778910634833961, 1: 0.04778910634833961, 2: 0.04778910634833961, 3: 0.04778910634833961, 30: 0.045480127933079706, 31: 0.045480127933079706, 32: 0.045480127933079706, 33: 0.045480127933079706, 34: 0.045480127933079706})
From the output, can we assume that the most "prominent" words in the corpus are:
>>> dictionary[7]
u'system'
>>> dictionary[8]
u'survey'
>>> dictionary[26]
u'graph'
If so, what is the mathematical interpretation of the sum of TF-IDF scores of words across documents?

Saliency can be computed in two contexts.
So in general, mathematically, I would expect that summing gives you an undesirable averaging effect. A better interpretation of TF-IDF over a corpus is, for a given term, its highest TF-IDF anywhere in the corpus. Finding the top words in the corpus:
topWords = {}
for doc in corpus_tfidf:
    for iWord, tf_idf in doc:
        if iWord not in topWords:
            topWords[iWord] = 0
        if tf_idf > topWords[iWord]:
            topWords[iWord] = tf_idf

for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
    print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
    if i == 6: break
Output comparison table. Note: I couldn't get gensim to create a dictionary matching corpus_tfidf, so only word indices can be shown:
Question tfidf_saliency topWords(corpus_tfidf) Other TF-IDF implementation
---------------------------------------------------------------------------
1: Word(7) 0.121 1: Word(13) 0.640 1: paths 0.376019
2: Word(8) 0.111 2: Word(27) 0.632 2: intersection 0.376019
3: Word(26) 0.108 3: Word(28) 0.632 3: survey 0.366204
4: Word(29) 0.100 4: Word(8) 0.628 4: minors 0.366204
5: Word(9) 0.090 5: Word(29) 0.628 5: binary 0.300815
6: Word(14) 0.087 6: Word(11) 0.544 6: generation 0.300815
The calculation of TF-IDF always takes the corpus into account.
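To illustrate this corpus dependence, a small plain-Python sketch (the mini-corpus and the helper `idf` are mine, chosen for illustration): the idf component log2(N/df), gensim's default global weighting, changes as soon as the set of documents changes.

```python
import math

def idf(term, corpus):
    """idf = log2(N / df), gensim's default global weighting."""
    n = len(corpus)
    df = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log2(n / df)

docs = ["The generation of random binary unordered trees",
        "The intersection graph of paths in trees",
        "Graph minors IV Widths of trees and well quasi ordering",
        "Graph minors A survey"]

print(idf("trees", docs))      # 'trees' is in 3 of 4 docs -> log2(4/3)
print(idf("trees", docs[:3]))  # in every doc of the smaller corpus -> 0.0
```

The same term's weight can even drop to zero when the corpus shrinks to documents that all contain it.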
Tested with Python 3.4.2.

This is a great discussion. Thanks for starting this thread. The idea of including document length (@avip) seems interesting. We'll have to experiment and check the results. In the meantime, let me try asking the question a little differently: what are we trying to interpret when querying for TF-IDF relevance scores?
1. Possibly trying to understand word relevance at the document level
2. Possibly trying to understand word relevance per class
3. Possibly trying to understand word relevance overall (in the whole corpus)
Results:
# Result post computing TF-IDF relevance scores
array([[ 0.81940995, 0. , 0.57320793],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 0.47330339, 0.88089948, 0. ],
[ 0.58149261, 0. , 0.81355169]])
# Result post aggregation (Sum, Mean)
[[ 4.87420595 0.88089948 1.38675962]]
[[ 0.81236766 0.14681658 0.2311266 ]]
If we observe closely, feature 1, which occurs in all documents, is not ignored completely, because the sklearn implementation of idf is log[n/df(d, t)] + 1. The +1 is added so that an important word that happens to occur in all documents is not zeroed out. E.g. the word "bike" occurring frequently when classifying a particular document as "motorcycle" (20_newsgroups dataset).
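A quick numeric sketch of that formula in plain Python (note that sklearn's TfidfTransformer actually defaults to the smoothed variant log((1+n)/(1+df)) + 1; the non-smoothed form below matches the equation in the text):

```python
import math

def sklearn_idf(n_docs, df, smooth=False):
    # idf = log(n / df) + 1 (smooth_idf=False); the +1 keeps terms that
    # occur in every document from being zeroed out entirely
    if smooth:
        return math.log((1 + n_docs) / (1 + df)) + 1.0
    return math.log(n_docs / df) + 1.0

print(sklearn_idf(6, 6))  # term in all 6 docs -> 1.0, not 0
print(sklearn_idf(6, 1))  # a rarer term gets a larger idf
```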
Now, regarding the first two questions: there one is trying to interpret and understand the top common features that may occur in documents. In that case, aggregating in some form, including all possible occurrences of a word across documents, doesn't take anything away, even mathematically. In my opinion such a query makes a lot of sense for exploring a dataset and helping understand what it is about. The same logic could also be applied to vectorizing using hashing.
relevance_score = mean(tf(t,d) * idf(t,d))
                = mean( (bias + init_wt * F(t,d) / max{F(t',d)}) * (log(N / df(d,t)) + 1) )
Question 3 is very important, as it may also contribute to which features get selected when building a predictive model. Using TF-IDF scores alone for feature selection can be misleading at multiple levels. Adopting a more theoretical statistical test such as "chi2", coupled with TF-IDF relevance scores, may be a better approach; such a statistical test also evaluates the importance of a feature in relation to its respective target class.

And of course, combining such interpretation with the model's learned feature weights would be very helpful in understanding the importance of text-derived features completely.
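To make the chi2 idea concrete, here is a hedged sketch (pure Python, not the original poster's code) of the chi-squared statistic for a 2x2 table of one binary feature ("term present?") against a binary class label; sklearn's feature_selection.chi2 applies the same kind of test to count/TF-IDF matrices:

```python
def chi2_2x2(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table:
    a: term present, class 1    b: term present, class 0
    c: term absent,  class 1    d: term absent,  class 0
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# A term strongly associated with class 1 scores high...
print(chi2_2x2(8, 2, 2, 8))  # 7.2
# ...while a term independent of the class scores 0.
print(chi2_2x2(5, 5, 5, 5))  # 0.0
```

Ranking terms by this statistic (instead of, or alongside, aggregated TF-IDF) ties feature importance to the target class rather than to the corpus alone.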
Consider the term-document TF-IDF matrix sketched below:

[terms/docs : doc1,         doc2,         doc3, .....,  docn
 term1      : tfidf(doc1),  tfidf(doc2),  tfidf(doc3), .....
 .
 .
 termn      : ........ ]