Calculating the TF-IDF of a single word in a text with Python

Tags: python, machine-learning, nlp, spacy, textacy

I am trying to use textacy to calculate the TF-IDF score of a single word against a standard corpus, but I am a bit unclear about the result I am getting.

I was expecting a single float representing the frequency of the word in the corpus. So why am I getting a list of 7 results?

"acculer" is actually a French word, so I was expecting a result of 0 from an English corpus.

word = 'acculer'
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
tf_idf = vectorizer.fit_transform(word)
logger.info("tf_idf:")
logger.info(tf_idf)
Output

tf_idf:
(0, 0)  2.386294361119891
(1, 1)  1.9808292530117262
(2, 1)  1.9808292530117262
(3, 5)  2.386294361119891
(4, 3)  2.386294361119891
(5, 2)  2.386294361119891
(6, 4)  2.386294361119891
The second part of my question is: how can I provide my own corpus to the TF-IDF function in textacy, especially one in a different language?

EDIT

As mentioned by @Vishal, I have logged the output using the following line:

logger.info(vectorizer.vocabulary_terms)
It seems that the provided word acculer has been split into individual characters:

{'a': 0, 'c': 1, 'u': 5, 'l': 3, 'e': 2, 'r': 4}
(1) How can I get the TF-IDF for this word against the corpus, rather than for each character?

(2) How can I provide my own corpus and point to it as a parameter?


(3) Can TF-IDF be used at the sentence level? i.e. what is the relative frequency of this sentence's terms with respect to the corpus?

You can get the TF-IDF based on the words of the corpus:

docs = ['this is me','this was not that you thought', 'lets test them'] ## create a list of documents
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(docs) ##fit your documents

print(vec.vocabulary_) #print vocabulary, don't run for 2.5 million documents
Output: the vocabulary; a unique index is assigned to each word.

{u'me': 2, u'them': 6, u'that': 5, u'this': 7, u'is': 0, u'thought': 8, u'not': 3, u'lets': 1, u'test': 4, u'you': 10, u'was': 9}

print(vec.idf_) 
Output: prints the idf value of every vocabulary word.

[ 1.69314718  1.69314718  1.69314718  1.69314718  1.69314718  1.69314718 1.69314718  1.28768207  1.69314718  1.69314718  1.69314718]
Now, coming to your question, suppose you want to find the value for a given word; you can get it as follows:

word = 'thought' #example    
index = vec.vocabulary_[word] 
>8
print(vec.idf_[index]) #prints idf value
>1.6931471805599454
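
If you also want the tf-idf value (rather than just the idf) of that word inside a particular document, a minimal sketch, reusing the vec and docs fitted above, is to transform the documents and read the word's column for the row of interest:

doc_vectors = vec.transform(docs)      # sparse tf-idf matrix, one row per document
index = vec.vocabulary_['thought']     # column index of the word
print(doc_vectors[1, index])           # tf-idf of 'thought' in the second document

Note that TfidfVectorizer applies l2 normalization to each row by default, so this is the normalized tf-idf value.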

Now doing the same with textacy:

import spacy
nlp = spacy.load('en') ## install it by python -m spacy download en (run as administrator)

doc_strings = [
    'this is me','this was not that you thought', 'lets test them'
]
docs = [nlp(string.lower()) for string in doc_strings]
corpus = textacy.Corpus(nlp,docs =docs)
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
doc_term_matrix = vectorizer.fit_transform((doc.to_terms_list(ngrams=1, normalize='lower',as_strings=True,filter_stops=False) for doc in corpus))

print(vectorizer.terms_list)
print(doc_term_matrix.toarray())
Output

['is', 'lets', 'me', 'not', 'test', 'that', 'them', 'this', 'thought','was', 'you']


[[1.69314718 0.         1.69314718 0.         0.         0.
  0.         1.28768207 0.         0.         0.        ]
 [0.         0.         0.         1.69314718 0.         1.69314718
  0.         1.28768207 1.69314718 1.69314718 1.69314718]
 [0.         1.69314718 0.         0.         1.69314718 0.
  1.69314718 0.         0.         0.         0.        ]]

Basics: Before getting into the actual problem, let us be clear about the definitions.

Assume our corpus contains 3 documents (d1, d2 and d3 respectively): "this is a red apple", "this is a green apple" and "this is a cat".

Term frequency (tf): tf(word) is defined as the number of times the word appears in a document:

tf(word, document) = count(word, document) # Number of times word appears in the document
tf is defined for a word at the document level.

tf('a',d1)     = 1      tf('a',d2)     = 1      tf('a',d3)     = 1
tf('apple',d1) = 1      tf('apple',d2) = 1      tf('apple',d3) = 0
tf('cat',d1)   = 0      tf('cat',d2)   = 0      tf('cat',d3)   = 1
tf('green',d1) = 0      tf('green',d2) = 1      tf('green',d3) = 0
tf('is',d1)    = 1      tf('is',d2)    = 1      tf('is',d3)    = 1
tf('red',d1)   = 1      tf('red',d2)   = 0      tf('red',d3)   = 0
tf('this',d1)  = 1      tf('this',d2)  = 1      tf('this',d3)  = 1
One problem with using raw counts is that the tf values of words in longer documents are higher than those in shorter documents. This can be solved by normalizing the raw counts by the document length (the number of words in the corresponding document). This is called l1 normalization. The document d1 can now be represented by a tf vector containing the tf values of all the words in the corpus vocabulary. There is also another kind of normalization called l2, which makes the l2 norm of a document's tf vector equal to 1.

tf(word, document, normalize='l1') = count(word, document)/|document|
tf(word, document, normalize='l2') = count(word, document)/l2_norm(document)
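
To make the two normalizations concrete, here is a minimal pure-Python sketch (not textacy, just illustrating the formulas above) that computes the raw, l1-normalized and l2-normalized tf values for d1:

import math
from collections import Counter

d1 = "this is a red apple".split()
raw_tf = Counter(d1)                                   # raw counts per word
l1_tf = {w: c / len(d1) for w, c in raw_tf.items()}    # divide by the document length
l2_norm = math.sqrt(sum(c * c for c in raw_tf.values()))
l2_tf = {w: c / l2_norm for w, c in raw_tf.items()}    # divide by the l2 norm of the count vector
print(raw_tf, l1_tf, l2_tf, sep="\n")                  # l1 values are 0.2, l2 values are ~0.447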
Code: tf

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
# Convert docs to textacy format
textacy_docs = [textacy.Doc(doc) for doc in corpus]

for norm in [None, 'l1', 'l2']:
    # tokenize the documents
    tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

    # Fit the tf matrix 
    vectorizer = textacy.Vectorizer(apply_idf=False, norm=norm)
    doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

    print ("\nVocabulary: ", vectorizer.vocabulary_terms)
    print ("TF with {0} normalize".format(norm))
    print (doc_term_matrix.toarray())
Output:

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with None normalize
[[1 1 0 0 1 1 1]
 [1 1 0 1 1 0 1]
 [1 0 1 0 1 0 1]]

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l1 normalize
[[0.2  0.2  0.   0.   0.2  0.2  0.2 ]
 [0.2  0.2  0.   0.2  0.2  0.   0.2 ]
 [0.25 0.   0.25 0.   0.25 0.   0.25]]

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l2 normalize
[[0.4472136 0.4472136 0.        0.        0.4472136 0.4472136 0.4472136]
 [0.4472136 0.4472136 0.        0.4472136 0.4472136 0.        0.4472136]
 [0.5       0.        0.5       0.        0.5       0.        0.5      ]]
The rows of the tf matrix correspond to the documents (hence 3 rows for our corpus) and the columns correspond to each word of the vocabulary (the index of each word is shown in the vocabulary dictionary).
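
So, to read the value for a single word in a single document, a minimal sketch (reusing a vectorizer and doc_term_matrix pair fitted above) is to look up the word's column index in vocabulary_terms:

col = vectorizer.vocabulary_terms['apple']   # column index of the word
row = 0                                      # d1 is the first document
print(doc_term_matrix.toarray()[row, col])   # value of 'apple' in d1 for the fitted matrix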

Inverse document frequency (idf): Some words convey less information than others. For example, words like a, an, this are very common and convey very little information. idf is a measure of the importance of a word: a word that appears in many documents is considered less informative than a word that appears in only a few documents.
idf(word, corpus) = log(|corpus| / No:of documents containing word) + 1  # standard idf
For our corpus, intuitively idf('apple', corpus) < idf('cat', corpus), since 'apple' appears in two documents but 'cat' in only one.
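
To verify this intuition, here is a minimal pure-Python sketch of the standard idf formula applied to our three documents:

import math

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
doc_words = [set(doc.split()) for doc in corpus]

def idf(word):
    df = sum(word in words for words in doc_words)   # number of documents containing the word
    return math.log(len(doc_words) / df) + 1         # standard idf

print(idf('apple'), idf('cat'))                      # ~1.405 and ~2.098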

Code: idf

textacy_docs = [textacy.Doc(doc) for doc in corpus]    
tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

vectorizer = textacy.Vectorizer(apply_idf=False, norm=None)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

print ("\nVocabulary: ", vectorizer.vocabulary_terms)
print ("standard idf: ")
print (textacy.vsm.matrix_utils.get_inverse_doc_freqs(doc_term_matrix, type_='standard'))
Output:

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
standard idf: 
[1.     1.405       2.098       2.098       1.      2.098       1.]
Term frequency–inverse document frequency (tf-idf): tf-idf is a measure of how important a word is to a document in a corpus. The tf of a word weighted by its idf gives the tf-idf measure of the word:

tf-idf(word, document, corpus) = tf(word, document) * idf(word, corpus)
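
Combining the tables above as a quick check: tf('apple', d1) = 1 and idf('apple', corpus) ≈ 1.405, so tf-idf('apple', d1, corpus) ≈ 1.405, which is exactly the value that appears in the first row of the tf-idf matrix below. A minimal sketch of that arithmetic:

import math

tf_apple_d1 = 1                     # from the tf table above
idf_apple = math.log(3 / 2) + 1     # 'apple' appears in 2 of the 3 documents
print(tf_apple_d1 * idf_apple)      # ~1.405, matching the tf-idf matrix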
Code: tf-idf

textacy_docs = [textacy.Doc(doc) for doc in corpus]

tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

print ("\nVocabulary: ", vectorizer.vocabulary_terms)
print ("tf-idf: ")

vectorizer = textacy.Vectorizer(apply_idf=True, norm=None, idf_type='standard')
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print (doc_term_matrix.toarray())
Output:

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
tf-idf: 
[[1.         1.405   0.         0.         1.         2.098   1.        ]
 [1.         1.405   0.         2.098      1.         0.      1.        ]
 [1.         0.      2.098      0.         1.         0.      1.        ]]
Now coming to your questions: (1) How can I get the TF-IDF for this word against the corpus, rather than for each character?

As defined above, there is no tf-idf of a word in isolation; the tf-idf of a word is always with respect to a document in a corpus.

(2) How can I provide my own corpus and point to it as a parameter?

This is shown in the samples above:

  • Convert the text documents to textacy Docs using the textacy.Doc API.
  • Tokenize the textacy.Doc objects using the to_terms_list method. (With this method you can add unigrams, bigrams or trigrams to the vocabulary, filter out stop words, normalize the text, etc.)
  • Use textacy.Vectorizer to create the term matrix from the tokenized documents. The term matrix returned is:
    • tf (raw counts): apply_idf=False, norm=None
    • tf (l1 normalized): apply_idf=False, norm='l1'
    • tf (l2 normalized): apply_idf=False, norm='l2'
    • tf-idf (standard): apply_idf=True, idf_type='standard'
(3) Can TF-IDF be used at the sentence level? i.e. what is the relative frequency of this sentence's terms with respect to the corpus?

Yes you can, if and only if you treat every sentence as a separate document. In that case the tf-idf vector of the corresponding document (the entire row) can be treated as a vector representation of that document (a single sentence in your case).

In the case of our corpus (which in fact contains one sentence per document), the vector representations of d1 and d2 should be close compared to the vectors of d1 and d3. Let us check with cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity  # assuming sklearn's pairwise cosine_similarity
cosine_similarity(doc_term_matrix)
    
Output

array([[1.        , 0.53044716, 0.35999211],
       [0.53044716, 1.        , 0.35999211],
       [0.35999211, 0.35999211, 1.        ]])
    
As you can see, cosine_similarity(d1, d2) = 0.53 and cosine_similarity(d1, d3) = 0.35, so indeed d1 and d2 are more similar than d1 and d3 (1 means exactly similar and 0 means not similar, i.e. orthogonal vectors).

Once you have trained the vectorizer, you can pickle the trained object to disk for later use.
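
A minimal sketch of that, using Python's standard pickle module (the file name here is just an example):

import pickle

# save the fitted vectorizer to disk
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# later: load it back and reuse it on newly tokenized documents
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)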

Conclusion: The tf of a word is at the document level, the idf of a word is at the corpus level, and the tf-idf of a word is for a document with respect to a corpus. They are well suited for vector representations of a document (or a sentence).