Python: average of word embeddings weighted by TF-IDF scores

python machine-learning nlp

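I have been working on a Python script that classifies whether an article's headline is related to its body. For this I am using ML (an SVM classifier) with several features, including the average of the word embeddings.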

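The code that computes the similarity between the averaged word embeddings of a list of headlines and their bodies is as follows: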

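import gensim
import numpy as np
from scipy.spatial import distance
from tqdm import tqdm

# `stop`, `clean` and `lemmatize_str` are defined elsewhere and are not shown here.
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
setm = set(word2vec_model.index2word)  # gensim < 4.0 attribute; newer releases call it index_to_key

def avg_feature_vector(words, model, num_features, index2word_set):
    # Average the vectors of all in-vocabulary, non-stop words in a paragraph.
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set and word not in stop:
            try:
                featureVec = np.add(featureVec, model[word])
                nwords += 1
            except KeyError:
                # Word has no vector in the model: skip it.
                pass
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec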

def doc_similatiry(headlines, bodies):
    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        headline_avg_vector = avg_feature_vector(lemmatize_str(clean(headline)).split(), word2vec_model, 300, setm)
        body_avg_vector = avg_feature_vector(lemmatize_str(clean(body)).split(), word2vec_model, 300, setm)
        similarity =  1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)
    return X, docs

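The plain word2vec average seems to be computed correctly. However, it scores worse than TF-IDF cosine similarity on its own. So my idea was to combine the two features, that is, to multiply the word2vec vector of each word by that word's TF-IDF score.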
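Written out, the combined feature described above is a weighted average of word vectors. The notation below is introduced here for illustration and is an assumption, not something taken from a library: $v_w$ is the word2vec vector of word $w$, $t_{w,d}$ is the TF-IDF score of $w$ in document $d$, and $W_d$ is the set of words of $d$ that survive the vocabulary and stop-word filters.

```latex
\[
  \bar{v}_d \;=\; \frac{1}{|W_d|} \sum_{w \in W_d} t_{w,d}\, v_w
\]
```

Each word vector is scaled by its own weight before the sum is taken. Because cosine similarity does not change when a nonzero vector is multiplied by a positive constant, dividing by $|W_d|$ or by $\sum_{w \in W_d} t_{w,d}$ yields the same similarity score.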

Here is my code:

def avg_feature_vector(words, model, num_features, index2word_set, tfidf_vec, vec_repr, pos):
    # Average all word vectors in a paragraph, using the TF-IDF score of each word.
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0

    for word in words:
        if word in index2word_set and word not in stop:
            try:
                a = tfidf_vec.vocabulary_[word]  # column index of the word in the TF-IDF matrix
                featureVec = np.add(featureVec, model[word]) * vec_repr[pos, a]
                nwords += 1
            except KeyError:
                # Word missing from the TF-IDF vocabulary: skip it.
                pass
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry_with_tfidf(headlines, bodies):

    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        docs.append(lemmatize_str(clean(headline)))
        docs.append(lemmatize_str(clean(body)))
    vectorizer = TfidfVectorizer(norm='l2',min_df=0, use_idf=True, smooth_idf=True, stop_words=stop, sublinear_tf=True)
    sklearn_representation = vectorizer.fit_transform(docs)

    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        a = (clean(headline))
        headline_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i)
        a = (clean(body))
        body_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i+1)

        similarity =  1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)

    return X, docs

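My problem is that this method gives poor results, and I don't know whether there is some logic that explains this (in theory it should do better) or whether there is a bug in my code.

Can someone help me figure this out? I am also open to new approaches to this problem.

Note: some of the functions used here are not shown, because I didn't think they were necessary. If anything is unclear, I will explain it here.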

Do you want to compute a TF-IDF-weighted value for each paragraph?
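If a TF-IDF-weighted average is the goal, one thing worth checking in the question's second `avg_feature_vector` is the line `featureVec = np.add(featureVec, model[word]) * vec_repr[pos, a]`. The multiplication is applied to the result of `np.add`, so the whole running sum is rescaled on every iteration instead of only the current word vector. The sketch below contrasts the two behaviours on made-up data. The helper names, the toy embeddings and the toy weights are all hypothetical; plain dictionaries stand in for the word2vec model and the TF-IDF matrix.

```python
import numpy as np


def weighted_avg_vector(words, embeddings, weights, num_features):
    """Weighted average of word vectors (hypothetical helper).

    embeddings: mapping word -> 1-D array of length num_features
    weights:    mapping word -> non-negative weight, e.g. a TF-IDF score
    """
    total = np.zeros(num_features, dtype="float32")
    weight_sum = 0.0
    for word in words:
        if word in embeddings and word in weights:
            w = weights[word]
            # Scale only this word's vector, then add it to the running sum.
            total += w * np.asarray(embeddings[word], dtype="float32")
            weight_sum += w
    if weight_sum > 0:
        total /= weight_sum
    return total


def running_product(words, embeddings, weights, num_features):
    """Mirrors the accumulation pattern used in the question."""
    acc = np.zeros(num_features, dtype="float32")
    for word in words:
        if word in embeddings and word in weights:
            # The entire accumulated sum is multiplied by the current weight.
            acc = np.add(acc, embeddings[word]) * weights[word]
    return acc


# Toy 2-D embeddings and weights, invented purely for illustration.
toy_embeddings = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
toy_weights = {"cat": 0.5, "dog": 0.25}

good = weighted_avg_vector(["cat", "dog"], toy_embeddings, toy_weights, 2)
bad = running_product(["cat", "dog"], toy_embeddings, toy_weights, 2)
print("weighted average:", good)
print("running product: ", bad)
```

In the running-product version, earlier words end up multiplied by the weights of every later word, so the result depends on word order and no longer matches a weighted average. A second thing that may be worth checking: the vectorizer is fitted on lemmatized text and `TfidfVectorizer` lowercases by default, while the tokens looked up in `vocabulary_` come from `nltk.word_tokenize(clean(...))` without lemmatization, so some lookups may fail and be skipped silently by the `except` branch.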