Python: averaging word embeddings weighted by TF-IDF scores

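I have been developing a Python script that classifies whether an article is related to a body text. For this I use machine learning (an SVM classifier) with several features, one of which is the average of the word embeddings.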

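The code that computes the cosine similarity between the averaged word embeddings of each article and its body text is as follows: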

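# Imports inferred from the names used in the code below
import gensim
import nltk
import numpy as np
from scipy.spatial import distance
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# `index2word` is the gensim < 4.0 name; gensim >= 4.0 calls it `index_to_key`
setm = set(word2vec_model.index2word)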

def avg_feature_vector(words, model, num_features, index2word_set):
    # function to average all word vectors in a given paragraph
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set and word not in stop:
            try:
                featureVec = np.add(featureVec, model[word])
                nwords = nwords + 1
            except:
                pass
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry(headlines, bodies):
    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        headline_avg_vector = avg_feature_vector(lemmatize_str(clean(headline)).split(), word2vec_model, 300, setm)
        body_avg_vector = avg_feature_vector(lemmatize_str(clean(body)).split(), word2vec_model, 300, setm)
        similarity =  1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)
    return X, docs
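The averaged word2vec computation seems to be correct. However, it scores worse than the TF-IDF cosine similarity alone. My idea was therefore to combine the two features by multiplying each word's word2vec vector by that word's TF-IDF score.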
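
Written as a formula, this idea is usually a weighted mean of the word vectors. The notation is introduced here for illustration only and does not appear in the code: e_t is the word2vec vector of token t, w_{t,d} is the TF-IDF score of t in document d, and h and b stand for a headline and a body.

```latex
\bar{v}_d = \frac{\sum_{t \in d} w_{t,d}\, e_t}{\sum_{t \in d} w_{t,d}},
\qquad
\operatorname{sim}(h, b) =
  \frac{\bar{v}_h \cdot \bar{v}_b}{\lVert \bar{v}_h \rVert \, \lVert \bar{v}_b \rVert}
```

Cosine similarity does not change when a vector is multiplied by a positive constant, so dividing the sum by the number of words or by the sum of the weights gives the same final score. Only the relative weights inside the sum matter.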

Here is my code:

def avg_feature_vector(words, model, num_features, index2word_set, tfidf_vec, vec_repr, pos):
    # function to average all word vectors in a given paragraph (with tfidf feature)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0

    for word in words:
        if word in index2word_set and word not in stop:
            try:
                a = tfidf_vec.vocabulary_[word]
                featureVec = np.add(featureVec, model[word]) * vec_repr[pos, a]
                nwords = nwords + 1
            except:
                pass
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry_with_tfidf(headlines, bodies):

    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        docs.append(lemmatize_str(clean(headline)))
        docs.append(lemmatize_str(clean(body)))
    vectorizer = TfidfVectorizer(norm='l2',min_df=0, use_idf=True, smooth_idf=True, stop_words=stop, sublinear_tf=True)
    sklearn_representation = vectorizer.fit_transform(docs)

    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        a = (clean(headline))
        headline_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i)
        a = (clean(body))
        body_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i+1)

        similarity =  1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)

    return X, docs
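My problem is that this method gives poor results, and I can't tell whether there is a logical explanation for that (in theory it should do better) or whether there is a bug in my code.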

Can anyone help me figure this out? I am also open to other approaches to the problem.


Note: some of the functions used here are not shown because I didn't think they were necessary. If anything is unclear, I'm happy to explain it further.

Do you want to compute a TF-IDF-weighted value for the paragraph?
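
One thing worth checking in the posted code is the update line `featureVec = np.add(featureVec, model[word]) * vec_repr[pos, a]`. The multiplication is applied to the whole running sum, not just to the current word's vector, so every earlier word is rescaled again by each later word's TF-IDF score. The sketch below contrasts that update with a weighted mean that scales each vector before adding it. It is a minimal illustration under stated assumptions: `toy_vectors` and `toy_weights` are invented dictionaries standing in for the word2vec model and for one row of the TF-IDF matrix, and the function names are hypothetical.

```python
import numpy as np


def accumulate_as_posted(words, vectors, weights, num_features):
    """Mirrors the posted update: (running sum + vector) * weight."""
    acc = np.zeros(num_features, dtype="float32")
    for word in words:
        if word in vectors and word in weights:
            # The weight multiplies everything accumulated so far.
            acc = np.add(acc, vectors[word]) * weights[word]
    return acc


def tfidf_weighted_average(words, vectors, weights, num_features):
    """Weighted mean of word vectors; words missing from either lookup are skipped."""
    total = np.zeros(num_features, dtype="float32")
    weight_sum = 0.0
    for word in words:
        if word in vectors and word in weights:
            w = weights[word]
            # Scale only this word's vector, then accumulate.
            total += w * np.asarray(vectors[word], dtype="float32")
            weight_sum += w
    if weight_sum > 0:
        total /= weight_sum
    return total


def cosine_similarity(u, v):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0


# Toy data, invented for illustration only.
toy_vectors = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
toy_weights = {"cat": 0.75, "dog": 0.25}
tokens = ["cat", "dog", "unknown"]

posted = accumulate_as_posted(tokens, toy_vectors, toy_weights, 2)
weighted = tfidf_weighted_average(tokens, toy_vectors, toy_weights, 2)
print("posted update:", posted)
print("weighted mean:", weighted)
```

With these toy values the posted update leaves `cat` with an effective weight of 0.1875 instead of 0.75, because its contribution is multiplied again by the weight of `dog`. In the real pipeline the weight would come from `vec_repr[pos, tfidf_vec.vocabulary_[word]]`. It may also be worth confirming that the tokens looked up in `vocabulary_` match what the vectorizer saw: the vectorizer is fitted on `lemmatize_str(clean(...))` output and `TfidfVectorizer` lowercases by default, while the lookup uses `nltk.word_tokenize(clean(...))` without lemmatization, and the bare `except` silently skips any token that is missing.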