Averaging word embeddings weighted by TF-IDF scores in Python
I have been developing a Python script that classifies whether an article headline is related to its body text. For this I use ML (an SVM classifier) with several features, one of which is based on the average of word embeddings.

The code that computes the similarity between the averaged word embeddings of a list of headlines and bodies is as follows:
import gensim
import numpy as np
from scipy.spatial import distance
from tqdm import tqdm

# stop, clean and lemmatize_str are helpers defined elsewhere (not shown)
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
setm = set(word2vec_model.index2word)

def avg_feature_vector(words, model, num_features, index2word_set):
    # function to average all word vectors in a given paragraph
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set and word not in stop:
            try:
                featureVec = np.add(featureVec, model[word])
                nwords = nwords+1
            except:
                pass
    if(nwords>0):
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry(headlines, bodies):
    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        headline_avg_vector = avg_feature_vector(lemmatize_str(clean(headline)).split(), word2vec_model, 300, setm)
        body_avg_vector = avg_feature_vector(lemmatize_str(clean(body)).split(), word2vec_model, 300, setm)
        similarity = 1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)
    return X, docs
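The averaged word2vec computation seems to be correct. However, it scores worse than the TF-IDF cosine similarity on its own. My idea was therefore to combine the two features, that is, to multiply each word's word2vec vector by that word's TF-IDF score.

Here is my code: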
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def avg_feature_vector(words, model, num_features, index2word_set, tfidf_vec, vec_repr, pos):
    # function to average all word vectors in a given paragraph (with tfidf feature)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set and word not in stop:
            try:
                a = tfidf_vec.vocabulary_[word]
                featureVec = np.add(featureVec, model[word]) * vec_repr[pos, a]
                nwords = nwords+1
            except:
                pass
    if(nwords>0):
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def doc_similatiry_with_tfidf(headlines, bodies):
    X = []
    docs = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        docs.append(lemmatize_str(clean(headline)))
        docs.append(lemmatize_str(clean(body)))
    vectorizer = TfidfVectorizer(norm='l2',min_df=0, use_idf=True, smooth_idf=True, stop_words=stop, sublinear_tf=True)
    sklearn_representation = vectorizer.fit_transform(docs)
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        a = (clean(headline))
        headline_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i)
        a = (clean(body))
        body_avg_vector = avg_feature_vector(nltk.word_tokenize(a), word2vec_model, 300, setm, vectorizer, sklearn_representation, 2*i+1)
        similarity = 1 - distance.cosine(headline_avg_vector, body_avg_vector)
        X.append(similarity)
    return X, docs
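My problem is that this method gives poor results. I don't know whether there is some logic that explains this (in theory it should perform better) or whether there is a bug in my code.

Can anyone help me figure this out? I am also open to new approaches to this problem.

Note: some functions used here are not shown, because I did not think they were necessary. If anything is unclear, I will explain it in more detail.

Comment: Do you want to compute a TF-IDF-weighted value for each paragraph?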
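If the goal is a TF-IDF-weighted average of word vectors, the following is a minimal sketch of that computation, not a drop-in fix. It uses plain Python lists and toy dictionaries so it runs without dependencies. The names `weighted_avg_vector`, `cosine_similarity`, `vectors` and `weights` are hypothetical: `vectors` stands in for the word2vec model and `weights` for one document's row of the TF-IDF matrix. The point it illustrates is operator grouping. Each word's vector is scaled by its own weight before being added to the running sum, whereas the expression `np.add(featureVec, model[word]) * vec_repr[pos, a]` in the posted code multiplies the whole running sum by each new weight.

```python
import math

def weighted_avg_vector(words, vectors, weights):
    """Return the weighted mean of the word vectors for `words`.

    vectors: dict word -> list of floats (toy stand-in for the word2vec model)
    weights: dict word -> TF-IDF score in this document (toy stand-in for
             one row of the TfidfVectorizer output)
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        # Skip unknown words explicitly rather than with a bare except,
        # so that mismatched tokens are easy to spot while debugging.
        if word not in vectors or word not in weights:
            continue
        w = weights[word]
        # Scale only this word's vector, then add it to the running sum.
        total = [t + w * x for t, x in zip(total, vectors[word])]
        weight_sum += w
    if weight_sum > 0:
        total = [t / weight_sum for t in total]
    return total

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy data: two known words and one token missing from both dictionaries.
vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
weights = {"cat": 0.75, "dog": 0.25}
doc_vec = weighted_avg_vector(["cat", "dog", "unknown"], vectors, weights)
print(doc_vec)  # [0.75, 0.25]
```

Dividing by the sum of the weights rather than by the word count only rescales the vector, so it does not change the cosine similarity. It may also be worth checking that the tokens passed to the averaging function match the TF-IDF vocabulary: in the posted code the vectorizer is fitted on lemmatized text while the lookup uses `nltk.word_tokenize` on unlemmatized text, and any resulting `KeyError` is silently swallowed by the bare `except`. Whether either point explains the poor scores is not something this sketch can confirm.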