NLP: How do I compute distances between text documents for k-means when using word2vec?

Tags: nlp, k-means, word2vec

I was recently introduced to word2vec, and I am having some trouble figuring out how exactly it can be used for k-means clustering.

I do understand how k-means works with tf-idf vectors: each text document is represented by a vector of tf-idf values, and after picking some documents as initial cluster centres, the Euclidean distance between document vectors is minimised. Here is an example.

However, when using word2vec, every word is represented as a vector. Does that mean each document corresponds to a matrix? And if so, how do you compute the minimum distance with respect to the other text documents?

Question: how do I compute distances between text documents for k-means when using word2vec?

Edit: to explain my confusion in more detail, consider the following code:

from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences_tfidf)
print(tfidf_matrix.toarray())

model = Word2Vec(sentences_word2vec, min_count=1)

# one row per vocabulary word (gensim 3.x idiom; in gensim 4+ use model.wv[model.wv.index_to_key])
word2vec_matrix = model.wv[model.wv.vocab]
print(len(word2vec_matrix))
for i in range(len(word2vec_matrix)):
    print(word2vec_matrix[i])  # was print(X[i]); X was undefined

It returns the following output:
[[ 0.          0.55459491  0.          0.          0.35399075  0.          0.
   0.          0.          0.          0.          0.          0.          0.437249
   0.35399075  0.35399075  0.35399075  0.        ]
 [ 0.          0.          0.          0.44302215  0.2827753   0.          0.
   0.          0.34928375  0.          0.          0.          0.34928375
   0.          0.2827753   0.5655506   0.2827753   0.        ]
 [ 0.          0.          0.35101741  0.          0.          0.27674616
   0.35101741  0.          0.          0.35101741  0.          0.35101741
   0.27674616  0.27674616  0.44809973  0.          0.          0.27674616]
 [ 0.40531999  0.          0.          0.          0.2587105   0.31955894
   0.          0.40531999  0.31955894  0.          0.40531999  0.          0.
   0.          0.          0.2587105   0.2587105   0.31955894]]
20
[  4.08335682e-03  -4.44161100e-03   3.92342824e-03   3.96498619e-03
   6.99949533e-06  -2.14108804e-04   1.20419310e-03  -1.29191438e-03
   1.64671184e-03   3.41688609e-03  -4.94929403e-03   2.90348311e-03
   4.23802016e-03  -3.01274913e-03  -7.36164337e-04   3.47558968e-03
  -7.02908786e-04   4.73567843e-03  -1.42914290e-03   3.17237526e-03
   9.36070050e-04  -2.23833631e-04  -4.03443904e-04   4.97530040e-04
  -4.82502300e-03   2.42140982e-03  -3.61089432e-03   3.37070058e-04
  -2.09900597e-03  -1.82093668e-03  -4.74618562e-03   2.41499138e-03
  -2.15628324e-03   3.43719614e-03   7.50159554e-04  -2.05973233e-03
   1.92534993e-03   1.96503079e-03  -2.02400610e-03   3.99564439e-03
   4.95056808e-03   1.47033704e-03  -2.80071306e-03   3.59585625e-04
  -2.77896033e-04  -3.21732066e-03   4.36303904e-03  -2.16396619e-03
   2.24438333e-03  -4.50925855e-03  -4.70488053e-03   6.30825118e-04
   3.81869613e-03   3.75767215e-03   5.01064525e-04   1.70175335e-03
  -1.26033701e-04  -7.43318116e-04  -6.74833194e-04  -4.76678275e-03
   1.53754558e-03   2.32421421e-03  -3.23472451e-03  -8.32759659e-04
   4.67014220e-03   5.15853462e-04  -1.15449808e-03  -1.63017167e-03
  -2.73897988e-03  -3.95627553e-03   4.04657237e-03  -1.79282576e-03
  -3.26930732e-03   2.85121426e-03  -2.33304151e-03  -2.01760884e-03
  -3.33597139e-03  -1.19233003e-03  -2.12347694e-03   4.36858647e-03
   2.00414215e-03  -4.23572073e-03   4.98410035e-03   1.79121632e-03
   4.81655030e-03   3.33247939e-03  -3.95260006e-03   1.19335402e-03
   4.61675343e-04   6.09758368e-04  -4.74696746e-03   4.91552567e-03
   1.74517138e-03   2.36604619e-03  -3.06009664e-04   3.62954312e-03
   3.56943789e-03   2.92139384e-03  -4.27138479e-03  -3.51175456e-03]
[ -4.14272398e-03   3.45513038e-03  -1.47538856e-04  -2.02292087e-03
  -2.96578306e-04   1.88684417e-03  -2.63865804e-03   2.69249966e-03
   4.57606697e-03   2.19206396e-03   2.01336667e-03   1.47434452e-03
   1.88332598e-03  -1.14452699e-03  -1.35678309e-03  -2.02636060e-04
  -3.26160830e-03  -3.95368552e-03   1.40415027e-03   2.30542314e-03
  -3.18884710e-03  -4.46776347e-03   3.96415358e-03  -2.07852037e-03
   4.98413946e-03  -6.43568579e-04  -2.53325375e-03   1.30117545e-03
   1.26555841e-03  -8.84680718e-04  -8.34991166e-04  -4.15050285e-03
   4.66807076e-04   1.71844949e-04   1.08140183e-03   4.37910948e-03
  -3.28412466e-03   2.09890743e-04   2.29888223e-03   4.70223464e-03
  -2.31004297e-03  -5.10134443e-04   2.57104915e-03  -2.55978899e-03
  -7.55646848e-04  -1.98197929e-04   1.20443532e-04   4.63618943e-03
   1.13036349e-05   8.16594984e-04  -1.65917678e-03   3.29331891e-03
  -4.97825304e-03  -2.03667139e-03   3.60272871e-03   7.44500838e-04
  -4.40325850e-04   6.38399797e-04  -4.23364760e-03  -4.56386572e-03
   4.77551389e-03   4.74880403e-03   7.06148741e-04  -1.24937459e-03
  -9.50689311e-04  -3.88551364e-03  -4.45985980e-03  -1.15060725e-03
   3.27067473e-03   4.54987818e-03   2.62327422e-03  -2.40981602e-03
   4.55576897e-04   3.19155119e-03  -3.84227419e-03  -1.17610034e-03
  -1.45622855e-03  -4.32460709e-03  -4.12792247e-03  -1.74557802e-03
   4.66075348e-04   3.39668151e-03  -4.00651991e-03   1.41077011e-03
  -7.89384532e-04  -6.56061340e-04   1.14822399e-03   4.12205653e-03
   3.60721885e-03  -3.11746349e-04   1.44255662e-03   3.11965472e-03
  -4.93455213e-03   4.80490318e-03   2.79991422e-03   4.93505970e-03
   3.69034940e-03   4.76422161e-03  -1.25827035e-03  -1.94680784e-03]
                                  ...

[ -3.92252317e-04  -3.66805331e-03   1.52376946e-03  -3.81564132e-05
  -2.57118000e-03  -4.46725264e-03   2.36480637e-03  -4.70252614e-03
  -4.18651942e-03   4.54758806e-03   4.38804098e-04   1.28351408e-03
   3.40470579e-03   1.00038981e-03  -1.06557179e-03   4.67202952e-03
   4.50591929e-03  -2.67829909e-03   2.57702312e-03  -3.65824508e-03
  -4.54068230e-03   2.20785337e-03  -1.00554363e-03   5.14690124e-04
   4.64830594e-03   1.91410910e-03  -4.83837258e-03   6.73376708e-05
  -2.37796479e-03  -4.45193471e-03  -2.60163331e-03   1.51159777e-03
   4.06868104e-03   2.55690538e-04  -2.54662265e-03   2.64597777e-03
  -2.62586889e-03  -2.71554058e-03   5.49281889e-04  -1.38776843e-03
  -2.94354092e-03  -1.13887887e-03   4.59292997e-03  -1.02300232e-03
   2.27600057e-03  -4.88117011e-03   1.95790920e-03   4.64376673e-04
   2.56658648e-03   8.90390365e-04  -1.40368659e-03  -6.40658545e-04
  -3.53228673e-03  -1.30717538e-03  -1.80223631e-03   2.94505036e-03
  -4.82233381e-03  -2.16079340e-03   2.58940039e-03   1.60595961e-03
  -1.22245611e-03  -6.72614493e-04   4.47060820e-03  -4.95934719e-03
   2.70283176e-03   2.93257344e-03   2.13279200e-04   2.59435410e-03
   2.98801321e-03  -2.79974379e-03  -1.49789048e-04  -2.53924704e-03
  -7.83207070e-04   1.18357304e-03  -1.27669750e-03  -4.16665291e-03
   1.40916929e-03   1.63017987e-07   1.36708119e-03  -1.26687710e-05
   1.24729215e-03  -2.50442210e-03  -3.20308795e-03  -1.41550787e-03
  -1.05747324e-03  -3.97984264e-03   2.25877413e-03  -1.28316227e-03
   3.60359484e-03  -1.97929185e-04   3.21712159e-03  -4.96298913e-03
  -1.83640339e-03  -9.90608009e-04  -2.03964626e-03  -4.87274351e-03
   7.24950165e-04   3.85614252e-03  -4.18979349e-03   2.73840013e-03]
With tf-idf, k-means would be run by the lines:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)
kmeans.fit(tfidf_matrix)
With word2vec, k-means would be run by the lines:

kmeans = KMeans(n_clusters=5)
kmeans.fit(word2vec_matrix)

(Here is an example of k-means with word2vec.) So in the first case, k-means receives a matrix holding the tf-idf value of every word in every document, while in the second case it receives one vector per word. In the second case, how can k-means cluster the documents if all it has is word2vec representations of individual words?

Since you are interested in clustering documents, you may be better off using Doc2Vec, which prepares one vector per document; you can then apply any clustering algorithm to the set of document vectors for further processing. If, for whatever reason, you want to work with word vectors instead, there are a few things you can do. Here is a very simple approach (a runnable sketch follows after the list below):

  • For each document, collect the words with the highest tf-idf values in that document.
  • Average the Word2Vec vectors of those words to form a single vector for the whole document.
  • Apply clustering to the averaged vectors.

  • Do not try to average all of the words in a document; that will not work.
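
A minimal sketch of this recipe, with a toy corpus standing in for the question's sentences_tfidf / sentences_word2vec; the top-5 cutoff, the cluster count, and the corpus itself are arbitrary illustrative choices (scikit-learn >= 1.0 and gensim are assumed):

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus; stands in for the question's own documents
docs = ["the cat sat on the mat",
        "dogs chase cats in the yard",
        "stock markets fell sharply today",
        "investors worry about falling stock markets"]
tokenized = [d.split() for d in docs]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)
vocab = np.array(vectorizer.get_feature_names_out())  # get_feature_names() before sklearn 1.0

model = Word2Vec(tokenized, min_count=1)

doc_vectors = []
for row in tfidf_matrix.toarray():
    # step 1: the words with the highest tf-idf values in this document (top 5 is arbitrary)
    top_words = vocab[np.argsort(row)[::-1][:5]]
    # step 2: average the word2vec vectors of those words into one document vector
    vecs = [model.wv[w] for w in top_words if w in model.wv]
    doc_vectors.append(np.mean(vecs, axis=0))

# step 3: cluster the averaged document vectors
kmeans = KMeans(n_clusters=2)
kmeans.fit(np.array(doc_vectors))
print(kmeans.labels_)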

Comments:

  • Should this question be asked on stats.stackexchange instead?
  • Thanks for your answer! So you mean the k-means algorithm assigns a vector to each document by averaging the vectors of the words it contains (excluding stopwords and after stemming)?
  • No, I mean you should prepare the document vectors yourself using the procedure I gave above, and then apply k-means to those document vectors. As far as I understand, your goal is to cluster documents, isn't it?
  • I see, my apologies! Yes, that's exactly it! Do you have any reference on using tf-idf values as a threshold?
  • Regarding the best value for the threshold, you can experiment a little. The more distinctive the words you keep for each document (i.e. the higher the threshold), the better the results tend to be. On the other hand, make sure each document still averages at least two or three words (I am not sure how long your documents are).
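
For completeness, a minimal sketch of the Doc2Vec route suggested at the top of this answer, reusing the same toy corpus; vector_size, epochs, and the cluster count are arbitrary choices (gensim 4 API; on gensim < 4, model.docvecs replaces model.dv):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat",
        "dogs chase cats in the yard",
        "stock markets fell sharply today",
        "investors worry about falling stock markets"]

# tag each document with its index so its vector can be looked up after training
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# one trained vector per document; no averaging of word vectors needed
doc_vectors = [model.dv[i] for i in range(len(docs))]

kmeans = KMeans(n_clusters=2)
kmeans.fit(doc_vectors)
print(kmeans.labels_)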