Python 了解Word2Vec变换方法的输出_Python_Apache Spark_Pyspark_Apache Spark Mllib_Apache Spark Ml

Python 了解Word2Vec变换方法的输出

python apache-spark pyspark

Python 了解Word2Vec变换方法的输出,python,apache-spark,pyspark,apache-spark-mllib,apache-spark-ml,Python,Apache Spark,Pyspark,Apache Spark Mllib,Apache Spark Ml,我试图理解Spark的word2vec算法的输出我有一个文本列的数据框，我标记了它，所以现在我将每个文本作为一列中的单词列表 +--------------------+ | tokenised_text| +--------------------+ |[if, you, hate, d...| |[rampant, teen, s...| |[the, united, sta...| |[reuters, health,...| |[brussels, (reute...| +-

我试图理解Spark的word2vec算法的输出

我有一个文本列的数据框，我标记了它，所以现在我将每个文本作为一列中的单词列表

+--------------------+
|      tokenised_text|
+--------------------+
|[if, you, hate, d...|
|[rampant, teen, s...|
|[the, united, sta...|
|[reuters, health,...|
|[brussels, (reute...|
+--------------------+

现在，我运行word2vec算法，如下所示：

  word2Vec = Word2Vec(vectorSize=100, seed=42, inputCol="tokenised_text", outputCol="model")
  w2vmodel = word2Vec.fit(tokensDf)
  w2vdf=w2vmodel.transform(tokensDf)

现在，当我用word2vec模型对象转换这个原始数据帧时，我得到了另一个添加到数据帧的列，它有100维向量。数据框的第一行在下面

[Row(tokenised_text=[u'if', u'you', u'hate', u'dealing', u'with', u'bank', u'tellers', u'or', u'customer', u'service', u'representatives,', u'then', u'the', u'royal', u'bank', u'of', u'scotland', u'might', u'have', u'a', u'solution', u'for', u'you.if', u'this', u'program', u'is', u'successful,', u'it', u'could', u'be', u'a', u'big', u'step', u'forward', u'on', u'the', u'road', u'to', u'automated', u'customer', u'service', u'through', u'the', u'use', u'of', u'ai,', u'notes', u'laurie', u'beaver,', u'research', u'associate', u'for', u'bi', u'intelligence,', u'business', u"insider's", u'premium', u'research', u"service.it's", u'noteworthy', u'that', u'luvo', u'does', u'not', u'operate', u'via', u'a', u'third-party', u'app', u'such', u'as', u'facebook', u'messenger,', u'wechat,', u'or', u'kik,', u'all', u'of', u'which', u'are', u'currently', u'trying', u'to', u'create', u'bots', u'that', u'would', u'assist', u'in', u'customer', u'service', u'within', u'their', u'respective', u'platforms.luvo', u'would', u'be', u'available', u'through', u'the', u'web', u'and', u'through', u'smartphones.', u'it', u'would', u'also', u'use', u'machine', u'learning', u'to', u'learn', u'from', u'its', u'mistakes,', u'which', u'should', u'ultimately', u'help', u'with', u'its', u'response', u'accuracy.down', u'the', u'road,', u'luvo', u'would', u'become', u'a', u'supplement', u'to', u'the', u'human', u'staff.', u'it', u'can', u'currently', u'answer', u'20', u'set', u'questions', u'but', u'as', u'that', u'number', u'grows,', u'it', u'would', u'allow', u'the', u'human', u'employees', u'to', u'more', u'complicated', u'issues.', u'if', u'a', u'problem', u'is', u'beyond', u"luvo's", u'comprehension,', u'then', u'it', u'would', u'refer', u'the', u'customer', u'to', u'a', u'bank', u'employee;', u'however,\xa0a', u'user', u'could', u'choose', u'to', u'speak', u'with', u'a', u'human', u'instead', u'of', u'luvo', u'anyway.ai', u'such', u'as', u'luvo,', u'if', u'successful,', u'could', u'help', u'businesses', u'become', u'more', u'efficient', u'and', u'increase', u'their', u'productivity,', u'while', u'simultaneously', u'improving', u'customer', u'service', u'capacity,', u'which', u'would', u'consequently\xa0save', u'money', u'that', u'would', u'otherwise', u'go', u'toward', u'manpower.and', u'this', u'trend', u'is', u'already', u'starting.', u'google,', u'microsoft,', u'and', u'ibm', u'are', u'investing', u'significantly', u'into', u'ai', u'research.', u'furthermore,', u'the', u'global', u'ai', u'market', u'is', u'estimated', u'to', u'grow', u'from', u'approximately', u'$420', u'million', u'in', u'2014', u'to', u'$5.05', u'billion', u'in', u'2020,', u'according', u'to', u'a', u'forecast', u'by', u'research', u'and', u'markets.\xa0the', u'move', u'toward', u'ai', u'would', u'be', u'just', u'one', u'more', u'way', u'in', u'which', u'the', u'digital', u'age', u'is', u'disrupting', u'retail', u'banking.', u'customers,', u'particularly', u'millennials,', u'are', u'increasingly', u'moving', u'toward', u'digital', u'banking,', u'and', u'as', u'a', u'result,', u"they're", u'walking', u'into', u'their', u"banks'", u'traditional', u'brick-and-mortar', u'branches', u'less', u'often', u'than', u'ever', u'before.'], model=DenseVector([-0.036, 0.0759, 0.196, 0.0379, 0.0331, 0.069, -0.1531, 0.0588, -0.1662, -0.0624, -0.0924, -0.0304, 0.0155, -0.0245, -0.0507, 0.0809, 0.0199, -0.0364, 0.0703, 0.0469, 0.0768, -0.0214, 0.0404, 0.0522, -0.0506, 0.0095, 0.1129, 0.0515, -0.0867, 0.0224, -0.0499, 0.0848, 0.1583, -0.0882, -0.0262, -0.0083, -0.0019, -0.0172, 0.0554, 0.0478, -0.0328, 0.1219, 0.0153, -0.1409, -0.0262, 0.0829, -0.1318, -0.0952, -0.1854, 0.0837, 0.0084, -0.0004, 0.0172, 0.0073, 0.1217, 0.0137, -0.0735, -0.0481, -0.0223, -0.0708, -0.0617, -0.0049, -0.0069, -0.0211, 0.0615, -0.0919, 0.0509, 0.0871, -0.0278, -0.0295, -0.2326, -0.0931, -0.1146, 0.0371, -0.0024, 0.0294, -0.0177, 0.0384, 0.019, 0.0767, -0.0922, -0.0418, 0.0005, 0.0221, -0.0624, 0.0149, -0.0496, -0.0434, 0.1202, -0.0305, 0.1478, -0.0385, -0.0342, 0.0798, 0.0302, -0.013, 0.0923, -0.0287, -0.0976, -0.0634]))]

然而，我无法理解这个输出。我的token列有多个单词token。在word2vec中，每个单词都表示为100维的向量。所以理想情况下，我会有多个这样的100维向量，对应于令牌列表的每个字。但我们只得到一个100维向量。那就相当于一个这样的词。那么每行数据帧的令牌列表中的其他单词呢

我确信我在这里遗漏了一些东西，但是Spark文档写得非常糟糕，文档中没有一个方法非常有助于理解

来自：

将句子列转换为向量列以表示整个句子。变换是通过平均它包含的所有字向量来执行的

关于本声明：

理想情况下，我会有多个这样的100维向量

您必须记住，此

转换

文档，因此输出应该是单个向量。有许多更复杂的技术可以组合单词嵌入，如果您想使用这些技术，可以轻松地从

mllib

model（）提取映射