Python "Normalize" a DataFrame of sentences into a larger DataFrame of words

python apache-spark dataframe pyspark apache-spark-sql

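Using Python and Spark:

Suppose I have a DataFrame whose rows contain sentences. How can I normalize it (in the DBMS sense) into another DataFrame in which each row holds a single word split out of a sentence?

I think this is the most important part.

For example, suppose df_sentences looks like this: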

[Row(sentence_id=1, sentence=u'the dog ran the fastest.'),
 Row(sentence_id=2, sentence=u'the cat sat down.')]

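I'm looking to transform df_sentences into df_words, which would take these two rows and build a larger (by row count) DataFrame like the one below. Note that sentence_id is carried over into the new table: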

[Row(sentence_id=1, word=u'the'),
 Row(sentence_id=1, word=u'the'),
 Row(sentence_id=1, word=u'fastest'),
 Row(sentence_id=1, word=u'dog'),
 Row(sentence_id=1, word=u'ran'),
 Row(sentence_id=2, word=u'cat'),
 ...clip...]

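For now I'm not interested in row counts or unique words. I want this shape because I plan to join on sentence_id with other RDDs to pull in other interesting data I have stored elsewhere.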

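I suspect a lot of Spark's functionality is built around these intermediate transformations in a pipeline, so I'd like to understand the best way to do this and start collecting my own code snippets.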

It's actually quite simple. Let's start by creating a DataFrame:
from pyspark.sql import Row

# Assumes an existing SparkContext named sc, as in the pyspark shell
df = sc.parallelize([
    Row(sentence_id=1, sentence=u'the dog ran the fastest.'),
    Row(sentence_id=2, sentence=u'the cat sat down.')
]).toDF()

Next, we need a tokenizer:
from pyspark.ml.feature import RegexTokenizer

tokenizer = RegexTokenizer(
    inputCol="sentence", outputCol="words", pattern="\\W+")
tokenized = tokenizer.transform(df)

Finally, we drop the sentence column and explode the words:
from pyspark.sql.functions import explode, col

transformed = (tokenized
    .drop("sentence")
    .select(col("sentence_id"), explode(col("words")).alias("word")))

The result is:
transformed.show()

## +-----------+-------+
## |sentence_id|   word|
## +-----------+-------+
## |          1|    the|
## |          1|    dog|
## |          1|    ran|
## |          1|    the|
## |          1|fastest|
## |          2|    the|
## |          2|    cat|
## |          2|    sat|
## |          2|   down|
## +-----------+-------+

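Notes:

Depending on the data, explode can be quite expensive because it duplicates the other columns. Before applying explode, make sure you apply every filter you can, for example using StopWordsRemover.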


Following the documentation, I think you can use flatMap to get a new RDD and then create a new DataFrame from it.