Apache Spark PySpark: converting a PipelinedRDD to a Spark DataFrame
I am using Spark 2.3.1 and doing NLP in Spark. When I print the RDD's type it shows PipelinedRDD. Running the
rdd.collect()
command on the PipelinedRDD, its output is
embodiment present invention includes pairing two wireless devices placing least one two device pairing mode performing least one pairing motion event least one wireless device satisfying least one pairing condition detecting satisfying least one pairing condition pairing two wireless devices responsive detecting satisfying least one pairing condition many aspects provided present invention relates wireless communication system specifically present invention relates method transmitting control information pucch wireless communication system device comprising step obtaining multiple second modulation symbol streams corresponding multiple scfdma single carrier frequency division multiplexing symbols multiplexing spreading multiple first modulation symbol streams forming first modulation symbol stream corresponding scfdma symbols first slot obtaining multiple complex symbol streams performing dft discrete fourier transform precoding processing multiple second modulation symbol streams transmitting multiple complex symbol streams pucch wherein multiple second modulation symbol streams scrambled scfdma symbol level dog church aardwolf abacus']
I want to create a DataFrame like this, with each word added as a row of the DataFrame:
+--------------+
| text |
+--------------+
| embodiment |
| present |
| invention |
....
....
| aardwolf |
| abacus |
+--------------+
Here is my code:
import pyspark
import nltk
import string
from pyspark import SparkContext
from nltk.stem import WordNetLemmatizer
from pyspark.ml.feature import NGram
from pyspark.sql.types import ArrayType,StructType,StructField,StringType
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.appName('Spark Example').getOrCreate()
Source_path="Folder_of_multiple_text_file"
data=sc.textFile(Source_path)
lower_casetext = data.map(lambda x:x.lower())
# splitting_rdd = lower_casetext.map(lambda x:x.split(" "))
# print(splitting_rdd.collect())
# Function to perform Sentence tokeniaztion
def sent_TokenizeFunct(x):
    return nltk.sent_tokenize(x)
sentencetokenization_rdd = lower_casetext.map(sent_TokenizeFunct)
# Function to perform Word tokenization
def word_TokenizeFunct(x):
    splitted = [word for line in x for word in line.split()]
    return splitted
wordtokenization_rdd = sentencetokenization_rdd.map(word_TokenizeFunct)
# Remove Stop Words
def removeStopWordsFunct(x):
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    filteredSentence = [w for w in x if w not in stop_words]
    return filteredSentence
stopwordRDD = wordtokenization_rdd.map(removeStopWordsFunct)
# Remove Punctuation marks
def removePunctuationsFunct(x):
    list_punct = list(string.punctuation)
    filtered = [''.join(c for c in s if c not in list_punct) for s in x]
    filtered_space = [s for s in filtered if s]  # drop empty strings
    return filtered_space
rmvPunctRDD = stopwordRDD.map(removePunctuationsFunct)
# Perform Lemmatization
def lemma(x):
    lemmatizer = WordNetLemmatizer()
    final_rdd = [lemmatizer.lemmatize(s) for s in x]
    return final_rdd
lem_wordsRDD = rmvPunctRDD.map(lemma)
# Join tokens
def joinTokensFunct(x):
    return " ".join(x)
joinedTokensRDD = lem_wordsRDD.map(joinTokensFunct)
print(joinedTokensRDD.collect())
print(type(joinedTokensRDD))
Something like this, but adjust it accordingly:
data = [('Category A', 100, "This is category A"),
('Category B', 120, "This is category B"),
('Category C', 150, "This is category C")]
rdd = spark.sparkContext.parallelize(data)
rdd.collect()
# generate a pipelined RDD with some dummy logic
rdd = rdd.filter(lambda x: x[2] == x[2])
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])
df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()
I have changed the code by removing the join-tokens step and converting lem_wordsRDD directly into a DataFrame with the following code:
from pyspark.sql.functions import explode
df = lem_wordsRDD.map(lambda x: (x, )).toDF(["features"])
explod_df = df.withColumn("values", explode("features"))
tokenized_df = explod_df.select("values")
tokenized_df.show()
I have added the code that causes the problem; please keep it in mind while applying your solution, for example:
schema = StructType([StructField('text', StringType(), True)])
df = spark.createDataFrame(joinedTokensRDD, schema)
print(df.schema)
df.show()
I ran into an error like TypeError: StructType can not accept object 'file_text' in type, which you'd need to adjust for. Your example is one I can work with, but I can't understand the error. What does TypeError: StructType can not accept object 'file_text' in type mean? Sorry, I just can't make sense of it.