Converting a text file with a specific format into a DataFrame in Spark using Scala
I am trying to convert a conversation into a DataFrame in Spark using Scala. A person and their message are separated by a tab. Each message of the conversation is on a new line. The text file looks like this:
alpha hello,beta! how are you?
beta I am fine alpha.How about you?
alpha I am also doing fine...
alpha Actually, beta, I am bit busy nowadays and sorry I hadn't call U
I need a DataFrame like this:

------------------------------------
|Person | Message
------------------------------------
|1      | hello,beta! how are you?
|2      | I am fine alpha.How about you?
|1      | I am also doing fine...
|1      | Actually, beta, I am bit busy nowadays and sorry I hadn't call
------------------------------------
You can read the text file and parse it as you go. For example:
import sparkSession.implicits._ // needed for the Dataset encoders

val result: Dataset[(String, String)] = sparkSession.read.textFile("filePath").flatMap {
  line =>
    val str = line.split("\t")
    if (str.length == 2) {
      Some((str(0), str(1)))
    } else {
      // in case you want to ignore malformed lines
      None
    }
}
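The desired output shows Person as 1 and 2 rather than alpha and beta. One way to get there is to assign each speaker a numeric ID in order of first appearance. A minimal plain-Scala sketch of that assignment (the helper name toSpeakerIds is my own; in a real Spark job you would instead join against the distinct persons, e.g. via zipWithIndex, since local mutable state does not distribute):

```scala
import scala.collection.mutable

// Assign a numeric ID to each speaker in order of first appearance:
// the first distinct person becomes 1, the second 2, and so on.
def toSpeakerIds(pairs: Seq[(String, String)]): Seq[(Int, String)] = {
  val ids = mutable.LinkedHashMap[String, Int]()
  pairs.map { case (person, message) =>
    val id = ids.getOrElseUpdate(person, ids.size + 1)
    (id, message)
  }
}
```

For example, `toSpeakerIds(Seq(("alpha", "hi"), ("beta", "yo"), ("alpha", "ok")))` yields `Seq((1, "hi"), (2, "yo"), (1, "ok"))`.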
First, I created a text file with the data you provided and put it in an HDFS location under temp/data.txt. data.txt:
alpha hello,beta! how are you?
beta I am fine alpha.How about you?
alpha I am also doing fine...
alpha Actually, beta, I am bit busy nowadays and sorry I hadn't call U
Then I created a case class, read the file in, and processed it into a DataFrame:
case class PersonMessage(Person: String, Message: String)
val df = sc.textFile("temp/data.txt").map(x => {
  val splits = x.split("\t")
  PersonMessage(splits(0), splits(1))
}).toDF("Person", "Message")
df.show
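One detail worth noting about the split step both answers rely on: split("\t") would also split on any tab that happens to appear inside the message itself. Passing a limit of 2 keeps the rest of the line intact. A standalone sketch (parseLine is my own name, not from either answer), returning None for malformed lines as the first answer does:

```scala
// Split one conversation line on the first tab only, so tabs
// inside the message are preserved; None for malformed lines.
def parseLine(line: String): Option[(String, String)] = {
  line.split("\t", 2) match {
    case Array(person, message) => Some((person, message))
    case _                      => None
  }
}
```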
Could you share your code? I am actually a beginner in Scala and have only just made some progress on this. I am currently learning complex map functions, just like in this question: `val text = sc.textFile("hdfs://localhost:9000/Conversation").map(x => x.split("\n"))` `val text2 = text.foreach(x => x.map(y => y.split(" ")))`
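Two things go wrong in the comment's snippet: textFile already yields one record per line, so splitting on "\n" again is unnecessary, and foreach returns Unit, so text2 discards the result. A plain-Scala sketch of the corrected shape, using a Seq of lines as a stand-in for the RDD that sc.textFile would produce:

```scala
// Stand-in for sc.textFile: the file is already one line per record,
// so there is no need to split on "\n" again.
val lines = Seq(
  "alpha\thello,beta! how are you?",
  "beta\tI am fine alpha.How about you?"
)

// Use map (not foreach, which returns Unit) to keep the split fields.
val fields = lines.map(_.split("\t").toList)
```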