Scala: how to split a string by "|" (pipe) and convert an RDD to a DataFrame


I am trying to read a text file that contains product information and is pipe (|) delimited. When I read the data as an RDD and then split it on the delimiter |, the data gets corrupted. I cannot understand why this happens.

#### Input data
productId | price | saleEvent | rivalName | fetchTS
12345 | 78.73 | Special | VistaCart.com | 2017-05-11 15:39:30
12345 | 45.52 | Regular | ShopYourWay.com | 2017-05-11 16:09:43
12345 | 89.52 | Sale | MarketPlace.com | 2017-05-11 16:07:29
678 | 1348.73 | Regular | VistaCart.com | 2017-05-11 15:58:06
678 | 1348.73 | Special | ShopYourWay.com | 2017-05-11 15:44:22
678 | 1232.29 | Daily | MarketPlace.com | 2017-05-11 15:53:03
777 | 908.57 | Daily | VistaCart.com | 2017-05-11 15:39:01
#### Spark shell code
import org.apache.spark.sql.{Encoder, Encoders}
import spark.implicits._

case class Product(productId:Int, price:Double, saleEvent:String, rivalName:String, fetchTS:String)
val rdd = spark.sparkContext.textFile("/home/prabhat/Documents/Spark/sampledata/competitor_data_10.txt")
// remove the header row
val x = rdd.mapPartitionsWithIndex{(idx,iter) => if(idx==0)iter.drop(1) else iter}
// why is RDD x comma-separated here?
x.map(x=>x.split("|")).take(10)
res74: Array[Array[String]] = Array(Array(1, 2, 3, 4, 5, |, 7, 8, ., 7, 3, |, S, p, e, c, i, a, l, |, V, i, s, t, a, C, a, r, t, ., c, o, m, |, 2, 0, 1, 7, -, 0, 5, -, 1, 1, " ", 1, 5, :, 3, 9, :, 3, 0), Array(1, 2, 3, 4, 5, |, 4, 5, ., 5, 2, |, R, e, g, u, l, a, r, |, S, h, o, p, Y, o, u, r, W, a, y, ., c, o, m, |, ...), ...
x.map(x => x.split("|")).map(y => Product(y(0).toInt, y(1).toDouble, y(2), y(3), y(4))).toDF.show
+---------+-----+---------+---------+-------+
|productId|price|saleEvent|rivalName|fetchTS|
+---------+-----+---------+---------+-------+
|        1|  2.0|        3|        4|      5|
|        1|  2.0|        3|        4|      5|
|        1|  2.0|        3|        4|      5|
|        4|  3.0|        1|        5|      7|
|        4|  3.0|        1|        5|      7|
|        4|  3.0|        1|        5|      7|
|        3|  6.0|        1|        3|      0|
|        3|  6.0|        1|        3|      0|
+---------+-----+---------+---------+-------+
Why does the output look like the above? It should look something like this:

+---------+-----+---------+---------+-------+
|productId|price|saleEvent|rivalName|fetchTS|
+---------+-----+---------+---------+-------+
|12345 | 78.73 | Special | VistaCart.com | 2017-05-11 15:39:30|
+---------+-----+---------+---------+-------+

split takes a regular expression, so use "\\|" instead of "|".

That will give you the correct result.
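For example, a minimal sketch of the corrected pipeline, reusing the Product case class and the header-stripped RDD x from the question (the escaped pipe is the only real change; the trims are a defensive assumption in case the file has spaces around the delimiter):

// "\\|" escapes the pipe so it is a literal character, not an empty regex alternation
val products = x
  .map(_.split("\\|").map(_.trim))
  .map(y => Product(y(0).toInt, y(1).toDouble, y(2), y(3), y(4)))
  .toDF()

products.show(false)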

Also, if you want a DataFrame at the end anyway, why not read it directly as:

spark.read
  .option("header", true)
  .option("delimiter", "|")
  .schema(Encoders.product[Product].schema)
  .csv("testfile.txt")
  .as[Product]
Output:

+---------+-------+---------+---------------+--------------------+
|productId|price  |saleEvent|rivalName      |fetchTS             |
+---------+-------+---------+---------------+--------------------+
|12345    |78.73  |Special  |VistaCart.com  |2017-05-11 15:39:30 |
|12345    |45.52  |Regular  |ShopYourWay.com|2017-05-11 16:09:43 |
|12345    |89.52  |Sale     |MarketPlace.com|2017-05-11 16:07:29 |
|678      |1348.73|Regular  |VistaCart.com  |2017-05-11 15:58:06 |
|678      |1348.73|Special  |ShopYourWay.com|2017-05-11 15:44:22 |
|678      |1232.29|Daily    |MarketPlace.com|2017-05-11 15:53:03 |
|777      |908.57 |Daily    |VistaCart.com  |2017-05-11 15:39:01 |
+---------+-------+---------+---------------+--------------------+

Comments:

Why not read it directly into a DataFrame using sqlContext?

Thanks for the quick response. I did read it using sqlContext and it works, but I wanted to understand where I was going wrong when working with the RDD.

Agreed @user8371915, my mistake was that I did not search properly. Thanks for highlighting it.

Thanks for the quick response, it works. Awesome. I also get an error when mapping the RDD to the Product object once fetchTS is declared as a Timestamp: case class Product(productId: Int, price: Double, saleEvent: String, rivalName: String, fetchTS: Timestamp), then x.map(x => x.split("\\|")).map(y => Product(y(0).toInt, y(1).toDouble, y(2), y(3), y(4))) fails with ":34: error: type mismatch; found: String, required: java.sql.Timestamp". I am googling how to convert to a Timestamp.

You cannot cast the string to a Timestamp directly; you need to parse it and convert it.

OK, thanks for the reply.
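The last comments mention a type mismatch once fetchTS is declared as java.sql.Timestamp. A minimal sketch of one way to handle it, assuming the yyyy-MM-dd HH:mm:ss format shown in the data (which java.sql.Timestamp.valueOf accepts directly):

import java.sql.Timestamp

case class Product(productId: Int, price: Double, saleEvent: String,
                   rivalName: String, fetchTS: Timestamp)

// parse the last column instead of passing the raw string through
val typed = x
  .map(_.split("\\|").map(_.trim))
  .map(y => Product(y(0).toInt, y(1).toDouble, y(2), y(3), Timestamp.valueOf(y(4))))
  .toDF()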