Scala: how to split a string by "|" (pipe) and convert an RDD to a DataFrame


I am trying to read a text file that contains product information and is pipe (|) delimited. When I read the data as an RDD and then split it on the delimiter |, the data gets corrupted. I cannot understand why this happens.

#### Input data
productId | price | saleEvent | rivalName | fetchTS
12345 | 78.73 | Special | VistaCart.com | 2017-05-11 15:39:30
12345 | 45.52 | Regular | ShopYourWay.com | 2017-05-11 16:09:43
12345 | 89.52 | Sale | MarketPlace.com | 2017-05-11 16:07:29
678 | 1348.73 | Regular | VistaCart.com | 2017-05-11 15:58:06
678 | 1348.73 | Special | ShopYourWay.com | 2017-05-11 15:44:22
678 | 1232.29 | Daily | MarketPlace.com | 2017-05-11 15:53:03
777 | 908.57 | Daily | VistaCart.com | 2017-05-11 15:39:01
#### Spark shell code
import org.apache.spark.sql.{Encoder, Encoders}
import spark.implicits._

case class Product(productId:Int, price:Double, saleEvent:String, rivalName:String, fetchTS:String)
val rdd = spark.sparkContext.textFile("/home/prabhat/Documents/Spark/sampledata/competitor_data_10.txt")
// remove the header row
val x = rdd.mapPartitionsWithIndex{(idx,iter) => if(idx==0)iter.drop(1) else iter}
// why is RDD x comma-separated here?
x.map(x=>x.split("|")).take(10)
res74: Array[Array[String]] = Array(Array(1, 2, 3, 4, 5, |, 7, 8, ., 7, 3, |, S, p, e, c, i, a, l, |, V, i, s, t, a, C, a, r, t, ., c, o, m, |, 2, 0, 1, 7, -, 0, 5, -, 1, 1, " ", 1, 5, :, 3, 9, :, 3, 0), Array(1, 2, 3, 4, 5, |, 4, 5, ., 5, 2, |, R, e, g, u, l, a, r, |, S, h, o, p, Y, o, u, r, W, a, y, ., c, o, m, |, ...), ...
x.map(x => x.split("|")).map(y => Product(y(0).toInt, y(1).toDouble, y(2), y(3), y(4))).toDF.show
+---------+-----+---------+---------+-------+
|productId|price|saleEvent|rivalName|fetchTS|
+---------+-----+---------+---------+-------+
|        1|  2.0|        3|        4|      5|
|        1|  2.0|        3|        4|      5|
|        1|  2.0|        3|        4|      5|
|        4|  3.0|        1|        5|      7|
|        4|  3.0|        1|        5|      7|
|        4|  3.0|        1|        5|      7|
|        3|  6.0|        1|        3|      0|
|        3|  6.0|        1|        3|      0|
+---------+-----+---------+---------+-------+
Why does the output look like the above? It should look something like this:

+---------+-----+---------+---------+-------+
|productId|price|saleEvent|rivalName|fetchTS|
+---------+-----+---------+---------+-------+
|12345 | 78.73 | Special | VistaCart.com | 2017-05-11 15:39:30|
+---------+-----+---------+---------+-------+

split takes a regular expression, so use "\\|" instead of "|".

That will give you the correct result.
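For example, a minimal sketch of the corrected pipeline, reusing the Product case class and the header-stripped RDD x from the question (the escaped pipe is the only real change; the trims are a defensive assumption in case the file has spaces around the delimiter):

// "\\|" escapes the pipe so it is a literal character, not an empty regex alternation
val products = x
  .map(_.split("\\|").map(_.trim))
  .map(y => Product(y(0).toInt, y(1).toDouble, y(2), y(3), y(4)))
  .toDF()

products.show(false)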

Also, if you want a DataFrame at the end anyway, why not read it directly as:

spark.read
  .option("header", true)
  .option("delimiter", "|")
  .schema(Encoders.product[Product].schema)
  .csv("testfile.txt")
  .as[Product]
Output:

+---------+-------+---------+---------------+--------------------+
|productId|price  |saleEvent|rivalName      |fetchTS             |
+---------+-------+---------+---------------+--------------------+
|12345    |78.73  |Special  |VistaCart.com  |2017-05-11 15:39:30 |
|12345    |45.52  |Regular  |ShopYourWay.com|2017-05-11 16:09:43 |
|12345    |89.52  |Sale     |MarketPlace.com|2017-05-11 16:07:29 |
|678      |1348.73|Regular  |VistaCart.com  |2017-05-11 15:58:06 |
|678      |1348.73|Special  |ShopYourWay.com|2017-05-11 15:44:22 |
|678      |1232.29|Daily    |MarketPlace.com|2017-05-11 15:53:03 |
|777      |908.57 |Daily    |VistaCart.com  |2017-05-11 15:39:01 |
+---------+-------+---------+---------------+--------------------+

Comments:

Why not read it directly into a DataFrame using sqlContext?

Thanks for the quick response. I did read it using sqlContext and it works, but I wanted to understand where I was going wrong when working with the RDD.

Agreed @user8371915, my mistake was that I did not search properly. Thanks for highlighting it.

Thanks for the quick response, it works. Awesome. I also get an error when mapping the RDD to the Product object once fetchTS is declared as a Timestamp: case class Product(productId: Int, price: Double, saleEvent: String, rivalName: String, fetchTS: Timestamp), then x.map(x => x.split("\\|")).map(y => Product(y(0).toInt, y(1).toDouble, y(2), y(3), y(4))) fails with ":34: error: type mismatch; found: String, required: java.sql.Timestamp". I am googling how to convert to a Timestamp.

You cannot cast the string to a Timestamp directly; you need to parse it and convert it.

OK, thanks for the reply.
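The last comments mention a type mismatch once fetchTS is declared as java.sql.Timestamp. A minimal sketch of one way to handle it, assuming the yyyy-MM-dd HH:mm:ss format shown in the data (which java.sql.Timestamp.valueOf accepts directly):

import java.sql.Timestamp

case class Product(productId: Int, price: Double, saleEvent: String,
                   rivalName: String, fetchTS: Timestamp)

// parse the last column instead of passing the raw string through
val typed = x
  .map(_.split("\\|").map(_.trim))
  .map(y => Product(y(0).toInt, y(1).toDouble, y(2), y(3), Timestamp.valueOf(y(4))))
  .toDF()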