Scala: How to split a string by "|" (pipe) and convert an RDD to a DataFrame
I am trying to read a text file containing product information; the file is pipe-delimited. When I read the data as an RDD and then split it on the delimiter |, the data gets corrupted. I cannot understand why this happens.
#### Input data
productId|price|saleEvent|rivalName|fetchTS
12345|78.73|Special|VistaCart.com|2017-05-11 15:39:30
12345|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43
12345|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29
678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06
678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01
#### Spark shell code
import org.apache.spark.sql.Encoder
import spark.implicits._
case class Product(productId:Int, price:Double, saleEvent:String, rivalName:String, fetchTS:String)
val rdd = spark.sparkContext.textFile("/home/prabhat/Documents/Spark/sampledata/competitor_data_10.txt")
##########removing headers
val x = rdd.mapPartitionsWithIndex{(idx,iter) => if(idx==0)iter.drop(1) else iter}
########## why RDD **x** here is comma separated
x.map(x=>x.split("|")).take(10)
res74: Array[Array[String]] = Array(Array(1, 2, 3, 4, 5, |, 7, 8, ., 7, 3, |, S, p, e, c, i, a, l, |, V, i, s, t, a, C, a, r, t, ., c, o, m, |, 2, 0, 1, 7, -, 0, 5, -, 1, 1, " ", 1, 5, :, 3, 9, :, 3, 0, ...), ...
x.map(x=>x.split("|")).map(y => Product(y(0).toInt,y(1).toDouble,y(2),y(3),y(4))).toDF.show
+---------+-----+---------+---------+-------+
|productId|price|saleEvent|rivalName|fetchTS|
+---------+-----+---------+---------+-------+
|        1|  2.0|        3|        4|      5|
|        1|  2.0|        3|        4|      5|
|        1|  2.0|        3|        4|      5|
|        4|  3.0|        1|        5|      7|
|        4|  3.0|        1|        5|      7|
|        4|  3.0|        1|        5|      7|
|        3|  6.0|        1|        3|      0|
|        3|  6.0|        1|        3|      0|
+---------+-----+---------+---------+-------+
Why does the output look like the above? It should look something like this:
+---------+-----+---------+-------------+-------------------+
|productId|price|saleEvent|    rivalName|            fetchTS|
+---------+-----+---------+-------------+-------------------+
|    12345|78.73|  Special|VistaCart.com|2017-05-11 15:39:30|
+---------+-----+---------+-------------+-------------------+
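split takes a regular expression, and an unescaped | is the regex alternation operator, so the pattern matches the empty string between every pair of characters. Use "\\|" instead of "|". That will give you the correct result.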
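To make the difference concrete, here is a minimal sketch in plain Scala (no Spark required) that splits one of the sample rows in several ways. The object name `PipeSplitDemo` and its field names are made up for illustration; `Pattern.quote` and the `Char` overload of `split` are alternatives to manual escaping, not something the original answer mentions.

```scala
import java.util.regex.Pattern

object PipeSplitDemo {
  // One row copied from the sample input.
  val line = "12345|78.73|Special|VistaCart.com|2017-05-11 15:39:30"

  // Unescaped "|" is regex alternation between two empty patterns,
  // so it matches between every character and yields single-character pieces.
  val wrong: Array[String] = line.split("|")

  // Escaping the pipe makes the regex match a literal '|'.
  val escaped: Array[String] = line.split("\\|")

  // Pattern.quote builds a literal pattern without hand-written escapes.
  val quoted: Array[String] = line.split(Pattern.quote("|"))

  // Scala's split(Char) overload treats the character literally.
  val byChar: Array[String] = line.split('|')

  def main(args: Array[String]): Unit = {
    println(wrong.take(8).mkString("[", ", ", "]"))
    println(escaped.mkString("[", ", ", "]"))
  }
}
```

All three literal variants should return the same five fields. Note that `split` drops trailing empty fields by default; passing a negative limit, as in `line.split("\\|", -1)`, keeps them, which matters if the last column can be blank.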
Also, if you ultimately want a DataFrame, why not read the file directly as one:
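import org.apache.spark.sql.Encoders

spark.read
  .option("header", true)                    // first line holds the column names
  .option("delimiter", "|")                  // literal pipe, no regex escaping needed here
  .schema(Encoders.product[Product].schema)  // derive the schema from the case class
  .csv("testfile.txt")
  .as[Product]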
Output:
+---------+-------+---------+---------------+--------------------+
|productId|price  |saleEvent|rivalName      |fetchTS             |
+---------+-------+---------+---------------+--------------------+
|12345    |78.73  |Special  |VistaCart.com  |2017-05-11 15:39:30 |
|12345    |45.52  |Regular  |ShopYourWay.com|2017-05-11 16:09:43 |
|12345    |89.52  |Sale     |MarketPlace.com|2017-05-11 16:07:29 |
|678      |1348.73|Regular  |VistaCart.com  |2017-05-11 15:58:06 |
|678      |1348.73|Special  |ShopYourWay.com|2017-05-11 15:44:22 |
|678      |1232.29|Daily    |MarketPlace.com|2017-05-11 15:53:03 |
|777      |908.57 |Daily    |VistaCart.com  |2017-05-11 15:39:01 |
+---------+-------+---------+---------------+--------------------+
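Why not read it directly into a DataFrame using sqlContext?

Thanks for the quick response. I read it using sqlContext and it works, but I want to understand where I went wrong when working with the RDD.

Agreed, @user8371915. My mistake, I did not search properly. Thanks for highlighting it.

Thanks for the quick response, it works. Awesome. I also get an error when mapping the RDD to Product objects:
case class Product(productId:Int, price:Double, saleEvent:String, rivalName:String, fetchTS:Timestamp)
scala> x.map(x=>x.split("\\|")).map(y => Product(y(0).toInt,y(1).toDouble,y(2),y(3),y(4)))
:34: error: type mismatch; found: String, required: java.sql.Timestamp
I am searching Google for how to convert to a Timestamp.

You cannot pass the string directly as a Timestamp; you need to parse it as a date and convert it.

OK, thanks for your reply.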
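As a rough illustration of the last suggestion, the sketch below parses one pipe-delimited line into a `Product` whose `fetchTS` is a `java.sql.Timestamp`. It assumes the timestamp column always has the form `yyyy-MM-dd HH:mm:ss`, which is what `Timestamp.valueOf` accepts; the `ParseProduct` object and its `parse` helper are hypothetical names introduced here, and the case class mirrors the one from the comment.

```scala
import java.sql.Timestamp
import scala.util.Try

// Mirrors the case class from the comment, with fetchTS as a Timestamp.
case class Product(productId: Int, price: Double, saleEvent: String,
                   rivalName: String, fetchTS: Timestamp)

object ParseProduct {
  // Hypothetical helper: returns None for rows that do not parse cleanly.
  def parse(line: String): Option[Product] = {
    // Escape the pipe; -1 keeps trailing empty fields; trim stray whitespace.
    val f = line.split("\\|", -1).map(_.trim)
    if (f.length != 5) None
    else Try(
      Product(
        f(0).toInt,
        f(1).toDouble,
        f(2),
        f(3),
        Timestamp.valueOf(f(4)) // expects "yyyy-MM-dd HH:mm:ss[.fffffffff]"
      )
    ).toOption
  }
}
```

In the Spark shell, something along the lines of `x.flatMap(line => ParseProduct.parse(line)).toDF` should then work, since Spark's case class encoders support `java.sql.Timestamp`; malformed rows are silently dropped in this sketch, so log or count them if that matters for your data.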