Splitting a DataFrame in Spark Scala

I have a DataFrame called rankedDF:
+----------+-----------+----------+-------------+-----------------------+
|TimePeriod|TPStartDate| TPEndDate|TXN_HEADER_ID| Items|
+----------+-----------+----------+-------------+-----------------------+
| 1| 2017-03-01|2017-05-30| TxnHeader1|Womens Socks, Men...|
| 1| 2017-03-01|2017-05-30| TxnHeader4|Mens Pants, Mens ... |
| 1| 2017-03-01|2017-05-30| TxnHeader7|Womens Socks, Men...|
| 2| 2019-03-01|2017-05-30| TxnHeader1|Calcetas Mujer, Calc...|
| 2| 2019-03-01|2017-05-30| TxnHeader4|Pantalones H, Pan ... |
| 2| 2019-03-01|2017-05-30| TxnHeader7|Calcetas Mujer, Pan...|
+----------+-----------+----------+-------------+-----------------------+
So I need to split this DataFrame by each TimePeriod, to use as input to another function, keeping only the Items column.
I tried this:
val timePeriods = rankedDF.select("TimePeriod").distinct()
So at this point I have:
|TimePeriod|
| 1 |
| 2 |
Based on these TimePeriods, I need to call my function twice:
timePeriods.foreach {
  n => val justItems = rankedDF.filter(col("TimePeriod") === n.getAs[Int](0)).select("Items")
}
Well... I was expecting this DataFrame:
|Items|
|Womens Socks, Men...
|Mens Pants, Mens ...
|Womens Socks, Men...
Instead, I got the following error:
task 170.0 in stage 40.0 (TID 2223)
java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:799)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:720)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:718)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/04/24 11:49:32 WARN TaskSetManager: Lost task 170.0 in stage 40.0 (TID 2223, localhost): java.lang.NullPointerException
18/04/24 11:49:32 ERROR TaskSetManager: Task 170 in stage 40.0 failed 1 times; aborting job
Why can't I access my DataFrame dynamically?

You need to collect the distinct values to the driver first, then use `map`. (The NullPointerException happens because `foreach` on a DataFrame runs on the executors, and a DataFrame reference like `rankedDF` is not usable there; DataFrame operations can only be invoked from the driver.)
val rankedDF: DataFrame = ???
// .as[Int] needs an encoder in scope: import spark.implicits._
val timePeriods = rankedDF.select("TimePeriod").distinct().as[Int].collect()
val dataFrames: Array[DataFrame] = timePeriods.map(tp => rankedDF.where(col("TimePeriod") === tp))
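To make the driver-side pattern concrete, here is a minimal sketch of the same collect-then-map idea using plain Scala collections, with no Spark required. The `Txn` case class and the sample rows below are illustrative stand-ins for the DataFrame, not part of the original code:

```scala
// Illustrative stand-in for one row of rankedDF.
case class Txn(timePeriod: Int, txnHeaderId: String, items: String)

// Sample data mirroring the question's table (abbreviated).
val rankedRows = Seq(
  Txn(1, "TxnHeader1", "Womens Socks, Men..."),
  Txn(1, "TxnHeader4", "Mens Pants, Mens ..."),
  Txn(2, "TxnHeader1", "Calcetas Mujer, Calc...")
)

// Step 1: materialize the distinct keys locally (the analogue of collect()).
val timePeriods: Seq[Int] = rankedRows.map(_.timePeriod).distinct

// Step 2: map over the local keys, deriving one sub-collection per key.
val itemsByPeriod: Map[Int, Seq[String]] =
  timePeriods
    .map(tp => tp -> rankedRows.filter(_.timePeriod == tp).map(_.items))
    .toMap
```

The key point is that the outer loop runs over a small local collection of keys, while each inner filter is an ordinary operation on the full data set, which is exactly why collecting `timePeriods` first makes the Spark version work.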
Thank you, your answer worked perfectly for the first part of my development, but now I'm stuck on the rest XD