
Splitting a DataFrame in Spark Scala


I have a DataFrame called rankedDF:

+----------+-----------+----------+-------------+-----------------------+
|TimePeriod|TPStartDate| TPEndDate|TXN_HEADER_ID|                  Items|
+----------+-----------+----------+-------------+-----------------------+
|         1| 2017-03-01|2017-05-30|   TxnHeader1|   Womens Socks, Men...|
|         1| 2017-03-01|2017-05-30|   TxnHeader4|   Mens Pants, Mens ...|
|         1| 2017-03-01|2017-05-30|   TxnHeader7|   Womens Socks, Men...|
|         2| 2019-03-01|2017-05-30|   TxnHeader1|Calcetas Mujer, Calc...|
|         2| 2019-03-01|2017-05-30|   TxnHeader4|  Pantalones H, Pan ...|
|         2| 2019-03-01|2017-05-30|   TxnHeader7| Calcetas Mujer, Pan...|
+----------+-----------+----------+-------------+-----------------------+
So, I need to split this DataFrame by each TimePeriod to use as input for another function, but using only the Items column.

I tried this:

val timePeriods = rankedDF.select("TimePeriod").distinct()
So, at this point, I have:

+----------+
|TimePeriod|
+----------+
|         1|
|         2|
+----------+
Based on these time periods, I need to call my function twice:

timePeriods.foreach { n =>
  val justItems = rankedDF
    .filter(col("TimePeriod") === n.getInt(0))
    .select("Items")
}
Well... I was expecting this DataFrame:

+--------------------+
|               Items|
+--------------------+
|Womens Socks, Men...|
|Mens Pants, Mens ...|
|Womens Socks, Men...|
+--------------------+
Instead, I got the following error:

task 170.0 in stage 40.0 (TID 2223)
java.lang.NullPointerException
                at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
                at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
                at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:799)
                at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:720)
                at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:718)
                at scala.collection.Iterator$class.foreach(Iterator.scala:727)
                at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
                at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
                at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
                at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
                at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
                at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
                at org.apache.spark.scheduler.Task.run(Task.scala:89)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at java.lang.Thread.run(Thread.java:748)
18/04/24 11:49:32 WARN TaskSetManager: Lost task 170.0 in stage 40.0 (TID 2223, localhost): java.lang.NullPointerException

18/04/24 11:49:32 ERROR TaskSetManager: Task 170 in stage 40.0 failed 1 times; aborting job

Why can't I access my DataFrame dynamically?

You need to collect the distinct values to the driver first; only then can you use map. The foreach in your code runs inside the executors, and the rankedDF reference captured by that closure is unusable there, which is what triggers the NullPointerException:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import spark.implicits._ // sqlContext.implicits._ on Spark 1.x; supplies the Int encoder for .as[Int]

    val rankedDF: DataFrame = ???

    // Collect the distinct time periods back to the driver as a plain Array[Int],
    // then map over that local array, where rankedDF can safely be referenced
    val timePeriods = rankedDF.select("TimePeriod").distinct().as[Int].collect()

    val dataFrames: Array[DataFrame] = timePeriods.map(tp => rankedDF.where(col("TimePeriod") === tp))
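
From there, keeping only the Items column per time period can also be done on the driver. A minimal sketch, where processItems is a hypothetical stand-in for the other function mentioned in the question:

    // processItems is a hypothetical placeholder for the function that consumes one period's items
    def processItems(items: DataFrame): Unit = ???

    // dataFrames is a local Array, so this foreach runs on the driver and is safe
    dataFrames.foreach { periodDF =>
      processItems(periodDF.select("Items"))
    }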

Thank you, your answer worked very well for the first part of my development, but I got stuck on the rest XD