How to append the last record of each change in Scala

Tags: scala, apache-spark, scala-collections

I'm new to Scala. What I'm currently doing is filtering data from a big dataset and writing it out as a CSV. The CSV I produce looks like this:

id         time                              status
___        _____                            _________
1        2016-10-09 00:09:10                    100
1        2016-10-09 00:09:30                    100
1        2016-10-09 00:09:50                    100
1        2016-10-09 00:10:10                    900
2        2016-10-09 00:09:18                    100
2        2016-10-09 00:09:20                    100
2        2016-10-09 00:10:24                    900
3        2016-10-09 00:09:30                    100
3        2016-10-09 00:09:33                    100
3        2016-10-09 00:09:36                    100
3        2016-10-09 00:09:39                    100
3        2016-10-09 00:09:51                    900
I'm using the following code to print the data:

import scala.collection.mutable.ListBuffer

var count = 0
val statusList = ListBuffer[String]()
for (currentRow <- sortedRow) {
  // keep every row while the status is still 100
  if (currentRow.status == 100) {
    statusList += s"${currentRow.id},${currentRow.time},${currentRow.status}"
  }
  // when the next row is the change to 900, append it as well
  if (count + 1 < sortedRow.size && sortedRow(count + 1).status == 900) {
    statusList += s"${sortedRow(count + 1).id},${sortedRow(count + 1).time},${sortedRow(count + 1).status}"
  }
  count += 1
}
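The manual index bookkeeping above can be avoided with plain collection operations. Below is a minimal sketch, assuming a hypothetical `Record` case class standing in for the rows of `sortedRow`, and assuming the rows are already sorted by id and time:

```scala
// Hypothetical case class mirroring one row of the sorted data.
case class Record(id: Int, time: String, status: Int)

// For each id, keep every status-100 row plus the first status-900 row.
def statusLines(rows: Seq[Record]): Seq[String] =
  rows.groupBy(_.id).toSeq.sortBy(_._1).flatMap { case (_, group) =>
    // span splits the group at the first row whose status is not 100,
    // so changed.take(1) is exactly the first change to 900
    val (running, changed) = group.span(_.status == 100)
    (running ++ changed.take(1)).map(r => s"${r.id},${r.time},${r.status}")
  }
```

Because `span` stops at the first non-100 row, any later 900 rows for the same id are dropped automatically.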

I would suggest the dataframes solution, which is an optimized and improved API built on top of RDDs.

I am assuming the data is in the following format, with a header line:

id,time,status
1,2016-10-09 00:09:10,100
1,2016-10-09 00:09:30,100
1,2016-10-09 00:09:50,100
1,2016-10-09 00:10:10,900
The first step is to read the file into a dataframe using the sqlContext:

val sqlContext = sparkSession.sqlContext
val dataframe = sqlContext.read.format("csv").option("header", "true").load("absolute path to the input file")
You should then have a dataframe as

+---+-------------------+------+
|id |time               |status|
+---+-------------------+------+
|1  |2016-10-09 00:09:10|100   |
|1  |2016-10-09 00:09:30|100   |
|1  |2016-10-09 00:09:50|100   |
|1  |2016-10-09 00:10:10|900   |
|2  |2016-10-09 00:09:18|100   |
|2  |2016-10-09 00:09:20|100   |
|2  |2016-10-09 00:10:24|900   |
|3  |2016-10-09 00:09:30|100   |
|3  |2016-10-09 00:09:33|100   |
|3  |2016-10-09 00:09:36|100   |
|3  |2016-10-09 00:09:39|100   |
|3  |2016-10-09 00:09:51|900   |
+---+-------------------+------+
The next step is to filter the dataframe into two dataframes, one per status:

val df1 = dataframe.filter(dataframe("status") === "100")
Its output is as follows:

+---+-------------------+------+
|id |time               |status|
+---+-------------------+------+
|1  |2016-10-09 00:09:10|100   |
|1  |2016-10-09 00:09:30|100   |
|1  |2016-10-09 00:09:50|100   |
|2  |2016-10-09 00:09:18|100   |
|2  |2016-10-09 00:09:20|100   |
|3  |2016-10-09 00:09:30|100   |
|3  |2016-10-09 00:09:33|100   |
|3  |2016-10-09 00:09:36|100   |
|3  |2016-10-09 00:09:39|100   |
+---+-------------------+------+
Do the same for df2 with the 900 status, but rename the columns:

val df2 = dataframe.filter(dataframe("status") === "900")
  .withColumnRenamed("id", "id2")
  .withColumnRenamed("time", "changed_time")
  .withColumnRenamed("status", "status2")
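As an aside, the three chained withColumnRenamed calls can equivalently be written as a single select with aliases; a sketch (the name `df2Alt` is only illustrative):

```scala
import org.apache.spark.sql.functions.col

// Filter to the 900 rows and rename all three columns in one select.
val df2Alt = dataframe.filter(col("status") === "900")
  .select(
    col("id").as("id2"),
    col("time").as("changed_time"),
    col("status").as("status2"))
```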
The output should be:

+---+-------------------+-------+
|id2|changed_time       |status2|
+---+-------------------+-------+
|1  |2016-10-09 00:10:10|900    |
|2  |2016-10-09 00:10:24|900    |
|3  |2016-10-09 00:09:51|900    |
+---+-------------------+-------+
The final step is to join the two dataframes:

val finalDF = df1.join(df2, df1("id") === df2("id2"), "left")
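Note that this join pairs every 100 row with every 900 row of the same id. If an id can change to 900 more than once and only the earliest change should be kept, one option is to aggregate before joining. A sketch under that assumption (the min aggregation and the names `firstChange`/`finalWithFirstChange` are mine, not part of the answer above):

```scala
import org.apache.spark.sql.functions.min

// Keep only the earliest 900 row per id (note: this drops the status2 column).
val firstChange = dataframe.filter(dataframe("status") === "900")
  .groupBy("id")
  .agg(min("time").as("changed_time"))
  .withColumnRenamed("id", "id2")

// Join as before; each 100 row now matches at most one changed_time.
val finalWithFirstChange = df1.join(firstChange, df1("id") === firstChange("id2"), "left")
```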
The final output looks like this:

+---+-------------------+------+---+-------------------+-------+
|id |time               |status|id2|changed_time       |status2|
+---+-------------------+------+---+-------------------+-------+
|1  |2016-10-09 00:09:10|100   |1  |2016-10-09 00:10:10|900    |
|1  |2016-10-09 00:09:30|100   |1  |2016-10-09 00:10:10|900    |
|1  |2016-10-09 00:09:50|100   |1  |2016-10-09 00:10:10|900    |
|2  |2016-10-09 00:09:18|100   |2  |2016-10-09 00:10:24|900    |
|2  |2016-10-09 00:09:20|100   |2  |2016-10-09 00:10:24|900    |
|3  |2016-10-09 00:09:30|100   |3  |2016-10-09 00:09:51|900    |
|3  |2016-10-09 00:09:33|100   |3  |2016-10-09 00:09:51|900    |
|3  |2016-10-09 00:09:36|100   |3  |2016-10-09 00:09:51|900    |
|3  |2016-10-09 00:09:39|100   |3  |2016-10-09 00:09:51|900    |
+---+-------------------+------+---+-------------------+-------+
Saving the final dataframe to a csv file is equally simple:

finalDF.write.format("csv").save("absolute path to output filename")
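One practical caveat: by default Spark writes one CSV part-file per partition and no header line. If a single file with a header is wanted, a common (if shuffle-heavy) variant is:

```scala
finalDF.coalesce(1)             // collapse to one partition, hence one part-file
  .write
  .format("csv")
  .option("header", "true")     // write the column names as the first line
  .save("absolute path to output directory")
```

Note that save still creates a directory; the single part-file lives inside it.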


Comments:

"You could separate the two statuses into two CSVs, but what is the rule for appending? Is it appended randomly, or are there strict appending rules?"
"I could separate them, but for further analysis I need to keep the format above."
"You did not read my question carefully. I asked what the rule for combining them is."
"Actually, the whole data is a sequence in which the status is 100 and then changes to 900 for the first time (for each id). So I need to append the first change from 100 to 900 for each id."
"That is exactly the expected output. Sorry for the misleading output."