How to append the last record of each change in Scala
I am new to Scala. What I am currently doing is filtering data from a large dataset and writing it out as CSV. The CSV I produce looks like this:
id time status
___ _____ _________
1 2016-10-09 00:09:10 100
1 2016-10-09 00:09:30 100
1 2016-10-09 00:09:50 100
1 2016-10-09 00:10:10 900
2 2016-10-09 00:09:18 100
2 2016-10-09 00:09:20 100
2 2016-10-09 00:10:24 900
3 2016-10-09 00:09:30 100
3 2016-10-09 00:09:33 100
3 2016-10-09 00:09:36 100
3 2016-10-09 00:09:39 100
3 2016-10-09 00:09:51 900
I am printing the data using the following code:
import scala.collection.mutable.ListBuffer

var count = 0
val statusList = ListBuffer[String]()
for (currentRow <- sortedRow) {
  if (currentRow.status == 100) {
    statusList += s"${currentRow.id},${currentRow.time},${currentRow.status}"
  }
  // Guard against reading past the end of the sequence
  if (count + 1 < sortedRow.size && sortedRow(count + 1).status == 900) {
    val next = sortedRow(count + 1)
    statusList += s"${next.id},${next.time},${next.status}"
  }
  count += 1
}
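The manual index bookkeeping above can be avoided entirely with Scala's collection operations. Below is a minimal sketch of the same "keep every 100-row, and also emit the following row when it is a 900-row" logic; the `Record` case class and the sample rows are hypothetical stand-ins for `sortedRow`:

```scala
// Hypothetical stand-in for the rows of sortedRow
case class Record(id: Int, time: String, status: Int)

val sortedRow = List(
  Record(1, "2016-10-09 00:09:10", 100),
  Record(1, "2016-10-09 00:09:30", 100),
  Record(1, "2016-10-09 00:10:10", 900),
  Record(2, "2016-10-09 00:09:18", 100),
  Record(2, "2016-10-09 00:10:24", 900)
)

// Mirror of the loop: keep each 100-row; also keep the successor when it
// is a 900-row. lift(i + 1) returns None past the end, so no bounds check
// or mutable counter is needed.
val statusList: List[String] =
  sortedRow.zipWithIndex.flatMap { case (cur, i) =>
    val self = if (cur.status == 100) List(cur) else Nil
    val next = sortedRow.lift(i + 1).filter(_.status == 900).toList
    self ++ next
  }.map(r => s"${r.id},${r.time},${r.status}")
```

This produces the same five lines as the imperative version, but cannot throw an `IndexOutOfBoundsException`.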
I would suggest the dataframes approach, which is optimized and improved over RDDs.
I assume the data is in the following format, with a header line:
id,time,status
1,2016-10-09 00:09:10,100
1,2016-10-09 00:09:30,100
1,2016-10-09 00:09:50,100
1,2016-10-09 00:10:10,900
The first step is to read the file using the sqlContext:
val sqlContext = sparkSession.sqlContext
val dataframe = sqlContext.read.format("csv").option("header", "true").load("absolute path to the input file")
You should then have a dataframe like:
+---+-------------------+------+
|id |time               |status|
+---+-------------------+------+
|1  |2016-10-09 00:09:10|100   |
|1  |2016-10-09 00:09:30|100   |
|1  |2016-10-09 00:09:50|100   |
|1  |2016-10-09 00:10:10|900   |
|2  |2016-10-09 00:09:18|100   |
|2  |2016-10-09 00:09:20|100   |
|2  |2016-10-09 00:10:24|900   |
|3  |2016-10-09 00:09:30|100   |
|3  |2016-10-09 00:09:33|100   |
|3  |2016-10-09 00:09:36|100   |
|3  |2016-10-09 00:09:39|100   |
|3  |2016-10-09 00:09:51|900   |
+---+-------------------+------+
The next step is to filter the dataframe into two dataframes, one per status:
val df1 = dataframe.filter(dataframe("status") === "100")
The output of df1 should be:
+---+-------------------+------+
|id |time               |status|
+---+-------------------+------+
|1  |2016-10-09 00:09:10|100   |
|1  |2016-10-09 00:09:30|100   |
|1  |2016-10-09 00:09:50|100   |
|2  |2016-10-09 00:09:18|100   |
|2  |2016-10-09 00:09:20|100   |
|3  |2016-10-09 00:09:30|100   |
|3  |2016-10-09 00:09:33|100   |
|3  |2016-10-09 00:09:36|100   |
|3  |2016-10-09 00:09:39|100   |
+---+-------------------+------+
For the 900 status, follow the same approach to get df2, but rename the column names:
val df2 = dataframe.filter(dataframe("status") === "900")
.withColumnRenamed("id", "id2")
.withColumnRenamed("time", "changed_time")
.withColumnRenamed("status", "status2")
The output should be:
+---+-------------------+-------+
|id2|changed_time       |status2|
+---+-------------------+-------+
|1  |2016-10-09 00:10:10|900    |
|2  |2016-10-09 00:10:24|900    |
|3  |2016-10-09 00:09:51|900    |
+---+-------------------+-------+
The final step is to join the two dataframes:
val finalDF = df1.join(df2, df1("id") === df2("id2"), "left")
The final output looks like this:
+---+-------------------+------+---+-------------------+-------+
|id |time               |status|id2|changed_time       |status2|
+---+-------------------+------+---+-------------------+-------+
|1  |2016-10-09 00:09:10|100   |1  |2016-10-09 00:10:10|900    |
|1  |2016-10-09 00:09:30|100   |1  |2016-10-09 00:10:10|900    |
|1  |2016-10-09 00:09:50|100   |1  |2016-10-09 00:10:10|900    |
|2  |2016-10-09 00:09:18|100   |2  |2016-10-09 00:10:24|900    |
|2  |2016-10-09 00:09:20|100   |2  |2016-10-09 00:10:24|900    |
|3  |2016-10-09 00:09:30|100   |3  |2016-10-09 00:09:51|900    |
|3  |2016-10-09 00:09:33|100   |3  |2016-10-09 00:09:51|900    |
|3  |2016-10-09 00:09:36|100   |3  |2016-10-09 00:09:51|900    |
|3  |2016-10-09 00:09:39|100   |3  |2016-10-09 00:09:51|900    |
+---+-------------------+------+---+-------------------+-------+
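The effect of this left join can be mirrored with plain Scala collections, which may help if you want to sanity-check the result outside Spark. This is a minimal sketch; the `Open`/`Change` case classes and the sample rows are hypothetical stand-ins for the rows of df1 and df2:

```scala
// Hypothetical stand-ins for the df1 (status 100) and df2 (status 900) rows
case class Open(id: Int, time: String, status: Int)
case class Change(id2: Int, changedTime: String, status2: Int)

val df1 = List(
  Open(1, "2016-10-09 00:09:10", 100),
  Open(1, "2016-10-09 00:09:30", 100),
  Open(2, "2016-10-09 00:09:18", 100)
)
val df2 = List(
  Change(1, "2016-10-09 00:10:10", 900),
  Change(2, "2016-10-09 00:10:24", 900)
)

// Index the 900-rows by id, then attach the change record to every 100-row.
// A left join keeps every 100-row even when no matching 900-row exists,
// which is what the Option models (None = no match).
val byId: Map[Int, Change] = df2.map(c => c.id2 -> c).toMap
val joined: List[(Open, Option[Change])] = df1.map(o => o -> byId.get(o.id))
```

Building the map assumes at most one 900-row per id, which holds for this data because each id changes status exactly once.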
Saving the final dataframe to a csv file is also straightforward:
finalDF.write.format("csv").save("absolute path to output filename")
Comments:
You could split the two statuses into two CSVs, but what is the rule for appending them? Do you append at random, or is there a strict rule?
I could split them, but for further analysis I need to keep the format above.
You did not read my question carefully. I asked what the rule for combining them is.
Actually, the whole dataset is a sequence in which the status is 100 and then changes to 900 for the first time (for each id). So I need to append the first change from 100 to 900 for each id. That is exactly the expected output. Sorry for the misleading output.
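The rule clarified in these comments (per id, keep all status-100 rows plus only the first row whose status changes to 900) can be sketched with plain Scala collections. The `Rec` case class and sample rows are illustrative, not part of the original code:

```scala
// Hypothetical row type; the timestamps are ISO-formatted strings, so
// lexicographic order matches chronological order here.
case class Rec(id: Int, time: String, status: Int)

val rows = List(
  Rec(3, "2016-10-09 00:09:30", 100),
  Rec(3, "2016-10-09 00:09:33", 100),
  Rec(3, "2016-10-09 00:09:51", 900),
  Rec(3, "2016-10-09 00:09:55", 900)  // later 900s are dropped
)

// Per id: all 100-rows, plus only the FIRST 900-row (find returns an
// Option, which ++ flattens away when there is no 900-row at all).
val result: List[Rec] = rows
  .groupBy(_.id)
  .values
  .toList
  .flatMap { group =>
    val sorted = group.sortBy(_.time)
    sorted.filter(_.status == 100) ++ sorted.find(_.status == 900)
  }
```

Unlike the simple join in the answer, this keeps only the first 900-row per id, so it also behaves correctly if an id has more than one 900 record.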