Java Apache Spark SQL: grouping approach
I have the following JSON data coming from RabbitMQ:
{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:30","data":{"RunStatus":1}}
{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:35","data":{"RunStatus":3}}
{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:40","data":{"RunStatus":2}}
{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:45","data":{"RunStatus":3}}
{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:50","data":{"RunStatus":2}}
{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:21:35","data":{"RunStatus":1}}
{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:21:45","data":{"RunStatus":3}}
{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:21:50","data":{"RunStatus":2}}
{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:21:55","data":{"RunStatus":3}}
{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:22:00","data":{"RunStatus":2}}
I am trying to compute how long each device spends in each RunStatus. For the data above, device MACH-101 breaks down like this:
In RunStatus 1 the device spent 5 seconds (30-35)
In RunStatus 2 the device spent 5 seconds (40-45)
In RunStatus 3 the device spent 10 seconds (35-40 + 45-50)
The same logic applies to the second device's data.
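The expected computation can be sketched without Spark at all: sort a device's events by timestamp, treat the gap to the next event as time spent in the current RunStatus, and sum per status. A minimal plain-Java sketch (the class name and the in-memory sample data are illustrative, not part of the original code):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.LinkedHashMap;
import java.util.Map;

public class RunStatusDurations {
    // Timestamps arrive in "dd-MM-yyyy HH:mm:ss" form, as in the JSON above.
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("dd-MM-yyyy HH:mm:ss");

    // Sums, per RunStatus, the seconds between each event and its successor.
    // The last event has no successor, so it contributes nothing.
    static Map<Integer, Long> durations(String[][] events) {
        Map<Integer, Long> out = new LinkedHashMap<>();
        for (int i = 0; i < events.length - 1; i++) {
            LocalDateTime cur = LocalDateTime.parse(events[i][0], FMT);
            LocalDateTime next = LocalDateTime.parse(events[i + 1][0], FMT);
            int status = Integer.parseInt(events[i][1]);
            out.merge(status, ChronoUnit.SECONDS.between(cur, next), Long::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] mach101 = {
            {"29-06-2017 15:21:30", "1"},
            {"29-06-2017 15:21:35", "3"},
            {"29-06-2017 15:21:40", "2"},
            {"29-06-2017 15:21:45", "3"},
            {"29-06-2017 15:21:50", "2"}
        };
        System.out.println(durations(mach101)); // {1=5, 3=10, 2=5}
    }
}
```

Running this on the MACH-101 rows yields 5 s in status 1, 5 s in status 2, and 10 s in status 3, matching the expected breakdown above.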
Below is the Apache Spark SQL query I am trying, but it does not produce the desired result. Please suggest alternatives; I don't mind a non-SQL approach either.
public static void main(String[] args) {
    try {
        mconf = new SparkConf();
        mconf.setAppName("RabbitMqReceiver");
        mconf.setMaster("local[*]");
        jssc = new JavaStreamingContext(mconf, Durations.seconds(10));
        SparkSession spksess = SparkSession
                .builder()
                .master("local[*]")
                .appName("RabbitMqReceiver2")
                .getOrCreate();
        SQLContext sqlctxt = new SQLContext(spksess);
        JavaDStream<String> strmData = jssc.receiverStream(new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));
        JavaDStream<String> machineData = strmData.window(Durations.minutes(1), Durations.seconds(10));
        sqlctxt.udf().register("custdatediff", new UDF2<String, String, String>() {
            @Override
            public String call(String argdt1, String argdt2) {
                DateTimeFormatter formatter = DateTimeFormat.forPattern("dd-MM-yyyy HH:mm:ss");
                DateTime dt1 = formatter.parseDateTime(argdt1);
                DateTime dt2 = formatter.parseDateTime(argdt2);
                Seconds retsec = org.joda.time.Seconds.secondsBetween(dt2, dt1);
                return retsec.toString();
            }
        }, DataTypes.StringType);
        machineData.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> rdd) {
                if (!rdd.isEmpty()) {
                    Dataset<Row> df = sqlctxt.jsonRDD(rdd);
                    df.createOrReplaceTempView("DeviceData");
                    // I DONT WANT to GROUP by timestamp, but the query requires I pass it.
                    Dataset<Row> searchResult = sqlctxt.sql("select t1.DeviceId,t1.data.runstatus,"
                            + " custdatediff(CAST((t1.timestamp) as STRING),CAST((t2.timestamp) as STRING)) as duration from DeviceData t1"
                            + " join DeviceData t2 on t1.DeviceId = t2.DeviceId group by t1.DeviceId,t1.data.runstatus,t1.timestamp,t2.timestamp");
                    searchResult.show();
                }
            }
        });
        jssc.start();
        jssc.awaitTermination();
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
You can try using DataFrames and window functions. With the `lead` window function you can compare the current row's timestamp to the next row's timestamp and compute the difference for each device and run status, like this:
val windowSpec_wk = Window.partitionBy(df1("DeviceID")).orderBy(df1("timestamp"))
val df2 = df1.withColumn("period", lead(df1("timestamp"), 1).over(windowSpec_wk))
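Since the question is in Java, the same `lead`-based approach can be sketched with the Java DataFrame API (the class name, the `statusDurations` helper, and the inline sample data are illustrative; this assumes Spark 2.2+ for `read().json(Dataset<String>)`):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;

public class RunStatusWindow {

    // Total seconds per device and RunStatus: lead() pairs each row with the
    // next event's time for the same device, the gap is the time spent in the
    // current status, and the last row's null gap is ignored by sum().
    static Dataset<Row> statusDurations(Dataset<Row> df) {
        Dataset<Row> events = df.withColumn("ts",
                unix_timestamp(col("TimeStamp"), "dd-MM-yyyy HH:mm:ss"));
        WindowSpec byDevice = Window.partitionBy(col("DeviceId")).orderBy(col("ts"));
        return events
                .withColumn("nextTs", lead(col("ts"), 1).over(byDevice))
                .withColumn("duration", col("nextTs").minus(col("ts")))
                .groupBy(col("DeviceId"), col("data.RunStatus"))
                .agg(sum("duration").alias("totalSeconds"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("RunStatusWindow").getOrCreate();
        List<String> json = Arrays.asList(
                "{\"DeviceId\":\"MACH-101\",\"TimeStamp\":\"29-06-2017 15:21:30\",\"data\":{\"RunStatus\":1}}",
                "{\"DeviceId\":\"MACH-101\",\"TimeStamp\":\"29-06-2017 15:21:35\",\"data\":{\"RunStatus\":3}}",
                "{\"DeviceId\":\"MACH-101\",\"TimeStamp\":\"29-06-2017 15:21:40\",\"data\":{\"RunStatus\":2}}",
                "{\"DeviceId\":\"MACH-101\",\"TimeStamp\":\"29-06-2017 15:21:45\",\"data\":{\"RunStatus\":3}}",
                "{\"DeviceId\":\"MACH-101\",\"TimeStamp\":\"29-06-2017 15:21:50\",\"data\":{\"RunStatus\":2}}");
        Dataset<Row> df = spark.read().json(spark.createDataset(json, Encoders.STRING()));
        statusDurations(df).show();
        spark.stop();
    }
}
```

This avoids the self-join and the timestamp GROUP BY from the question entirely, since the window frame already lines up each row with its successor.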
1. Use Structured Streaming for this. I haven't heard of a RabbitMQ source for it, but I don't think it would be hard to write one (since you already have one for Spark Streaming). 2. Can you elaborate on "but not getting the desired result"? What do you get instead? Edit the question for more comments. -- Thanks, I've added an update to the main question. What I do now is use the idea above to get the final set, then run a simple Java function over it to get the result. It works well.