Apache Spark: reduce in Spark (tags: apache-spark, rdd, reduce)


I have streaming data like the following:

id, date, value
i1, 12-01-2016, 10
i2, 12-02-2016, 20
i1, 12-01-2016, 30
i2, 12-05-2016, 40

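I want to reduce by id so that I get the aggregated value for each date, like this:

The required output RDD holds, for each id, a list with one slot per day of the year (365 days). Each value has to go into the list position given by its day of the year (e.g. 12-01-2016), and because device i1 has two records with the same date, those two values should be summed.

id, List [0|1|2|3|...  |336  |337|...|340|...|365]
i1,                    |10+30|                       - this goes to position 336
i2,                          |20 |   |40 |           - these go to positions 337 and 340

Please point me to the reduce or group-by transformation that does this.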
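The positions in the example (336, 337, 340) match the 1-based day of the year when the dates are read as month-day-year, so 12-01-2016 is December 1. A minimal sketch of that mapping in plain Java, assuming the `MM-dd-yyyy` format; the class and method names (`DayPosition`, `positionOf`) are made up for illustration:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class DayPosition {
    // Assumption: dates look like "12-01-2016" (month-day-year).
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MM-dd-yyyy");

    // Returns the 1-based day of the year, used here as the list position.
    static int positionOf(String date) {
        return LocalDate.parse(date.trim(), FORMAT).getDayOfYear();
    }

    public static void main(String[] args) {
        for (String date : new String[] {"12-01-2016", "12-02-2016", "12-05-2016"}) {
            System.out.println(date + " -> " + positionOf(date));
        }
    }
}
```

Because 2016 is a leap year, the last position is 366, so a list indexed this way needs 367 slots if index 0 is left unused.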

I'll provide a basic code snippet with some assumptions, since you haven't specified the language, data source, or data format.

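import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

JavaDStream<String> lineStream = //Your data source for stream
// Step 1: key each record by "id,date" and sum the values that share a key.
JavaPairDStream<String, Long> firstReduce = lineStream.mapToPair(line -> {
    String[] fields = line.split(",");
    // Keep the comma in the key so it can be split back into id and date below.
    String idDate = fields[0].trim() + "," + fields[1].trim();
    Long value = Long.valueOf(fields[2].trim());
    return new Tuple2<String, Long>(idDate, value);
}).reduceByKey((v1, v2) -> v1 + v2);
// Step 2: split the key back into id and date.
firstReduce.map(idDateValueTuple -> {
    String idDate = idDateValueTuple._1();
    Long valueSum = idDateValueTuple._2();
    String id = idDate.split(",")[0];
    String date = idDate.split(",")[1];
    //TODO parse date and put the sumValue in array as you wish
});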
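The remaining TODO step could look roughly like the sketch below. It is plain Java with no Spark dependency: a `Map.merge` call stands in for `reduceByKey`, and the class name `YearArrays`, its method names, and the 367-slot layout (index 0 unused, 1 to 366 for the day of the year) are assumptions made for illustration, not part of the answer above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class YearArrays {
    // Assumption: dates look like "12-01-2016" (month-day-year).
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MM-dd-yyyy");

    // Stand-in for the reduceByKey step: sum the values per "id,date" key.
    static Map<String, Long> sumByIdAndDate(String[] lines) {
        Map<String, Long> sums = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            String key = fields[0].trim() + "," + fields[1].trim();
            sums.merge(key, Long.valueOf(fields[2].trim()), Long::sum);
        }
        return sums;
    }

    // The TODO step: split the key and store each sum at its day-of-year index.
    static Map<String, long[]> toYearArrays(Map<String, Long> sums) {
        Map<String, long[]> result = new TreeMap<>();
        for (Map.Entry<String, Long> entry : sums.entrySet()) {
            String[] idAndDate = entry.getKey().split(",");
            int day = LocalDate.parse(idAndDate[1], FORMAT).getDayOfYear();
            // Index 0 stays unused; 1..366 covers a leap year.
            result.computeIfAbsent(idAndDate[0], id -> new long[367])[day] += entry.getValue();
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "i1, 12-01-2016, 10", "i2, 12-02-2016, 20",
            "i1, 12-01-2016, 30", "i2, 12-05-2016, 40"
        };
        Map<String, long[]> arrays = toYearArrays(sumByIdAndDate(lines));
        System.out.println("i1[336] = " + arrays.get("i1")[336]);
        System.out.println("i2[337] = " + arrays.get("i2")[337]);
        System.out.println("i2[340] = " + arrays.get("i2")[340]);
    }
}
```

In a Spark job the second step would still need a per-id grouping or a second reduce, since each (id, date) pair arrives as its own record.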


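This is as far as I could get. I don't know how to add the arrays element by element in the last step. I hope this helps! If you work out the last step, or find another approach, please post it here; it would be much appreciated.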

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Days elapsed since 01-01-2016 (0-based), so "12-01-2016" gives 335.
def getDateDifference(dateStr: String): Int = {
  val startDate = "01-01-2016"
  val formatter = DateTimeFormatter.ofPattern("MM-dd-yyyy")
  val oldDate = LocalDate.parse(startDate, formatter)
  val newDate = LocalDate.parse(dateStr.trim, formatter)
  newDate.toEpochDay().toInt - oldDate.toEpochDay().toInt
}

// One slot per day of a leap year, with `data` stored at the given day offset.
def getArray(numberofDays: Int, data: Int): Iterable[Int] = {
  val daysArray = new Array[Int](366)
  daysArray(numberofDays) = data
  daysArray
}

val idRDD = <read from stream>
// Key by (id, date); the value is (day offset, value).
val idRDDMap = idRDD.map { rec =>
  val fields = rec.split(",").map(_.trim)
  ((fields(0), fields(1)), (getDateDifference(fields(1)), fields(2).toInt))
}
// Re-key by id alone, with one mostly empty year array per record.
val idRDDconsiceMap = idRDDMap.map { rec => (rec._1._1, getArray(rec._2._1, rec._2._2)) }
val finalRDD = idRDDconsiceMap.reduceByKey((acc, value) => (???add each element of the arrays????))
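For the open last step, the function passed to `reduceByKey` only has to add two equally sized arrays position by position; in Scala this could likely be written as `acc.zip(value).map { case (a, b) => a + b }`. The same idea as a small plain-Java sketch, where `ArrayMerge` and `addElementwise` are hypothetical names and no Spark is involved:

```java
import java.util.Arrays;

class ArrayMerge {
    // Adds two arrays position by position and returns a new array.
    static int[] addElementwise(int[] left, int[] right) {
        if (left.length != right.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int[] sum = new int[left.length];
        for (int i = 0; i < left.length; i++) {
            sum[i] = left[i] + right[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two records for the same id and the same day offset (335).
        int[] first = new int[366];
        int[] second = new int[366];
        first[335] = 10;
        second[335] = 30;
        int[] merged = addElementwise(first, second);
        System.out.println(merged[335]);
        System.out.println(Arrays.stream(merged).sum());
    }
}
```

Returning a new array instead of mutating `left` keeps the function free of side effects, at the cost of one allocation per merge.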


Is this Spark Streaming or Structured Streaming? What have you tried so far? Where is the problem?

The problem is the dynamic update of the list and how to reduce. If I reduce by id, all values get aggregated regardless of the day of the year.

What code do you already have? Is this Spark Streaming? Can you explain why you want the final result in an array?

Sorry, I forgot to mention that I'm using Scala.

Doesn't matter. The code above can easily be converted to Scala.

I don't think the logic above works for the case id=id2, because every entry for id2 in the given sample data has a different date.

It will work, because I am reducing on the string form. If id=i2, my stream would be: []

I think the reduce should be on the id column only, and the date column only decides the position of the value in the list. I may have completely misunderstood the question. Also, please read through my solution; it may make my point clearer.