Apache Spark: when a Tuple2's key is the original object in mapToPair

Tags: apache-spark, mapreduce, spark-streaming

I have a JavaDStream sourceDStream used for stream processing.

In this stream's mapToPair, I use the input object as both the key and the value of the Tuple2, as in

Case 1:

public Tuple2<SourceObject, SourceObject> call(SourceObject sourceObject) {
    Tuple2<SourceObject, SourceObject> tuple2;
    tuple2 = new Tuple2<>(sourceObject, sourceObject);
    return tuple2;
}
However, Spark:

  • When sourceDStream is small (say, 50 elements or fewer), never calls SourceObject's equals, and consequently reduceByKey is never invoked at all. Duplicate keys are therefore not reduced/merged before foreachPartition runs.

  • Even when sourceDStream is larger, say 100+ elements, calls SourceObject's equals for only a small subset of the objects, even though many more objects in sourceDStream share the same key. reduceByKey is therefore not applied to the remaining objects with equal keys.

  • In both cases, foreachPartition ends up having to process an excessive number of objects that share the same key.
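A plausible cause (my assumption; the question doesn't show SourceObject's source) is that SourceObject overrides equals() without a matching hashCode(). Hash-based grouping, which reduceByKey relies on, first buckets keys by hash code and only falls back to equals() when the hashes match, so keys that are equal() but hash differently are never compared or merged. The same contract is visible with a plain HashMap, using hypothetical key classes:

```java
import java.util.HashMap;
import java.util.Map;

public class HashContractDemo {
    // Overrides equals but NOT hashCode -- mirrors the suspected problem.
    static class BrokenKey {
        final int id;
        BrokenKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BrokenKey && ((BrokenKey) o).id == id;
        }
        // hashCode() inherited from Object: two equal keys hash differently.
    }

    // Overrides both equals and hashCode consistently.
    static class GoodKey {
        final int id;
        GoodKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof GoodKey && ((GoodKey) o).id == id;
        }
        @Override public int hashCode() { return Integer.hashCode(id); }
    }

    static int distinctBroken() {
        Map<BrokenKey, Integer> counts = new HashMap<>();
        counts.merge(new BrokenKey(1), 1, Integer::sum);
        counts.merge(new BrokenKey(1), 1, Integer::sum);
        return counts.size(); // equal keys were never merged
    }

    static int distinctGood() {
        Map<GoodKey, Integer> counts = new HashMap<>();
        counts.merge(new GoodKey(1), 1, Integer::sum);
        counts.merge(new GoodKey(1), 1, Integer::sum);
        return counts.size(); // equal keys were merged into one entry
    }

    public static void main(String[] args) {
        System.out.println(distinctBroken()); // prints 2
        System.out.println(distinctGood());   // prints 1
    }
}
```

The same reasoning applies to Spark's shuffle: without a consistent hashCode, equal keys land in different buckets (or even different partitions) and equals is rarely reached.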

    However, when I use a wrapper object as the key for sourceObject, as in the code below

    Case 2:

    public class SourceKey {
        private SourceObject sourceObject;
    
        public void setSourceObject (SourceObject sourceObject) {
          this.sourceObject = sourceObject;
        }
    
        public boolean equals (Object obj) {
          ...
        }
     }  
    
     public Tuple2<SourceKey, SourceKey> call(SourceObject sourceObject) {
        Tuple2<SourceKey, SourceKey> tuple2;
        SourceKey sourceKey = new SourceKey();
        sourceKey.setSourceObject(sourceObject);
        tuple2 = new Tuple2<>(sourceKey, sourceKey);
        return tuple2;
    }
    
    
    Then Spark works as expected: it calls SourceKey's equals for all objects in sourceDStream, and reduceByKey is therefore applied to all objects with equal keys.
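For the wrapper approach of case 2 to work reliably, equals() and hashCode() must agree on the key class. A minimal sketch of such a SourceKey, assuming a hypothetical id field on SourceObject (the real class isn't shown in the question); for actual Spark use both classes would additionally need to be Serializable:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class SourceKeyDemo {
    // Hypothetical stand-in for the real SourceObject (not shown in the question).
    static class SourceObject {
        final String id;
        SourceObject(String id) { this.id = id; }
    }

    // Wrapper key: equals and hashCode delegate to the same field,
    // so hash-based operations like reduceByKey can group equal keys.
    static class SourceKey implements java.io.Serializable {
        private SourceObject sourceObject;

        public void setSourceObject(SourceObject sourceObject) {
            this.sourceObject = sourceObject;
        }

        @Override public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof SourceKey)) return false;
            SourceKey other = (SourceKey) obj;
            return Objects.equals(sourceObject.id, other.sourceObject.id);
        }

        @Override public int hashCode() {
            return Objects.hashCode(sourceObject.id);
        }
    }

    static SourceKey keyFor(String id) {
        SourceKey key = new SourceKey();
        key.setSourceObject(new SourceObject(id));
        return key;
    }

    static int distinctKeys() {
        // Two keys wrapping distinct objects with the same id compare equal
        // and hash identically, so a hash-based structure merges them.
        Set<SourceKey> set = new HashSet<>();
        set.add(keyFor("a"));
        set.add(keyFor("a"));
        set.add(keyFor("b"));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctKeys()); // prints 2
    }
}
```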

    For case 1, why does Spark skip calling SourceObject's equals when SourceObject is also used as the key/value of the Tuple2 in mapToPair?

    How can this be fixed so that Spark calls SourceObject's equals for all objects in sourceDStream and thereby reduces the objects that share the same key?

    Thanks,

    Michael
