Apache Spark: subtract rows of an RDD based on values in a second RDD


I have two RDDs, named releventResults and ranoms1.

releventResults contains the following data:

2:DestIP:173.194.116.42,1:SrIP:172.20.16.121,3:DestPort:80,=>4:Time_Range:11:00-12:00 = 1.0
2:DestIP:172.20.16.4,1:SrIP:172.20.16.51,3:DestPort:80,=>4:Time_Range:16:00-17:00 = 0.13
2:DestIP:216.92.251.5,4:Time_Range:10:00-11:00,3:DestPort:80,=>1:SrIP:172.20.16.64 = 1.0
2:DestIP:172.20.16.9,1:SrIP:172.20.16.82,3:DestPort:80,=>4:Time_Range:17:00-18:00 = 0.13
2:DestIP:190.93.247.58,1:SrIP:172.20.16.102,4:Time_Range:12:00-13:00,=>3:DestPort:80 = 1.0
2:DestIP:140.98.193.112,1:SrIP:172.20.16.110,3:DestPort:80,=>4:Time_Range:15:00-16:00 = 0.9
2:DestIP:91.189.92.201,1:SrIP:172.20.16.58,3:DestPort:80,=>4:Time_Range:11:00-12:00 = 1.0
1:SrIP:172.20.16.121,4:Time_Range:09:00-10:00,3:DestPort:80,=>2:DestIP:199.27.79.196 = 0.03
1:SrIP:172.20.16.111,4:Time_Range:10:00-11:00,3:DestPort:80,=>2:DestIP:185.31.19.196 = 0.01
2:DestIP:88.221.48.112,1:SrIP:172.20.16.107,4:Time_Range:16:00-17:00,=>3:DestPort:80 = 1.0
1:SrIP:172.20.16.60,2:DestIP:91.189.92.152,3:DestPort:80,=>4:Time_Range:07:00-8:00 = 1.0
4:Time_Range:14:00-15:00,1:SrIP:172.20.16.51,3:DestPort:80,=>2:DestIP:172.20.16.7 = 0.15
2:DestIP:172.20.16.10,1:SrIP:172.20.16.82,4:Time_Range:11:00-12:00,=>3:DestPort:3910 = 1.0
2:DestIP:198.252.206.16,4:Time_Range:12:00-13:00,1:SrIP:172.20.16.106,=>3:DestPort:80 = 1.0
2:DestIP:23.235.43.130,4:Time_Range:13:00-14:00,3:DestPort:80,=>1:SrIP:172.20.16.106 = 1.0
1:SrIP:172.20.16.76,2:DestIP:172.20.16.64,4:Time_Range:17:00-18:00,=>3:DestPort:2869 = 1.0
ranoms1 contains:

1:SrIP:172.20.16.103 2:DestIP:54.225.129.170 3:DestPort:80 4:Time_Range:12:00-13:00
1:SrIP:172.20.16.89 2:DestIP:172.20.16.83 3:DestPort:5357 4:Time_Range:12:00-13:00
1:SrIP:172.20.16.105 2:DestIP:110.93.194.234 3:DestPort:80 4:Time_Range:12:00-13:00
1:SrIP:172.20.16.84 2:DestIP:172.20.16.64 3:DestPort:2869 4:Time_Range:12:00-13:00
1:SrIP:172.20.16.96 2:DestIP:82.178.158.26 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.105 2:DestIP:82.163.79.170 3:DestPort:80 4:Time_Range:10:00-11:00
1:SrIP:172.20.16.115 2:DestIP:92.122.48.122 3:DestPort:80 4:Time_Range:10:00-11:00
1:SrIP:172.20.16.105 2:DestIP:46.102.243.70 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.51 2:DestIP:216.34.181.59 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.31 2:DestIP:95.101.72.17 3:DestPort:80 4:Time_Range:10:00-11:00
1:SrIP:172.20.16.51 2:DestIP:54.75.236.43 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.103 2:DestIP:68.232.34.200 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.89 2:DestIP:172.20.16.34 3:DestPort:5357 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.124 2:DestIP:107.20.214.255 3:DestPort:80 4:Time_Range:11:00-12:00
I have the following code:

var finalRanoms = ranoms1
  .filter { line =>
    val p = line.split(" ")
    // intended: keep this row when some releventResults line contains p(1),
    // but referencing releventResults inside another RDD's transformation
    // like this is not valid in Spark
    releventResults.map(x => x.contains(p(1)))
  }

I want to filter ranoms1 down to those rows whose second element, the DestIP, is contained in releventResults.

filter on an RDD means keeping the elements that satisfy a predicate over that same RDD. To get an RDD of the elements two RDDs have in common, use the intersection API; to get an RDD of the elements in set A − B, use the subtract / subtractByKey API:

val duplicates = rdd1.intersection(rdd2)  
val nonDuplicates = rdd1.subtract(rdd2)
val nonDuplicatesByKey = rdd1.subtractByKey(rdd2)
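A minimal sketch of what these three operations return, run against a local SparkContext (assumes spark-core is on the classpath; the object name and sample values are illustrative only):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SetOpsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("set-ops"))

    val rdd1 = sc.parallelize(Seq("a", "b", "c"))
    val rdd2 = sc.parallelize(Seq("b", "c", "d"))

    // elements present in both RDDs
    println(rdd1.intersection(rdd2).collect().sorted.mkString(","))  // b,c
    // elements of rdd1 that are NOT in rdd2 (set A - B)
    println(rdd1.subtract(rdd2).collect().mkString(","))             // a

    // subtractByKey works on pair RDDs and compares keys only;
    // the values of the second RDD are ignored
    val pairs1 = sc.parallelize(Seq(("b", 1), ("x", 2)))
    val pairs2 = sc.parallelize(Seq(("b", 99)))
    println(pairs1.subtractByKey(pairs2).collect().mkString(","))    // (x,2)

    sc.stop()
  }
}
```

Note that subtract compares whole elements, while subtractByKey drops every pair whose key appears in the other RDD, regardless of value.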
To filter rdd2 by the IPs present in rdd1, I would convert both into key-value RDDs, using the IP as the key, and then subtract by key:

val rdd1Pairs = rdd1.map(x => (getIpKeyFromRdd1(x), x))
val rdd2Pairs = rdd2.map(x => (getIpKeyFromRdd2(x), x))
val nonDuplicatesByKey = rdd2Pairs.subtractByKey(rdd1Pairs)
val rdd2Filtered = nonDuplicatesByKey.values

You will have to implement the getIpKeyFromRdd1 and getIpKeyFromRdd2 functions yourself.
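One possible implementation of those two extractors, matched to the sample row formats shown in the question (the object name and regex are my own; I assume the destination IP is always tagged `2:DestIP:<ip>`, which holds for every sample row, including the ones where it appears after the `=>`):

```scala
object IpKeys {
  // Both row formats tag the destination IP as "2:DestIP:<ip>", so one
  // unanchored regex can pull the key out of rows from either RDD.
  private val DestIp = """2:DestIP:([0-9.]+)""".r.unanchored

  // releventResults row: comma-separated "n:Name:value" fields plus "=> ... = score"
  def getIpKeyFromRdd1(row: String): String = row match {
    case DestIp(ip) => ip
    case _          => ""   // no DestIP field found
  }

  // ranoms1 row: space-separated "n:Name:value" fields
  def getIpKeyFromRdd2(row: String): String = row match {
    case DestIp(ip) => ip
    case _          => ""
  }
}
```

With these in place, `rdd2Pairs.subtractByKey(rdd1Pairs).values` drops every ranoms1 row whose DestIP also occurs anywhere in a releventResults row, which matches the substring-containment behaviour asked for in the comments below.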

Which version of Spark are you using?

Possible duplicate of another question, although the OP there actually wants to select elements, not remove them.

@Leet Falcon the intersection of rdd1 and rdd2 is already empty; what I want is a substring-containment check of rdd2's elements against rdd1. For example, in the sample data above, if rdd1 contains the 2:DestIP:54.225.129.170 from a row of rdd2, then I want to subtract from rdd2 all rows containing this DestIP:54.225.129.170.