Json Spark Scala: checking fields of nested case classes


I have three case classes, as follows:

case class Result(
   result: Seq[Signal],
   hop:    Int)

case class Signal(
   rtt:  Double,
   from: String)

case class Traceroute(
  dst_name:  String,
  from:      String,
  prb_id:    BigInt,
  msm_id:    BigInt,
  timestamp: BigInt,
  result:    Seq[Result])
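A Traceroute has a result field, which is a sequence of Result. Each Result in turn holds a sequence of Signal.

I am trying to check that the rtt values inside the result field are not negative. My JSON records look like this: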

{"prb_id": 4247, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}, {"rtt": 1.7, "ttl": 255, "from": "10.10.0.5", "size": 28}, {"rtt": 1.709, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
{"timestamp": 1514768409, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
{"timestamp": 1514768402, "result": [{"result": [{"rtt": 19.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 2}]}
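For clarity, I added some attributes to the JSON records. The result attribute corresponds to the result field in the Traceroute case class.

I used a filter to check whether the rtt in each Signal is not negative, but I did not get the expected result: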

val checkrtts = checkError.filter(x => x.result.foreach(p => p.result.foreach(f => checkSignal(f))))
The checkSignal function is as follows:

def checkSignal(signal: Signal): Signal = {
  if (signal.rtt > 0) {
    return signal
  } else {
    return null
  }

}
Here is an example with two Traceroute instances:

{"timestamp": 1514768409, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
{"timestamp": 1514768402, "result": [{"result": [{"rtt": -2.5, "ttl": 255, "from": "89.105.200.57", "size": 28},{"rtt": 19.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 2}]}
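For the first Traceroute, no change is applied. For the second Traceroute, the result.result field has two elements of type Signal; the first Signal has a negative rtt, so it should be removed from result.result. The second Signal should not be removed.

So the output should look like this: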

{"prb_id": 4247, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}, {"rtt": 1.7, "ttl": 255, "from": "10.10.0.5", "size": 28}, {"rtt": 1.709, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
{"timestamp": 1514768409, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
{"timestamp": 1514768402, "result": [{"result": [{"rtt": 19.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 2}]}

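Please help. I am new to Spark and Scala. I have tried many approaches, but the results are not what I expected.

You seem to have a slight misunderstanding of what filter is supposed to do. It drops every whole Traceroute object for which the predicate returns false. What you need instead is a map function that transforms each original Traceroute object into the desired one. Here is an example of how to do that for a Dataset[Traceroute].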
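
Before the full example, the distinction between filter and map can be sketched on a plain Scala List. This is only an illustration and not part of the original answer: the values are hypothetical, and it assumes the goal is to keep values that are strictly positive, as checkSignal does.

```scala
// Hypothetical round-trip times grouped per hop; values are illustrative only.
val hops: List[List[Double]] = List(List(1.955), List(-2.5, 19.955))

// filter keeps or drops each outer element as a whole:
// the second hop is removed entirely because it contains a negative value.
val filtered = hops.filter(hop => hop.forall(_ > 0))

// map transforms each outer element, so only the offending
// inner values are removed and every hop is still present.
val mapped = hops.map(hop => hop.filter(_ > 0))
```

With filter the second hop disappears completely, while with map it survives with only its positive value left, which is the behaviour the question asks for.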

First, modify the case classes slightly, as follows:

case class Result(var result: Seq[Signal],
                   hop:    Int)

case class Signal(rtt:  Double,
                   from: String)

case class Traceroute( dst_name:  String,
                       from:      String,
                       prb_id:    BigInt,
                       msm_id:    BigInt,
                       timestamp: BigInt,
                       result:    Seq[Result])
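As you can see, I have added var to the result field of the Result class. This lets us modify the result field later, inside the custom function that we will pass to the map operation.

Then define the following two functions: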

// Returns true when the signal's rtt is strictly positive.
def checkSignal(signal: Signal): Boolean = signal.rtt > 0

def removeNegative(traceroute: Traceroute): Traceroute = {
  for (hopResult <- traceroute.result) {
    // Keep only the signals that pass checkSignal, then reassign the
    // filtered list to the (var) result field of this Result.
    hopResult.result = hopResult.result.filter(checkSignal)
  }
  traceroute
}
The output is as follows:

Showing 10 rows of original dataset
+--------+----+------+------+----------+-------------------------------------------------------+
|dst_name|from|prb_id|msm_id|timestamp |result                                                 |
+--------+----+------+------+----------+-------------------------------------------------------+
|null    |null|null  |null  |1514768409|[[[[1.955, 89.105.200.57]], 1]]                        |
|null    |null|null  |null  |1514768402|[[[[-2.5, 89.105.200.57], [19.955, 89.105.200.57]], 2]]|
+--------+----+------+------+----------+-------------------------------------------------------+

Showing 10 rows of transformed dataset
+--------+----+------+------+----------+--------------------------------+
|dst_name|from|prb_id|msm_id|timestamp |result                          |
+--------+----+------+------+----------+--------------------------------+
|null    |null|null  |null  |1514768409|[[[[1.955, 89.105.200.57]], 1]] |
|null    |null|null  |null  |1514768402|[[[[19.955, 89.105.200.57]], 2]]|
+--------+----+------+------+----------+--------------------------------+
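
As a possible alternative to the var-based version, the same transformation can be sketched without any mutation by using the copy method that Scala generates for case classes. This is not the answer's code: the name removeNegativeImmutable and the sample value are hypothetical, and the case classes are repeated here only so the snippet is self-contained.

```scala
// Same shape as the case classes above, but with no var: everything is immutable.
case class Signal(rtt: Double, from: String)
case class Result(result: Seq[Signal], hop: Int)
case class Traceroute(
  dst_name:  String,
  from:      String,
  prb_id:    BigInt,
  msm_id:    BigInt,
  timestamp: BigInt,
  result:    Seq[Result])

// Hypothetical helper: builds a new Traceroute whose hops keep only
// the signals with rtt > 0, leaving the input object untouched.
def removeNegativeImmutable(t: Traceroute): Traceroute =
  t.copy(result = t.result.map(r => r.copy(result = r.result.filter(_.rtt > 0))))

// Sample mirroring the second record in the question; fields that are
// absent from the JSON are set to null here, as in the output shown above.
val sample = Traceroute(null, null, null, null, BigInt(1514768402),
  Seq(Result(Seq(Signal(-2.5, "89.105.200.57"), Signal(19.955, "89.105.200.57")), 2)))

val cleaned = removeNegativeImmutable(sample)
```

On a Dataset[Traceroute], a function like this would presumably be applied as checkError.map(t => removeNegativeImmutable(t)), assuming spark.implicits._ is in scope to supply the encoder. Because nothing is mutated, the original dataset rows stay unchanged.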

Could you add what you got and what you expected? If needed, add more rows of data to show the output properly.