
Scala: filtering records to check for the presence of a particular column gives java.lang.NullPointerException


So I have a DataFrame of records in this format -

{
    "table": "SYSMAN.EM_METRIC_COLUMN_VER_E",
    "op_type": "I",
    "op_ts": "2021-03-24 13:15:31.396105",
    "pos": "00000000000000000000",
    "after": {
        "METRIC_GROUP_ID": 4700,
        "METRIC_COLUMN_ID": 293339,
        "METRIC_GROUP_VERSION_ID": 41670
    }
}
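
For reference, the sample record can be loaded into a DataFrame like this (a minimal sketch, assuming a spark-shell style session where the SparkSession is available as spark; the answers below refer to this DataFrame as df):

// Load the single sample record so the snippets below can be reproduced.
// The schema (including the nested "after" struct) is inferred from the JSON.
import spark.implicits._

val json = """{"table":"SYSMAN.EM_METRIC_COLUMN_VER_E","op_type":"I","op_ts":"2021-03-24 13:15:31.396105","pos":"00000000000000000000","after":{"METRIC_GROUP_ID":4700,"METRIC_COLUMN_ID":293339,"METRIC_GROUP_VERSION_ID":41670}}"""

val df = spark.read.json(Seq(json).toDS)
df.printSchema()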
I want to filter these records based on the presence of a certain column. If a record has that column (such as METRIC_GROUP_ID, METRIC_COLUMN_ID or METRIC_GROUP_VERSION_ID) inside the "after" struct, I want to add it to a list.

This is the code I wrote -

def HasColumn(row: Row, Column: String) =
  Try(row.getAs[Row]("before").getAs[Any](Column)).isSuccess || Try(row.getAs[Row]("after").getAs[Any](Column)).isSuccess

var records_list: List[Row] = null   

for (row <- inputDS) { if (HasColumn(row, Column_String)) { records_list :+ row } }
I know that you cannot access any of Spark's "driver-side" abstractions (RDDs, DataFrames, Datasets, SparkSession, ...) from inside a function passed to one of Spark's DataFrame/RDD transformations, because they only exist in the driver application. So I tried to avoid that as much as possible, but I still haven't found a working solution.
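
For what it's worth, the usual pattern here is to do the filtering on the executors and only collect the matching rows back to the driver. A sketch, assuming inputDS is a Dataset[Row] and Column_String holds the column name (both from the code above); note also that records_list is initialised to null and that :+ returns a new list rather than mutating the old one, so the loop above cannot work even locally:

// Sketch only: the filter runs on the executors; collect() brings just the
// matching rows back to the driver, where the List is built.
val records_list: List[Row] =
  inputDS
    .filter(row => HasColumn(row, Column_String))
    .collect()
    .toList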

Try the code below.

Create a UDF

scala> import org.apache.spark.sql.Row
scala> import scala.util.Try
scala> def hasColumn = udf((row:Row,column:String) => Try(row.getAs[Row]("before").getAs[Any](column)).isSuccess || Try(row.getAs[Row]("after").getAs[Any](column)).isSuccess)
Use the UDF to check whether the column is available

scala> df.withColumn("has",hasColumn(struct($"*"),lit("METRIC_COLUMN_ID"))).show(false)
+-------------------+--------------------------+-------+--------------------+-----------------------------+----+
|after              |op_ts                     |op_type|pos                 |table                        |has |
+-------------------+--------------------------+-------+--------------------+-----------------------------+----+
|[293339,4700,41670]|2021-03-24 13:15:31.396105|I      |00000000000000000000|SYSMAN.EM_METRIC_COLUMN_VER_E|true|
+-------------------+--------------------------+-------+--------------------+-----------------------------+----+
Then add a filter condition on the new column

scala> df.withColumn("has",hasColumn(struct($"*"),lit("METRIC_COLUMN_ID"))).filter($"has" === true).show(false)
+-------------------+--------------------------+-------+--------------------+-----------------------------+----+
|after              |op_ts                     |op_type|pos                 |table                        |has |
+-------------------+--------------------------+-------+--------------------+-----------------------------+----+
|[293339,4700,41670]|2021-03-24 13:15:31.396105|I      |00000000000000000000|SYSMAN.EM_METRIC_COLUMN_VER_E|true|
+-------------------+--------------------------+-------+--------------------+-----------------------------+----+
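
To get back to the original goal of collecting the matching records into a list, the filtered DataFrame can simply be collected on the driver (a sketch; collect() assumes the matching rows fit in driver memory):

// Build the List[Row] of matching records from the UDF-based filter above.
val records_list: List[Row] =
  df.withColumn("has", hasColumn(struct($"*"), lit("METRIC_COLUMN_ID")))
    .filter($"has" === true)
    .drop("has")          // drop the helper column again
    .collect()
    .toList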

I don't know whether you are doing this with RDDs or with Datasets. If RDDs, the solution would look something like a function:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import scala.util.Try

def filterData(data: RDD[Row], column: String): RDD[Row] = {
  data.filter { r =>
    Try(r.getAs[Row]("before").getAs[Any](column))
      .orElse(Try(r.getAs[Row]("after").getAs[Any](column)))
      .isSuccess
  }
}
If you want to cut down the amount of code there, you can do

def filterData(data: RDD[Row], column: String): RDD[Row] = {
  data.filter { r =>
    Seq("before", "after")
      .map(c => Try(r.getAs[Row](c)).map(_.getAs[Any](column)))
      .reduce(_ orElse _)
      .isSuccess
  }
}
The nice thing about doing it this way is that if you want to search more places than just "before" and "after", you only need to add them to the Seq.
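
Either version of filterData can then be applied to the underlying RDD of the DataFrame from the examples above, for instance (a usage sketch):

// df.rdd exposes the DataFrame as an RDD[Row] that filterData can consume.
val matching = filterData(df.rdd, "METRIC_COLUMN_ID")
matching.count()                                    // 1 for the sample record
filterData(df.rdd, "Column_does_not_exist").count() // 0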

For a Dataset you just need to check that the column exists and is not null

df.where(col(column).isNotNull)

Really, both approaches assume you have a fixed schema (even an inferred one), in which case the Dataset version is much simpler.
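
For example, with a fixed (or inferred) schema the presence of a nested field can be decided from the schema alone, without evaluating any rows. A sketch, using a hypothetical helper hasNestedField (not part of the answer above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Hypothetical helper: does `structCol` exist in df's schema and contain `field`?
def hasNestedField(df: DataFrame, structCol: String, field: String): Boolean =
  df.schema.fields
    .find(_.name == structCol)
    .map(_.dataType)
    .collect { case s: StructType => s.fieldNames.contains(field) }
    .getOrElse(false)

hasNestedField(df, "after", "METRIC_COLUMN_ID")      // true for the sample record
hasNestedField(df, "after", "Column_does_not_exist") // false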

Could you post the complete code?
scala> df.withColumn("has",hasColumn(struct($"*"),lit("METRIC_COLUMN_ID"))).filter($"has" === true).show(false)
+-------------------+--------------------------+-------+--------------------+-----------------------------+----+
|after              |op_ts                     |op_type|pos                 |table                        |has |
+-------------------+--------------------------+-------+--------------------+-----------------------------+----+
|[293339,4700,41670]|2021-03-24 13:15:31.396105|I      |00000000000000000000|SYSMAN.EM_METRIC_COLUMN_VER_E|true|
+-------------------+--------------------------+-------+--------------------+-----------------------------+----+
scala> df.withColumn("has",hasColumn(struct($"*"),lit("Column_does_not_exist"))).filter($"has" === true).show(false)
+-----+-----+-------+---+-----+---+
|after|op_ts|op_type|pos|table|has|
+-----+-----+-------+---+-----+---+
+-----+-----+-------+---+-----+---+