Scala: Update a DataFrame column based on the value of another column

Tags: scala, apache-spark, apache-spark-sql

I am trying to update the value of a column using the value of another column in Scala.

This is the data in my DataFrame:

+-------------------+------+------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier|   _c0|   _c1|  _c2|   _c3| _c4|                 _c5|isBadRecord|
+-------------------+------+------+-----+------+----+--------------------+-----------+
|                  1|     0|     0| Name|     0|Desc|                    |          0|
|                  2|  2.11| 10000|Juice|     0| XYZ|2016/12/31 : Inco...|          0|
|                  3|-0.500|-24.12|Fruit|  -255| ABC| 1994-11-21 00:00:00|          0|
|                  4| 0.087|  1222|Bread|-22.06|    | 2017-02-14 00:00:00|          0|
|                  5| 0.087|  1222|Bread|-22.06|    |                    |          0|
+-------------------+------+------+-----+------+----+--------------------+-----------+

Here the _c5 column contains an incorrect value (the value in row 2 contains the string "Incorrect"), and based on that I want to update its isBadRecord field to 1.

Is there a way to update this field?

You can use the withColumn API with one of the built-in functions to fill in 1 for the bad record.

For your case, you can write a udf function:

import org.apache.spark.sql.functions.udf

def fillbad = udf((c5: String) => if (c5.contains("Incorrect")) 1 else 0)

and call it as:

val newDF = dataframe.withColumn("isBadRecord", fillbad(dataframe("_c5")))
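
For reference, a minimal self-contained sketch of this approach (the session setup, the two-column sample data, and the null check are additions of this write-up, assuming _c5 may contain nulls):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local").appName("fillbad").getOrCreate()
import spark.implicits._

// Hypothetical two-column sample standing in for the full DataFrame
val dataframe = Seq(
  (1, "1994-11-21 00:00:00"),
  (2, "2016/12/31 : Incorrect data")
).toDF("UniqueRowIdentifier", "_c5")

// Null-safe variant of the answer's udf: a null _c5 is treated as a good record
val fillbad = udf((c5: String) => if (c5 != null && c5.contains("Incorrect")) 1 else 0)

dataframe.withColumn("isBadRecord", fillbad(dataframe("_c5"))).show()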

The best option is to create a UDF that tries to convert the value to Date format: if it can be converted, return 0; otherwise return 1.

This works even when the date format itself is bad:

import java.text.SimpleDateFormat

import scala.util.{Failure, Success, Try}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local")
  .appName("test").getOrCreate()

import spark.implicits._

// create a test dataframe
val data = spark.sparkContext.parallelize(Seq(
  (1, "1994-11-21 Xyz"),
  (2, "1994-11-21 00:00:00"),
  (3, "1994-11-21 00:00:00")
)).toDF("id", "date")

// create a udf that tries to parse the value as a date:
// returns 0 on success and 1 on failure
val check = udf((value: String) => {
  Try(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(value)) match {
    case Success(_) => 0
    case Failure(_) => 1
  }
})

// add the flag column
data.withColumn("badData", check($"date")).show()

Hope this helps.
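
As an aside, on Spark 2.2 or later the same check can be expressed without a UDF, because the built-in to_timestamp returns null for values that do not match the pattern. A sketch, continuing from the data DataFrame above (the version requirement is an assumption of this write-up):

import org.apache.spark.sql.functions.{to_timestamp, when}

// to_timestamp yields null when the string does not match the pattern,
// so a null result flags the row as bad data
data.withColumn(
  "badData",
  when(to_timestamp($"date", "yyyy-MM-dd HH:mm:ss").isNull, 1).otherwise(0)
).show()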


Rather than reasoning about updating it, I suggest you think about it as you would in SQL; you can do the following:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.when

val spark: SparkSession = ??? // your spark session
val df: DataFrame = ??? // your dataframe

import spark.implicits._

df.select(
  $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
  $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
Following is a self-contained script that you can copy and paste into your Spark shell to see the result locally:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

sc.setLogLevel("ERROR")

val schema = 
  StructType(Seq(
    StructField("UniqueRowIdentifier", IntegerType),
    StructField("_c0", DoubleType),
    StructField("_c1", DoubleType),
    StructField("_c2", StringType),
    StructField("_c3", DoubleType),
    StructField("_c4", StringType),
    StructField("_c5", StringType),
    StructField("isBadRecord", IntegerType)))

val contents =
  Seq(
    Row(1,  0.0  ,     0.0 ,  "Name",    0.0, "Desc",                       "", 0),
    Row(2,  2.11 , 10000.0 , "Juice",    0.0,  "XYZ", "2016/12/31 : Incorrect", 0),
    Row(3, -0.5  ,   -24.12, "Fruit", -255.0,  "ABC",    "1994-11-21 00:00:00", 0),
    Row(4,  0.087,  1222.0 , "Bread",  -22.06,    "",    "2017-02-14 00:00:00", 0),
    Row(5,  0.087,  1222.0 , "Bread",  -22.06,    "",                       "", 0)
  )

val df = spark.createDataFrame(sc.parallelize(contents), schema)

df.show()

val withBadRecords =
  df.select(
    $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
    $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")

withBadRecords.show()
The relevant output is the following:

+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier|  _c0|    _c1|  _c2|   _c3| _c4|                 _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|                  1|  0.0|    0.0| Name|   0.0|Desc|                    |          0|
|                  2| 2.11|10000.0|Juice|   0.0| XYZ|2016/12/31 : Inco...|          0|
|                  3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00|          0|
|                  4|0.087| 1222.0|Bread|-22.06|    | 2017-02-14 00:00:00|          0|
|                  5|0.087| 1222.0|Bread|-22.06|    |                    |          0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+

+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier|  _c0|    _c1|  _c2|   _c3| _c4|                 _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|                  1|  0.0|    0.0| Name|   0.0|Desc|                    |          0|
|                  2| 2.11|10000.0|Juice|   0.0| XYZ|2016/12/31 : Inco...|          1|
|                  3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00|          0|
|                  4|0.087| 1222.0|Bread|-22.06|    | 2017-02-14 00:00:00|          0|
|                  5|0.087| 1222.0|Bread|-22.06|    |                    |          0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
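
Since the answer frames the problem in SQL terms, the same logic can also be written as literal Spark SQL over a temporary view. A sketch (the view name "records" is hypothetical):

df.createOrReplaceTempView("records")

spark.sql("""
  SELECT UniqueRowIdentifier, _c0, _c1, _c2, _c3, _c4, _c5,
         CASE WHEN _c5 LIKE '%Incorrect%' THEN 1 ELSE 0 END AS isBadRecord
  FROM records
""").show()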


Comments:

- How can I use the withColumn api to check the value of one column and update another column based on it?
- Updated my answer, please check.
- Instead of creating a UDF to check a simple "contains", wouldn't it be better to use "contains" inside withColumn itself? I have used it many times: df.withColumn("isBadRecord", when(col("_c5").contains("Incorrect"), 1).otherwise(0))
- @AvikAggarwal Instead of commenting on and downvoting a working answer, why not answer the question with another answer? Again, thanks for the downvote on a valid answer.
- Would the answer still work if _c5 had "2016/12/31 : Incorrect data" as its value? I guess not.
- You are right; I fixed my answer to reflect the requirements in the original question.
- Does the above answer work if the field value contains a different word instead of "Incorrect"?
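
For completeness, the withColumn variant suggested in the comments, written out as a runnable snippet (the imports are added here; df stands for the questioner's DataFrame):

import org.apache.spark.sql.functions.{col, when}

// flag rows whose _c5 contains "Incorrect", without writing a UDF
val flagged = df.withColumn("isBadRecord",
  when(col("_c5").contains("Incorrect"), 1).otherwise(0))
flagged.show()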