Python Spark: string manipulation from column A to column B

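Hi all, I'm new to Spark and would like to know how to do a string operation so that Column2 minus Column1 gives Column3.

Note: my data is in a DataFrame.

Basically I have two string columns, and I only want the items that exist in Column2 but not in Column1, so that I can produce them as Column3.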

Column1
SAMPLE_OUT_3_APPLE|BANANA|GUAVA|ORANGE

Column2
SAMPLE_OUT_3_APPLE|BANANA|GUAVA|GRAPES|ORANGE|BERRY

Then Column3 should be

Column3
GRAPES,BERRY
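
For Column1 and Column2, I would also like to show them like this:

APPLE,BANANA,ORANGE

that is, just remove
SAMPLE_OUT_3
and make the values comma-separated.

For Spark >= 2.4, you can use array_except: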

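import org.apache.spark.sql.functions._  // split, col, array_except, concat_ws
import spark.implicits._

val df = Seq(
  ("SAMPLE_OUT_3_APPLE|BANANA|GUAVA|ORANGE" ,"SAMPLE_OUT_3_APPLE|BANANA|GUAVA|GRAPES|ORANGE|BERRY")
).toDF("column1", "column2")

// Keep only the text after "3_", i.e. drop the "SAMPLE_OUT_3_" prefix.
val remove = df.columns.map(column => split(col(column), "3_").getItem(1).as(column))

val resultDF = df.select(remove: _*)
  // Turn each pipe-delimited string into an array.
  .withColumn("column1", split($"column1", "\\|"))
  .withColumn("column2", split($"column2", "\\|"))
  // column3 = items of column2 that are not in column1.
  .withColumn("column3", array_except($"column2", $"column1"))
  // Remove the column3 items from column1 and column2.
  .withColumn("column1", array_except($"column1", $"column3"))
  .withColumn("column2", array_except($"column2", $"column3"))

// Join each array back into a pipe-delimited string.
val convertToString = resultDF.columns.map(column => concat_ws("|", col(column)).as(column))
resultDF.select(convertToString: _*).show(false)
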
Output:

+-------------------------+-------------------------+------------+
|column1                  |column2                  |column3     |
+-------------------------+-------------------------+------------+
|APPLE|BANANA|GUAVA|ORANGE|APPLE|BANANA|GUAVA|ORANGE|GRAPES|BERRY|
+-------------------------+-------------------------+------------+

If you use "," as the concat_ws separator and leave out the last two array_except calls on column1 and column2, the output is:

+-------------------------+--------------------------------------+------------+
|column1                  |column2                               |column3     |
+-------------------------+--------------------------------------+------------+
|APPLE,BANANA,GUAVA,ORANGE|APPLE,BANANA,GUAVA,GRAPES,ORANGE,BERRY|GRAPES,BERRY|
+-------------------------+--------------------------------------+------------+

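You can split the columns on "|" as shown below:

import spark.implicits._

val df = mainDf.select("Column1","Column2").map(x => {
   // Strip everything up to "3_", then split on a literal "|" (escaped, because split takes a regex).
   val s1 = x.getString(0).replaceAll("^.*3_","").split("\\|")
   val s2 = x.getString(1).replaceAll("^.*3_","").split("\\|")
   // Column3 holds the items that appear in only one of the two columns.
   (x.getString(0),x.getString(1),s2.diff(s1).union(s1.diff(s2)))
}
).toDF("Column1","Column2","Column3")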

You can also achieve this with regexp_replace and a udf:

  • regexp_replace replaces "|" with "," and ".*3_" with ""
  • the udf derives the value of column3 from column2 and column1
  • Output:

    +-------------------------+-------------------------+------------+
    |column1                  |column2                  |column3     |
    +-------------------------+-------------------------+------------+
    |APPLE|BANANA|GUAVA|ORANGE|APPLE|BANANA|GUAVA|ORANGE|GRAPES|BERRY|
    +-------------------------+-------------------------+------------+
    
    +-------------------------+--------------------------------------+------------+
    |column1                  |column2                               |column3     |
    +-------------------------+--------------------------------------+------------+
    |APPLE,BANANA,GUAVA,ORANGE|APPLE,BANANA,GUAVA,GRAPES,ORANGE,BERRY|GRAPES,BERRY|
    +-------------------------+--------------------------------------+------------+
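
The answer above only describes the regexp_replace plus udf approach, so here is a minimal PySpark sketch of how it might look. The names `only_in_second` and `add_column3` are hypothetical, and the column names `column1`/`column2` are taken from the schema posted in the comments; this is one possible reading of the answer, not its original code.

```python
from typing import Optional


def only_in_second(first: Optional[str], second: Optional[str], sep: str = ",") -> Optional[str]:
    """Return the items of `second` that are absent from `first`, in their original order."""
    if first is None or second is None:
        return None  # propagate nulls instead of failing inside the udf
    seen = set(first.split(sep))
    return sep.join(item for item in second.split(sep) if item not in seen)


def add_column3(df):
    """Clean column1/column2 of a Spark DataFrame and derive column3 from them."""
    # Imported lazily so the pure-Python helper above works without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    diff_udf = F.udf(only_in_second, StringType())

    cleaned = df
    for name in ("column1", "column2"):
        # Drop everything up to and including "3_", then turn "|" into ",".
        stripped = F.regexp_replace(F.col(name), r"^.*3_", "")
        cleaned = cleaned.withColumn(name, F.regexp_replace(stripped, r"\|", ","))

    # column3 = items of column2 that do not occur in column1.
    return cleaned.withColumn("column3", diff_udf(F.col("column1"), F.col("column2")))
```

Calling `add_column3(df).show(truncate=False)` on the asker's DataFrame should then display all three columns as comma-separated strings. Note that `^.*3_` is greedy, so it removes everything up to the last `3_` in the value; on Spark >= 2.4 the built-in array_except avoids the udf entirely.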
    

Comments:

Can you share the schema of the DataFrame?

@koiralo |-- column1: string (nullable = true) |-- column2: string (nullable = true)

The leading SAMPLE part can differ, but it always has the 3_

How do I turn it back into a string? For column1 and column2, I want to format it as APPLE,BANANA,GUAVA,GRAPES,ORANGE.