Scala: how to concatenate multiple columns in Spark when the column names to concatenate come from another table (and differ per row)


I am trying to concatenate multiple columns in Spark using the concat function.

For example, below is the table to which I have to add the new concatenated column:

table - **t**
+---+----+  
| id|name|
+---+----+  
|  1|   a|  
|  2|   b|
+---+----+
The table below lists which columns should be concatenated for a given id (for id 1 the columns id and name need to be concatenated; for id 2, only id):

table - **r**
+---+-------+
| id|    att|
+---+-------+
|  1|id,name|
|  2|     id|
+---+-------+

If I join both tables and simply concatenate the columns, I can build a combined column, but not one driven by table r (the first row of the new column is fine, but the second row should contain only 2).

To get this right I have to apply a filter before the select, but I am not sure how to apply that filter for every row inside withColumn.

If possible, something like the example below:

t.withColumn("new",concat_ws(",",t.**filter**("id="+this.id).select("att").first.mkString.split(",").map(c => col(c)): _*)).show
since each row needs to be filtered based on its own id:

scala> t.filter("id=1").select("att").first.mkString.split(",").map(c => col(c))
res90: Array[org.apache.spark.sql.Column] = Array(id, name)

scala> t.filter("id=2").select("att").first.mkString.split(",").map(c => col(c))
res89: Array[org.apache.spark.sql.Column] = Array(id)
Here is the final required result:

+---+----+-------+---+
| id|name|  att  |new|
+---+----+-------+---+
|  1|   a|id,name|1,a|
|  2|   b|  id   |2  |
+---+----+-------+---+

This can be done with a UDF:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col, udf}

// dataFrame is assumed to be the joined table that already contains the att column.
// Collect every column (cast to string) plus the column names, in the same order.
val cols: Seq[Column] = dataFrame.columns.map(c => col(c).cast("string")).toSeq
val indices: Seq[String] = dataFrame.columns.toSeq

// For each row, read the comma-separated column names stored in "att"
// and keep only the values whose column name appears in that list.
val generateNew = udf((values: Seq[String]) => {
  val att = values(indices.indexOf("att")).split(",")
  indices.zip(values)
    .collect { case (name, value) if att.contains(name) => value }
    .mkString(";")
})

val dfColumns = array(cols: _*)
val dNew = dataFrame.withColumn("new", generateNew(dfColumns))
This is only a sketch, but the idea is that you can pass a sequence of items into a user-defined function and dynamically pick out the ones you need. Note that you could also pass other kinds of collections, such as a map, for example:
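Here is a minimal sketch of that map variant, under the same assumptions as above (dataFrame is the joined table that already contains att; the generateFromMap name is made up for illustration):

import org.apache.spark.sql.functions.{col, lit, map, udf}

// Key/value pairs: the column name as a literal key, the value cast to string
val kvPairs = dataFrame.columns.flatMap(c => Seq(lit(c), col(c).cast("string")))

// Look up the comma-separated column names stored in "att" and keep only those values
val generateFromMap = udf((row: Map[String, String]) =>
  row("att").split(",").flatMap(row.get).mkString(",")
)

val dNewFromMap = dataFrame.withColumn("new", generateFromMap(map(kvPairs: _*)))

Looking the values up by name removes the dependence on column positions that the array version has.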



We can use a UDF for this.

Requirements for this logic to work:

The column names of t should be in the same order as the column names that appear in the att column of table r.

scala> input_df_1.show
+---+----+
| id|name|
+---+----+
|  1|   a|
|  2|   b|
+---+----+

scala> input_df_2.show
+---+-------+
| id|    att|
+---+-------+
|  1|id,name|
|  2|     id|
+---+-------+

scala> val join_df = input_df_1.join(input_df_2,Seq("id"),"inner")
join_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> val req_cols = input_df_1.columns
req_cols: Array[String] = Array(id, name)

scala> def new_col_udf = udf((cols : Seq[String],row : String,attr : String) => {
     |     val row_values = row.split(",")
     |     val attrs = attr.split(",")
     |     val req_val = attrs.map{at =>
     |     val index = cols.indexOf(at)
     |     row_values(index)
     |     }
     |     req_val.mkString(",")
     |     })
new_col_udf: org.apache.spark.sql.expressions.UserDefinedFunction

scala>  val intermediate_df = join_df.withColumn("concat_column",concat_ws(",",'id,'name)).withColumn("new_col",new_col_udf(lit(req_cols),'concat_column,'att))
intermediate_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 3 more fields]

scala> val result_df = intermediate_df.select('id,'name,'att,'new_col)
result_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]

scala> result_df.show
+---+----+-------+-------+
| id|name|    att|new_col|
+---+----+-------+-------+
|  1|   a|id,name|    1,a|
|  2|   b|     id|      2|
+---+----+-------+-------+
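As a small assumed generalisation of the snippet above (not part of the original answer), concat_column can be derived from req_cols instead of hard-coding 'id and 'name, so the UDF keeps working if input_df_1 gains more columns; like the original call, this relies on Spark 2.2+ where lit accepts an Array:

import org.apache.spark.sql.functions.{col, concat_ws, lit}

// Build the comma-separated row string from req_cols, keeping the same order
// that the UDF uses for its index lookup
val concat_all = concat_ws(",", req_cols.map(col): _*)

val intermediate_df2 = join_df
  .withColumn("concat_column", concat_all)
  .withColumn("new_col", new_col_udf(lit(req_cols), col("concat_column"), col("att")))

For this example it produces the same new_col as the hard-coded version.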
Hope this answers your question.
