Apache Spark: Column name inside a column of a DataFrame in Spark with Scala (tags: apache-spark, apache-spark-sql)


I am using Spark 2.4.3 with Scala.

My salesperson DataFrame looks like the one below. It has 54 salespersons in total; I show only 3 columns as an example.

Schema of SalesPerson table.
root
 |-- col: struct (nullable = false)
 |    |-- SalesPerson_1: string (nullable = true)
 |    |-- SalesPerson_2: string (nullable = true)
 |    |-- SalesPerson_3: string (nullable = true)

Data of the SalesPerson view

     SalesPerson_1|SalesPerson_2|SalesPerson_3
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    [Customer_1793,  Customer_202,  Customer_2461]
    [Customer_2424, Customer_130, Customer_787]
    [Customer_1061, Customer_318, Customer_706]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

My salesplace DataFrame looks like this:

Schema of salesplace
 
 root
 |-- Place: string (nullable = true)
 |-- Customer: string (nullable = true)

Data of salesplace
Place|Customer
Online| Customer_1793
Retail| Customer_1793
Retail| Customer_130
Online| Customer_130
Online| Customer_2461
Retail| Customer_2461
Online| Customer_2461

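I am trying to check which customers in the SalesPerson table are present in the SalesPlace table, with two additional columns: one showing which salesperson the customer belongs to, and one with the number of times the customer occurs in the SalesPlace table.

Expected output: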

CustomerBelongstoSalesperson|Customer     |occurance|
SalesPerson_1               |Customer_1793|2
SalesPerson_2               |Customer_130 |2 
SalesPerson_3               |Customer_2461|3
SalesPerson_2               |Customer_202 |0
SalesPerson_1               |Customer_2424|0
SalesPerson_1               |Customer_1061|0
SalesPerson_2               |Customer_318 |0
SalesPerson_3               |Customer_787 |0

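Code:

This seems a little tricky in Spark. I am not sure whether a column name can be brought into a column as a value. Can someone help me with an idea of how to do this? Thanks.

Try this: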

Load the test data provided

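import spark.implicits._ // needed for .toDS()

val data1 =
  """
    |salesperson1 |salesperson2
    |Customer_17  |Customer_202
    |Customer_24  |Customer_130
  """.stripMargin
// Turn each pipe-delimited line into a trimmed, comma-separated CSV line
val stringDS1 = data1.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("^[ \t]+|[ \t]+$", "")).mkString(","))
  .toSeq.toDS()
val df1 = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
  * +------------+------------+
  * |salesperson1|salesperson2|
  * +------------+------------+
  * |Customer_17 |Customer_202|
  * |Customer_24 |Customer_130|
  * +------------+------------+
  *
  * root
  * |-- salesperson1: string (nullable = true)
  * |-- salesperson2: string (nullable = true)
  */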
val data2 =
  """
    |Place |Customer
    |Shop  |Customer_17
    |Home  |Customer_17
    |Shop  |Customer_17
    |Home  |Customer_130
    |Shop  |Customer_202
  """.stripMargin
// Same conversion as above: pipe-delimited text to CSV lines
val stringDS2 = data2.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("^[ \t]+|[ \t]+$", "")).mkString(","))
  .toSeq.toDS()
val df2 = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS2)
df2.show(false)
df2.printSchema()
/**
  * +-----+------------+
  * |Place|Customer    |
  * +-----+------------+
  * |Shop |Customer_17 |
  * |Home |Customer_17 |
  * |Shop |Customer_17 |
  * |Home |Customer_130|
  * |Shop |Customer_202|
  * +-----+------------+
  *
  * root
  * |-- Place: string (nullable = true)
  * |-- Customer: string (nullable = true)
  */

Unpivot and left join

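import org.apache.spark.sql.functions.{count, first}

// Build one ('column name', column value) pair per column for stack()
val stringCol = df1.columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")
// Unpivot: one output row per (salesperson column, customer value)
val processedDF = df1.selectExpr(s"stack(${df1.columns.length}, $stringCol) as (Salesperson, Customer)")
processedDF.show(false)
/**
  * +------------+------------+
  * |Salesperson |Customer    |
  * +------------+------------+
  * |salesperson1|Customer_17 |
  * |salesperson2|Customer_202|
  * |salesperson1|Customer_24 |
  * |salesperson2|Customer_130|
  * +------------+------------+
  */
// Left join keeps customers with no sales place; count("Place") ignores nulls, so they get 0
processedDF.join(df2, Seq("Customer"), "left")
  .groupBy("Customer")
  .agg(count("Place").as("Occurance"), first("Salesperson").as("Salesperson"))
  .show(false)
/**
  * +------------+---------+------------+
  * |Customer    |Occurance|Salesperson |
  * +------------+---------+------------+
  * |Customer_130|1        |salesperson2|
  * |Customer_17 |3        |salesperson1|
  * |Customer_202|1        |salesperson2|
  * |Customer_24 |0        |salesperson1|
  * +------------+---------+------------+
  */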
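The aggregation above recovers the salesperson with first(...), which is fine when every customer belongs to exactly one salesperson, as in the sample data. If that cannot be guaranteed, grouping by both columns makes the pairing explicit. The following is only a sketch of that variation, not part of the original answer; the object name CountPerSalesperson, the helper countOccurrences, and the local SparkSession settings are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.count

object CountPerSalesperson {
  // Hypothetical helper: left-join the unpivoted (Salesperson, Customer) rows
  // to the places table and count the matching Place values per pair.
  // count("Place") skips nulls, so customers with no match get 0.
  def countOccurrences(unpivoted: DataFrame, places: DataFrame): DataFrame =
    unpivoted.join(places, Seq("Customer"), "left")
      .groupBy("Salesperson", "Customer")
      .agg(count("Place").as("Occurance"))

  def main(args: Array[String]): Unit = {
    // Assumed local session, for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("count-per-salesperson").getOrCreate()
    import spark.implicits._

    // Same shape as processedDF and df2 in the answer
    val unpivoted = Seq(
      ("salesperson1", "Customer_17"),
      ("salesperson2", "Customer_202"),
      ("salesperson1", "Customer_24"),
      ("salesperson2", "Customer_130")
    ).toDF("Salesperson", "Customer")

    val places = Seq(
      ("Shop", "Customer_17"),
      ("Home", "Customer_17"),
      ("Shop", "Customer_17"),
      ("Home", "Customer_130"),
      ("Shop", "Customer_202")
    ).toDF("Place", "Customer")

    countOccurrences(unpivoted, places).show(false)
    spark.stop()
  }
}
```

With this grouping, a customer listed under two salespersons would produce two rows, one per salesperson, instead of being attributed to whichever row first(...) happens to see.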

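Thank you very much, but it does not calculate the occurrence correctly. All my data is in table views. Please suggest. For occurance it shows 0 for all. I think there is some problem in the join.

Check processedDF.join(df2, Seq("Customer"), "left"). I have not used split anywhere in my answer. Assuming SalesPerson is df1 and salesplace is df2, you can use the "Unpivot and left join" part of my answer to get the result.

In the SalesPerson column it shows the value as col, col, col instead of salesperson1, salesperson2, etc. Please share your thoughts.

Is the dataset you are using the same as given in the question, with the same column names? If yes, it should work. If anything has changed, please update the description.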

Error:
The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 54 aliases but got Salesperson,Customer ;
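One possible explanation, offered as an assumption and not confirmed in the thread: in the schema from the question, all salesperson fields sit inside a single struct column named col, so df1.columns contains only "col" and the unpivot emits that name for every row. stack(n, e1, ..., ek) spreads its k expressions over n rows, so the number of output columns, and therefore of aliases, depends on both values. A minimal sketch that promotes the struct fields to top-level columns before unpivoting is shown below; the object name FlattenThenUnpivot, the helper stackExpr, and the sample rows are illustrative, not taken from the asker's data set.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

object FlattenThenUnpivot {
  // Hypothetical helper: one ('name', value) pair per column, so that
  // stack(n, ...) over n columns yields n rows of exactly two columns.
  def stackExpr(columns: Seq[String]): String = {
    val pairs = columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")
    s"stack(${columns.length}, $pairs) as (Salesperson, Customer)"
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session, for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("flatten-unpivot").getOrCreate()
    import spark.implicits._

    // Stand-in for the SalesPerson view: a single struct column named `col`
    val nested = Seq(
      ("Customer_1793", "Customer_202", "Customer_2461"),
      ("Customer_2424", "Customer_130", "Customer_787")
    ).toDF("SalesPerson_1", "SalesPerson_2", "SalesPerson_3")
      .select(struct($"SalesPerson_1", $"SalesPerson_2", $"SalesPerson_3").as("col"))

    // Promote the struct fields to top-level columns, then unpivot
    val flat = nested.select("col.*")
    val unpivoted = flat.selectExpr(stackExpr(flat.columns))
    unpivoted.show(false)

    spark.stop()
  }
}
```

After the select("col.*") step, flat.columns lists the individual SalesPerson_N fields, so the same "Unpivot and left join" steps from the answer can be applied to flat in place of df1.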