Java: adding a column to a Dataset based on values from another Dataset

Tags: java, apache-spark, apache-spark-sql

I have a dataset, dsCustomer, that contains customer details with the following columns:

|customerID|idpt | totalAmount|
|customer1 | H1  |    250     |
|customer2 | H2  |    175     |
|customer3 | H3  |    4000    |
|customer4 | H3  |    9000    |
I have another dataset, dsCategory, that contains categories based on the sales amount:

|categoryID|idpt | borne_min|borne_max|
|A         |  H2 | 0        |1000     |
|B         |  H2 | 1000     |5000     |
|C         |  H2 | 5000     |7000     |
|D         |  H2 | 7000     |10000    |
|F         |  H3 | 0        |1000     |
|G         |  H3 | 1000     |5000     |
|H         |  H3 | 5000     |7000     |
|I         |  H3 | 7000     |1000000  |


I want a result that takes each customer's total amount and finds the matching category:

|customerID|idpt |totalAmount|category|
|customer1 | H1  |   250     | null   |
|customer2 | H2  |   175     | A      |
|customer3 | H3  |   4000    | G      |
|customer4 | H3  |   9000    | I      |
//udf
public static Column getCategoryAmount(Dataset ds, Column amountColumn) {
    return ds.filter(amountColumn.geq(col("borne_min"))
        .and(amountColumn.lt(col("borne_max")))).first().getAs("categoryID");
}
//code that adds the column to my dataset
dsCustomer.withColumn("category", getCategoryAmount(dsCategory, dsCustomer.col("totalAmount")));
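How can I pass a column value from the customer dataset to the UDF?

The error says that the category dataset does not contain totalAmount.

Question: how can I use map to check, for each row in dsCustomer, the values in dsCategory?

I have tried joining the two tables, but it did not work, because dsCustomer should keep the same records, with only the computed column selected from dsCategory added.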

Caused by: org.apache.spark.sql.AnalysisException: cannot resolve given input columns: [categoryID, borne_min, borne_max];;

'Filter ('totalAmount >= borne_min#220) && ('totalAmount

You have to join the two datasets. withColumn only lets you modify columns of the same dataset.

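Update: I did not have time earlier to explain in detail what I meant, so here it is. You can join the two DataFrames. In your case you need a left join, so that rows without a matching category are kept.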

from pyspark.sql import SparkSession


spark = SparkSession.builder.getOrCreate()

cust = [
    ('customer1', 'H1', 250), 
    ('customer2', 'H2', 175), 
    ('customer3', 'H3', 4000),
    ('customer4', 'H3', 9000)
]

cust_df = spark.createDataFrame(cust, ['customerID', 'idpt', 'totalAmount'])

cust_df.show()

cat = [
    ('A', 'H2', 0   , 1000),
    ('B', 'H2', 1000, 5000),
    ('C', 'H2', 5000, 7000),
    ('D', 'H2', 7000, 10000),
    ('F', 'H3', 0   , 1000),
    ('G', 'H3', 1000, 5000),
    ('H', 'H3', 5000, 7000),
    ('I', 'H3', 7000, 1000000)
]

cat_df = spark.createDataFrame(cat, ['categoryID', 'idpt', 'borne_min', 'borne_max'])

cat_df.show()

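# A left join keeps customers without a matching category (categoryID is null).
# The upper bound is exclusive (<), matching geq/lt in the question, so an
# amount equal to a shared boundary such as 1000 matches only one category.
cust_df.join(cat_df,
             (cust_df.idpt == cat_df.idpt) &
             (cust_df.totalAmount >= cat_df.borne_min) &
             (cust_df.totalAmount < cat_df.borne_max),
             how='left') \
    .select(cust_df.customerID, cust_df.idpt, cust_df.totalAmount, cat_df.categoryID) \
    .show()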
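To make explicit what a left join with a range condition computes, here is a minimal plain-Python sketch that does not use Spark. The names `Customer`, `Category` and `left_range_join` are hypothetical helpers introduced only for illustration, and the sketch assumes the half-open range from the question (`borne_min <= totalAmount < borne_max`).

```python
from typing import NamedTuple


class Customer(NamedTuple):
    customer_id: str
    idpt: str
    total_amount: int


class Category(NamedTuple):
    category_id: str
    idpt: str
    borne_min: int
    borne_max: int


def left_range_join(customers, categories):
    """Emulate a left join on idpt and borne_min <= total_amount < borne_max."""
    rows = []
    for c in customers:
        matches = [
            k.category_id
            for k in categories
            if k.idpt == c.idpt and k.borne_min <= c.total_amount < k.borne_max
        ]
        # Left-join behaviour: keep the customer even when nothing matches.
        for category_id in matches or [None]:
            rows.append((c.customer_id, c.idpt, c.total_amount, category_id))
    return rows


customers = [
    Customer("customer1", "H1", 250),
    Customer("customer2", "H2", 175),
    Customer("customer3", "H3", 4000),
    Customer("customer4", "H3", 9000),
]
categories = [
    Category("A", "H2", 0, 1000),
    Category("B", "H2", 1000, 5000),
    Category("C", "H2", 5000, 7000),
    Category("D", "H2", 7000, 10000),
    Category("F", "H3", 0, 1000),
    Category("G", "H3", 1000, 5000),
    Category("H", "H3", 5000, 7000),
    Category("I", "H3", 7000, 1000000),
]

result = left_range_join(customers, categories)
for row in result:
    print(row)
```

Each customer appears exactly once here because the category bands for a given idpt do not overlap. If they did overlap, the customer would appear once per matching band, as in a real join.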

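Comments:

I cannot join the two datasets because they have no relational key. If I simply join them, the result will be a Cartesian product.

You should join on amountColumn.geq(col("borne_min")).and(amountColumn.lt(col("borne_max"))).

I have updated the question. I posted it here because the join solution does not work for the 2 tables.

Please see my updated answer. If you like it, please upvote.

@brunelfabricetoupiopi Glad it worked for you! If it did, consider marking it as the accepted answer.

Output of the three show() calls in the answer's code: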
+----------+----+-----------+
|customerID|idpt|totalAmount|
+----------+----+-----------+
| customer1|  H1|        250|
| customer2|  H2|        175|
| customer3|  H3|       4000|
| customer4|  H3|       9000|
+----------+----+-----------+

+----------+----+---------+---------+
|categoryID|idpt|borne_min|borne_max|
+----------+----+---------+---------+
|         A|  H2|        0|     1000|
|         B|  H2|     1000|     5000|
|         C|  H2|     5000|     7000|
|         D|  H2|     7000|    10000|
|         F|  H3|        0|     1000|
|         G|  H3|     1000|     5000|
|         H|  H3|     5000|     7000|
|         I|  H3|     7000|  1000000|
+----------+----+---------+---------+

+----------+----+-----------+----------+
|customerID|idpt|totalAmount|categoryID|
+----------+----+-----------+----------+
| customer1|  H1|        250|      null|
| customer3|  H3|       4000|         G|
| customer4|  H3|       9000|         I|
| customer2|  H2|        175|         A|
+----------+----+-----------+----------+
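
The category bands share boundary values (1000, 5000, 7000), so whether the upper bound is inclusive decides whether an amount on a boundary matches one category or two. None of the sample amounts sits on a boundary, which is why the result above has one row per customer either way. The following plain-Python sketch (no Spark; `h3_bands` and `matching` are hypothetical names for illustration) shows the difference for an amount of 1000 in idpt H3.

```python
import operator

# Category bands for idpt H3, copied from dsCategory: (categoryID, borne_min, borne_max)
h3_bands = [("F", 0, 1000), ("G", 1000, 5000), ("H", 5000, 7000), ("I", 7000, 1000000)]


def matching(amount, upper_cmp):
    """Return the categories whose band contains amount, using upper_cmp for the upper bound."""
    return [name for name, lo, hi in h3_bands if lo <= amount and upper_cmp(amount, hi)]


inclusive = matching(1000, operator.le)  # upper bound included: amount <= borne_max
exclusive = matching(1000, operator.lt)  # upper bound excluded: amount < borne_max

print(inclusive)
print(exclusive)
```

With an inclusive upper bound the boundary amount falls into two bands, which in a join would duplicate the customer row. The exclusive bound used in the question (geq with lt) avoids that.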