Java: Accessing multiple datasets in Spark

java apache-spark


I have a use case where I want to use values from another dataset. For example:

Table 1: Items

Name  | Price
------------
Apple | 10
Mango | 20
Grape | 30

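Table 2: Item quantities

Name  | Quantity
------------
Apple | 5
Mango | 2
Grape | 2

I want to calculate the total cost and prepare a final dataset.

Cost
Name  | Cost
------------
Apple | 50  (10*5)
Mango | 40  (20*2)
Grape | 60  (30*2)

How can I achieve this in Spark? Thanks for your help.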

===================

Another use case, which I also need help with:

Table 1: Items

Name    | Code | Quantity
-------------------
Apple-1 | APP  | 10
Mango-1 | MAN  | 20
Grape-1 | GRA  | 30
Apple-2 | APP  | 20
Mango-2 | MAN  | 30
Grape-2 | GRA  | 50


Table 2: Item_CODE_Price

Code | Price
----------------
APP  | 5
MAN  | 2
GRA  | 2

I want to calculate total cost using code to get the price and prepare a final dataset.

Cost
Name    | Cost
--------------
Apple-1 | 50   (10*5)
Mango-1 | 40   (20*2)
Grape-1 | 60   (30*2)
Apple-2 | 100  (20*5)
Mango-2 | 60   (30*2)
Grape-2 | 100  (50*2)

You can join the two tables on their shared Name column and use withColumn to create the new column, as shown below:

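  // Assumes an existing SparkSession named spark (as in spark-shell)
  import spark.implicits._  // enables toDF and the $"column" syntax

  // Table 1: item prices
  val df1 = spark.sparkContext.parallelize(Seq(
    ("Apple",10),
    ("Mango",20),
    ("Grape",30)
  )).toDF("Name","Price" )


  // Table 2: item quantities
  val df2 = spark.sparkContext.parallelize(Seq(
    ("Apple",5),
    ("Mango",2),
    ("Grape",2)
  )).toDF("Name","Quantity" )


  // join on Name and create the new Cost column
  val newDF = df1.join(df2, Seq("Name"))
    .withColumn("Cost", $"Price" * $"Quantity")

  newDF.show(false)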

Output:

+-----+-----+--------+----+
|Name |Price|Quantity|Cost|
+-----+-----+--------+----+
|Grape|30   |2       |60  |
|Mango|20   |2       |40  |
|Apple|10   |5       |50  |
+-----+-----+--------+----+

For the second case, you only need to join on Code and drop the columns you don't want in the final result:

// assumes df1 now holds (Name, Code, Quantity) and df2 holds (Code, Price)
val newDF = df2.join(df1, Seq("Code"))
    .withColumn("Cost", $"Price" * $"Quantity")
    .drop("Code", "Price", "Quantity")

This example is in Scala; if you need Java, it won't differ much.

Hope this helps.
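
Since the question is tagged java, here is a rough sketch of the second case (joining on Code) using the Java Dataset API. It is only an illustration and not part of the original answer: the class name ItemCost and the helper methods sampleItems, samplePrices and computeCost are made up for this example, and it assumes Spark SQL is on the classpath and that a local master is acceptable.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

// Hypothetical class name, used only for this sketch.
public class ItemCost {

    // Table 1 from the question: Name | Code | Quantity
    static Dataset<Row> sampleItems(SparkSession spark) {
        StructType schema = new StructType()
                .add("Name", DataTypes.StringType)
                .add("Code", DataTypes.StringType)
                .add("Quantity", DataTypes.IntegerType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("Apple-1", "APP", 10),
                RowFactory.create("Mango-1", "MAN", 20),
                RowFactory.create("Grape-1", "GRA", 30),
                RowFactory.create("Apple-2", "APP", 20),
                RowFactory.create("Mango-2", "MAN", 30),
                RowFactory.create("Grape-2", "GRA", 50));
        return spark.createDataFrame(rows, schema);
    }

    // Table 2 from the question: Code | Price
    static Dataset<Row> samplePrices(SparkSession spark) {
        StructType schema = new StructType()
                .add("Code", DataTypes.StringType)
                .add("Price", DataTypes.IntegerType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("APP", 5),
                RowFactory.create("MAN", 2),
                RowFactory.create("GRA", 2));
        return spark.createDataFrame(rows, schema);
    }

    // Join on Code, compute Cost = Price * Quantity, keep only Name and Cost.
    static Dataset<Row> computeCost(Dataset<Row> items, Dataset<Row> prices) {
        return items.join(prices, "Code")
                .withColumn("Cost", col("Price").multiply(col("Quantity")))
                .drop("Code", "Price", "Quantity");
    }

    public static void main(String[] args) {
        // local[*] is an assumption for running the sketch on one machine
        SparkSession spark = SparkSession.builder()
                .appName("ItemCost")
                .master("local[*]")
                .getOrCreate();

        computeCost(sampleItems(spark), samplePrices(spark)).show(false);

        spark.stop();
    }
}
```

The join is an inner join, so an item whose Code has no matching row in the price table would be dropped from the result; a left join could be used instead if such items need to be kept.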

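I guess the (30*2) in the output is just there for explanation.

Thanks @Shankar Koirala. I need some more help. I have updated my question. Could you help me with that too? Thanks for your reply.

@user2034519 Isn't it the same? Use code in the second case. Do you want (10*5) in the resulting dataframe as well?

No, in the second case I need to use the item code to look up the price in the second table (dataset). I don't want to use the item name anymore.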