Scala left anti join in Spark?

I have defined the following two tables:

import java.text.SimpleDateFormat
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val tableName = "table1"
val tableName2 = "table2"

val format = new SimpleDateFormat("yyyy-MM-dd")
val data = List(
  List("mike", 26, true),
  List("susan", 26, false),
  List("john", 33, true)
)
val data2 = List(
  List("mike", "grade1", 45, "baseball", new java.sql.Date(format.parse("1957-12-10").getTime)),
  List("john", "grade2", 33, "soccer", new java.sql.Date(format.parse("1978-06-07").getTime)),
  List("john", "grade2", 32, "golf", new java.sql.Date(format.parse("1978-06-07").getTime)),
  List("mike", "grade2", 26, "basketball", new java.sql.Date(format.parse("1978-06-07").getTime)),
  List("lena", "grade2", 23, "baseball", new java.sql.Date(format.parse("1978-06-07").getTime))
)

val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_))
val rdd2 = sparkContext.parallelize(data2).map(Row.fromSeq(_))

val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("isBoy", BooleanType, false)
))
val schema2 = StructType(Array(
  StructField("name", StringType, true),
  StructField("grade", StringType, true),
  StructField("howold", IntegerType, true),
  StructField("hobby", StringType, true),
  StructField("birthday", DateType, false)
))

val df = sqlContext.createDataFrame(rdd, schema)
val df2 = sqlContext.createDataFrame(rdd2, schema2)
df.createOrReplaceTempView(tableName)
df2.createOrReplaceTempView(tableName2)
I am trying to build a query that returns the rows of table1 that have no matching row in table2. I have tried the following query:

Select * from table1 LEFT JOIN table2 ON table1.name = table2.name AND table1.age = table2.howold AND table2.name IS NULL AND table2.howold IS NULL
But this just returns all rows from table1:

List({"name": "john", "age": 33, "isBoy": true},
{"name": "susan", "age": 26, "isBoy": false},
{"name": "mike", "age": 26, "isBoy": true})

How can I make this type of join work efficiently in Spark?


I'm looking for an SQL query because I need to be able to specify which columns to compare between the two tables, rather than comparing row by row as in the other recommended questions. Something like using subtract, except, etc.

You can use the built-in function
except
(I would have used the code you provided, but you didn't include the imports, so I couldn't just copy/paste it :()
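As a rough illustration of what except computes: it keeps the rows of the left side that do not appear in the right side, i.e. a set difference over whole rows, so you project the columns you want to compare first. A minimal sketch in plain Scala (hypothetical key tuples, no Spark required):

```scala
// table1 keys projected to (name, age); table2 keys projected to (name, howold)
val leftKeys  = Set(("mike", 26), ("susan", 26), ("john", 33))
val rightKeys = Set(("mike", 26), ("john", 33), ("john", 32), ("mike", 45), ("lena", 23))

// `except` behaves like a set difference on the projected rows:
val missing = leftKeys.diff(rightKeys)
assert(missing == Set(("susan", 26)))
```

With DataFrames the analogous call would be df.select("name", "age").except(df2.select("name", "howold")); Spark's except matches the projected columns positionally, so the column names need not agree.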

You can use the "left anti" join type, either with the DataFrame API or with SQL (the DataFrame API supports everything SQL supports, including any join condition you need):

DataFrame API:

df.as("table1").join(
  df2.as("table2"),
  $"table1.name" === $"table2.name" && $"table1.age" === $"table2.howold",
  "leftanti"
)
SQL:

SELECT * FROM table1 LEFT ANTI JOIN table2 ON table1.name = table2.name AND table1.age = table2.howold
Note: it's also worth mentioning that there's a shorter, more concise way to create the sample data without specifying the schemas separately, using tuples and the implicit
toDF
method, and then "fixing" the automatically-inferred schema where needed:

import spark.implicits._
import org.apache.spark.sql.types.DateType
val df = List(
  ("mike", 26, true),
  ("susan", 26, false),
  ("john", 33, true)
).toDF("name", "age", "isBoy")

val df2 = List(
  ("mike", "grade1", 45, "baseball", new java.sql.Date(format.parse("1957-12-10").getTime)),
  ("john", "grade2", 33, "soccer", new java.sql.Date(format.parse("1978-06-07").getTime)),
  ("john", "grade2", 32, "golf", new java.sql.Date(format.parse("1978-06-07").getTime)),
  ("mike", "grade2", 26, "basketball", new java.sql.Date(format.parse("1978-06-07").getTime)),
  ("lena", "grade2", 23, "baseball", new java.sql.Date(format.parse("1978-06-07").getTime))
).toDF("name", "grade", "howold", "hobby", "birthday").withColumn("birthday", $"birthday".cast(DateType))
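For intuition, a left anti join on these columns amounts to the following filter, shown here over plain Scala collections with the same sample keys (no Spark required):

```scala
val table1 = List(("mike", 26), ("susan", 26), ("john", 33))                             // (name, age)
val table2 = List(("mike", 45), ("john", 33), ("john", 32), ("mike", 26), ("lena", 23))  // (name, howold)

// Keep the table1 rows for which no table2 row satisfies the join condition:
val anti = table1.filterNot { case (name, age) =>
  table2.exists { case (name2, howold) => name == name2 && age == howold }
}
assert(anti == List(("susan", 26)))  // mike/26 and john/33 both have matches in table2
```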

In SQL, you can simply turn your query into the one below (not sure whether it works in Spark):

Select * from table1 LEFT JOIN table2 ON table1.name = table2.name AND table1.age = table2.howold where table2.name IS NULL

This will return all rows from table1 for which the join finds no match.
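The reason the placement of the null check matters can be sketched in plain Scala: a left join keeps every left row paired with an optional match, and the IS NULL filter afterwards keeps only the unmatched ones (hypothetical key tuples, no Spark required):

```scala
val table1 = List(("mike", 26), ("susan", 26), ("john", 33))   // (name, age)
val table2 = List(("mike", 26), ("john", 33))                  // (name, howold)

// LEFT JOIN: every table1 row, paired with an optional matching table2 row.
val leftJoined = table1.map { row =>
  (row, table2.find { case (n, h) => n == row._1 && h == row._2 })
}

// WHERE table2.name IS NULL: keep only the rows that found no match.
val unmatched = leftJoined.collect { case (row, None) => row }
assert(unmatched == List(("susan", 26)))
```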

You can use left anti:

dfRcc20.as("a").join(dfClientesDuplicados.as("b"),
  col("a.eteerccdiid") === col("b.eteerccdiid") &&
    col("a.eteerccdinr") === col("b.eteerccdinr"),
  "left_anti")
Left anti join in Spark Java datasets:

A left anti join returns all rows from the first dataset that do not have a match in the second dataset.

Example with code:

/* Read data from Employee.csv */
Dataset<Row> employee = sparkSession.read().option("header", "true")
                .csv("C:\\Users\\Desktop\\Spark\\Employee.csv");
employee.show();

/* Read data from Employee1.csv */
Dataset<Row> employee1 = sparkSession.read().option("header", "true")
                .csv("C:\\Users\\Desktop\\Spark\\Employee1.csv");
employee1.show();

/* Apply left anti join */
Dataset<Row> leftAntiJoin = employee.join(employee1, employee.col("name").equalTo(employee1.col("name")), "leftanti");
leftAntiJoin.show();
Output:

1) Employee dataset
+-------+--------+-------+
|   name| address| salary|
+-------+--------+-------+
|   Arun|  Indore|    500|
|Shubham|  Indore|   1000|
| Mukesh|Hariyana|  10000|
|  Kanha|  Bhopal| 100000|
| Nandan|Jabalpur|1000000|
|   Raju|  Rohtak|1000000|
+-------+--------+-------+

2) Employee1 dataset
+-------+--------+------+
|   name| address|salary|
+-------+--------+------+
|   Arun|  Indore|   500|
|Shubham|  Indore|  1000|
| Mukesh|Hariyana| 10000|
+-------+--------+------+

3) Applied leftanti join and final data
+------+--------+-------+
|  name| address| salary|
+------+--------+-------+
| Kanha|  Bhopal| 100000|
|Nandan|Jabalpur|1000000|
|  Raju|  Rohtak|1000000|
+------+--------+-------+

Comments:

I'm looking for an SQL query because I need to be able to specify which columns to compare between the two tables, rather than comparing row by row.
Possible duplicate; based on your edit and your comments on my answer, I think you are looking for:
Worth noting @Interfector's comment on the first answer. cogroup should work as well. Another example:
Please try the query below: "Select * from table1 left join table2 on table1.name = table2.name and table1.age = table2.howold where table2.name is null and table2.howold is null"
This will not work: the where clause is applied before the join operation, so it will not have the desired effect.
@Hafthor: this will work, and the optimizer typically produces the same left anti join plan.