CSV: Simple join of two Spark DataFrames fails with org.apache.spark.sql.AnalysisException: Cannot resolve column name


UPDATE: It turns out this has to do with how the Databricks Spark CSV reader creates the DataFrames. In the non-working example below, I use the Databricks CSV reader to read the people and address CSVs and then write the resulting DataFrames to HDFS in Parquet format.
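For reference, a minimal sketch of the flow described above, i.e. reading the CSVs with the Databricks spark-csv reader and writing the resulting DataFrames to HDFS as Parquet (Spark 1.x Java API; the paths, options, and variable names are assumptions, not the original code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CsvToParquet");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Read the CSVs with the Databricks spark-csv data source.
        // With header=true the first line supplies the column names, which is
        // exactly where a UTF-8 BOM gets glued onto the first name (see the answer below).
        DataFrame people = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("hdfs:///data/people.csv");    // assumed path

        DataFrame address = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("hdfs:///data/address.csv");   // assumed path

        // Write the resulting DataFrames back to HDFS in Parquet format.
        people.write().parquet("hdfs:///data/people.parquet");
        address.write().parquet("hdfs:///data/address.parquet");

        sc.stop();
    }
}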

I changed the code to create the DataFrames as follows (similarly for people.csv):


Contents of people:

Contents of address:


people.printSchema() results in:

root
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)

address.printSchema();
root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)


DataFrame cartJoin = address.join(people);
cartJoin.printSchema();
root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)
The Cartesian join works fine, and printSchema() produces the output shown above.

This join:

DataFrame joined = address.join(people,
address.col("addrid").equalTo(people.col("addressid")));
results in the following exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "addrid" among (addrid, city, state, zip);
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
    at org.apache.spark.sql.DataFrame.col(DataFrame.scala:558)
    at dw.dataflow.DataflowParser.testStrangeness(DataflowParser.java:36)
    at dw.dataflow.DataflowParser.main(DataflowParser.java:119)

I tried changing it so that people and address have a common key attribute (addressid) and using

address.join(people, "addressid");
but got the same result.

Any ideas?


Thanks

It turned out the problem was that the CSV files were UTF-8 with a BOM. The Databricks CSV implementation does not handle UTF-8 with a BOM. After converting the files to UTF-8 without a BOM, everything works fine.
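This also explains why the error message looks so odd ("Cannot resolve column name "addrid" among (addrid, city, state, zip)"): with header=true the three BOM bytes become part of the first header cell, so the real column name is "\uFEFFaddrid", which prints the same as "addrid" but does not equal it. A small debugging sketch (hypothetical, not from the original code) that makes the invisible prefix visible by printing each column name together with its Unicode code points:

import org.apache.spark.sql.DataFrame;

public final class ColumnNameDump {
    // Print every column name with its code points so a BOM prefix shows up.
    public static void dump(DataFrame df) {
        for (String name : df.columns()) {
            StringBuilder codePoints = new StringBuilder();
            for (int i = 0; i < name.length(); i++) {
                codePoints.append(String.format("U+%04X ", (int) name.charAt(i)));
            }
            System.out.println("'" + name + "' -> " + codePoints.toString().trim());
        }
    }
}

For a BOM-infected address.csv the first column would print roughly
'addrid' -> U+FEFF U+0061 U+0064 U+0064 U+0072 U+0069 U+0064
confirming that col("addrid") cannot match the actual column name.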

I was able to fix this with Notepad++. Under the "Encoding" menu I switched it from "Encode in UTF-8-BOM" to "Encode in UTF-8".
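If re-saving the files by hand is not practical, the same fix can be scripted; a minimal sketch (file names are assumptions) that copies a file while dropping a leading UTF-8 BOM (the bytes 0xEF 0xBB 0xBF):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StripBom {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream("address.csv"));
             OutputStream out = new BufferedOutputStream(new FileOutputStream("address-nobom.csv"))) {

            // Peek at the first three bytes; 0xEF 0xBB 0xBF is the UTF-8 BOM.
            in.mark(4);
            byte[] bom = new byte[3];
            int n = in.read(bom);
            boolean hasBom = n == 3
                    && (bom[0] & 0xFF) == 0xEF
                    && (bom[1] & 0xFF) == 0xBB
                    && (bom[2] & 0xFF) == 0xBF;
            if (!hasBom) {
                in.reset(); // no BOM, so keep the bytes we just read
            }

            // Copy the rest of the file unchanged.
            byte[] buf = new byte[8192];
            int len;
            while ((len = in.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
        }
    }
}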

Can you explain what the BOM is in this context? The BOM is the byte order mark (the Unicode character U+FEFF, which some editors write at the start of UTF-8 files).