Simple join of two Spark DataFrames fails with org.apache.spark.sql.AnalysisException: cannot resolve column name
UPDATE: It turns out this is related to the way the Databricks Spark CSV reader creates DataFrames. In the non-working example below, I read the people and address CSVs with the Databricks CSV reader and then wrote the resulting DataFrames to HDFS in Parquet format. I changed the code to create the DataFrames like this: (and similarly for people.csv)
Contents of people:
Contents of address:
Result of people.printSchema():
root
|-- first: string (nullable = true)
|-- last: string (nullable = true)
|-- addressid: integer (nullable = true)
address.printSchema();
root
|-- addrid: integer (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- zip: integer (nullable = true)
DataFrame cartJoin = address.join(people);
cartJoin.printSchema();
root
|-- addrid: integer (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- zip: integer (nullable = true)
|-- first: string (nullable = true)
|-- last: string (nullable = true)
|-- addressid: integer (nullable = true)
The Cartesian join works fine, as the cartJoin.printSchema() output above shows.
However, this join:
DataFrame joined = address.join(people,
address.col("addrid").equalTo(people.col("addressid")));
results in the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "addrid" among (addrid, city, state, zip);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:558)
at dw.dataflow.DataflowParser.testStrangeness(DataflowParser.java:36)
at dw.dataflow.DataflowParser.main(DataflowParser.java:119)
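The message looks contradictory because "addrid" appears right there in the list of available columns. A plausible explanation (my own minimal sketch, not code from the original post) is that when a header line starts with a UTF-8 BOM, the first column name silently begins with the zero-width character U+FEFF, so it prints exactly like "addrid" but never compares equal to it:

```java
public class BomColumnDemo {
    public static void main(String[] args) {
        // Column name as it would be parsed from a UTF-8-with-BOM header line:
        // the BOM decodes to the zero-width character U+FEFF.
        String parsed = "\uFEFFaddrid";
        String requested = "addrid";

        // These two lines look identical when printed...
        System.out.println("parsed column:    " + parsed);
        System.out.println("requested column: " + requested);

        // ...but the strings are not equal, so column resolution fails.
        System.out.println("equal? " + parsed.equals(requested));
        System.out.println("first char of parsed: U+"
                + Integer.toHexString(parsed.charAt(0)).toUpperCase());
    }
}
```

This is consistent with the BOM diagnosis in the accepted fix below.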
I tried changing things so that people and address share a common key attribute (addressid) and used
address.join(people, "addressid");
but got the same result.
Any ideas?
Thanks.

It turned out the problem was that the CSV files were UTF-8 with a BOM. The Databricks CSV implementation does not handle UTF-8 with a BOM; after converting the files to UTF-8 without a BOM, everything worked. I was able to fix this with Notepad++: under the Encoding menu, I switched from "Encode in UTF-8-BOM" to "Encode in UTF-8".

Could you explain what a BOM is in this context? A BOM is a byte order mark.
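As a scriptable alternative to the Notepad++ fix, a small utility along these lines (my own sketch, not from the original post; the class and method names are made up) copies a file while dropping a leading UTF-8 BOM if one is present. It reads the whole file into memory, which is fine for modest CSV sizes:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class StripBom {
    // The UTF-8 encoding of the byte order mark U+FEFF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Copies {@code in} to {@code out}, dropping a leading UTF-8 BOM if present. */
    public static void stripBom(Path in, Path out) throws IOException {
        byte[] bytes = Files.readAllBytes(in);
        int offset = (bytes.length >= 3
                && bytes[0] == UTF8_BOM[0]
                && bytes[1] == UTF8_BOM[1]
                && bytes[2] == UTF8_BOM[2]) ? 3 : 0;
        Files.write(out, Arrays.copyOfRange(bytes, offset, bytes.length));
    }
}
```

Running the CSVs through something like this before handing them to the CSV reader avoids the invisible U+FEFF prefix on the first column name.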