Error while reading a CSV file in Spark-Scala
I am trying to read a CSV file in Spark using the CSV reader API, and I am currently getting an ArrayIndexOutOfBoundsException.
Validation: the code I tried is posted below.
Expected result: dataFrame.show()
Actual error:
19/03/28 10:42:51 INFO FileScanRDD: Reading File path: file:///C:/Users/testing/workspace_xxxx/abc_Reports/src/test/java/report1.csv, range: 0-10542, partition values: [empty row]
19/03/28 10:42:51 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
java.lang.ArrayIndexOutOfBoundsException: 63
at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Input data:
A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|AA|BB|CC|DD|EE|FF|GG|HH|II|JJ|KK|LL|MM|NN|OO|PP|QQ|RR|SS|TT|UU|VV|WW|XX|YY|ZZ|TGHJ|HG|EEE|ASD|EFFDCLDT|QSAS|WWW|DATIME|JOBNM|VFDCXS|REWE|XCVVCX|ASDFF
QW|8|2344|H02|1002| |1|2019-01-20|9999-12-31| |EE|2014-01-20|2014-01-20|2014-01-20|CNB22345 |IN|9|1234444| | | |10|QQ|8|BMX10290M|EWR| |.000000000|00|M |2027-01-20|2027-01-20| |.00|.00|.00|.00|2014-01-20|1901-01-01|3423.25| | | | |RE|WW| |RQ| | | | | | | | |1901-01-01|0|SED2233345 |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:34.823000000| |
You can use com.databricks.spark.csv to read the CSV file. Please find the sample code below.
import org.apache.spark.sql.SparkSession

object SparkCSVTest extends App {
  val spark = SparkSession
    .builder()
    .master("local")
    .appName("File_Streaming")
    .getOrCreate()

  val df = spark.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "false")
    .load("C:/Users/KZAPAGOL/Desktop/CSV/csvSample.csv")

  df.show()
}
CSV file used:
A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|AA|BB|CC|DD|EE|FF|GG|HH|II|JJ|KK|LL|MM|NN|OO|PP|QQ|RR|SS|TT|UU|VV|WW|XX|YY|ZZ|TGHJ|HG|EEE|ASD|EFFDCLDT|QSAS|WWW|DATIME|JOBNM|VFDCXS|REWE|XCVVCX|ASDFF
QW|8|2344|H02|1002| |1|2019-01-20|9999-12-31| |EE|2014-01-20|2014-01-20|2014-01-20|CNB22345 |IN|9|1234444| | | |10|QQ|8|BMX10290M|EWR| |.000000000|00|M |2027-01-20|2027-01-20| |.00|.00|.00|.00|2014-01-20|1901-01-01|3423.25| | | | |RE|WW| |RQ| | | | | | | | |1901-01-01|0|SED2233345 |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:34.823000000| |
With header:
+---+---+----+---+----+--------------+---+----------+----------+---+---+----------+----------+----------+--------------------+---+---+-------+---+--------+--------+---+---+---+---------+---+---+----------+---+---+----------+----------+---+---+---+---+---+----------+----------+-------+---+----------+----------+------+---+---+---+---+---+---+---+--------+-----+--------+---+---+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
| A| B| C| D| E| F| G| H| I| J| K| L| M| N| O| P| Q| R| S| T| U| V| W| X| Y| Z| AA| BB| CC| DD| EE| FF| GG| HH| II| JJ| KK| LL| MM| NN| OO| PP| QQ| RR| SS| TT| UU| VV| WW| XX| YY| ZZ| TGHJ| HG|EEE|ASD| EFFDCLDT|QSAS| WWW| DATIME| JOBNM| VFDCXS| REWE| XCVVCX|ASDFF|
+---+---+----+---+----+--------------+---+----------+----------+---+---+----------+----------+----------+--------------------+---+---+-------+---+--------+--------+---+---+---+---------+---+---+----------+---+---+----------+----------+---+---+---+---+---+----------+----------+-------+---+----------+----------+------+---+---+---+---+---+---+---+--------+-----+--------+---+---+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
| QW| 8|2344|H02|1002| | 1|2019-01-20|9999-12-31| | EE|2014-01-20|2014-01-20|2014-01-20|CNB22345 | IN| 9|1234444| | | | 10| QQ| 8|BMX10290M|EWR| |.000000000| 00| M |2027-01-20|2027-01-20| |.00|.00|.00|.00|2014-01-20|1901-01-01|3423.25| | | | | RE| WW| | RQ| | | | | | | | |1901-01-01| 0|SED2233345 |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:...| | null|
+---+---+----+---+----+--------------+---+----------+----------+---+---+----------+----------+----------+--------------------+---+---+-------+---+--------+--------+---+---+---+---------+---+---+----------+---+---+----------+----------+---+---+---+---+---+----------+----------+-------+---+----------+----------+------+---+---+---+---+---+---+---+--------+-----+--------+---+---+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
Without header:
+---+---+----+---+----+--------------+---+----------+----------+---+----+----------+----------+----------+--------------------+----+----+-------+----+--------+--------+----+----+----+---------+----+----+----------+----+----+----------+----------+----+----+----+----+----+----------+----------+-------+----+----------+----------+------+----+----+----+----+----+----+----+--------+-----+--------+----+----+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
|_c0|_c1| _c2|_c3| _c4| _c5|_c6| _c7| _c8|_c9|_c10| _c11| _c12| _c13| _c14|_c15|_c16| _c17|_c18| _c19| _c20|_c21|_c22|_c23| _c24|_c25|_c26| _c27|_c28|_c29| _c30| _c31|_c32|_c33|_c34|_c35|_c36| _c37| _c38| _c39|_c40| _c41| _c42| _c43|_c44|_c45|_c46|_c47|_c48|_c49|_c50| _c51| _c52| _c53|_c54|_c55| _c56|_c57| _c58| _c59| _c60| _c61| _c62| _c63| _c64|
+---+---+----+---+----+--------------+---+----------+----------+---+----+----------+----------+----------+--------------------+----+----+-------+----+--------+--------+----+----+----+---------+----+----+----------+----+----+----------+----------+----+----+----+----+----+----------+----------+-------+----+----------+----------+------+----+----+----+----+----+----+----+--------+-----+--------+----+----+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
| A| B| C| D| E| F| G| H| I| J| K| L| M| N| O| P| Q| R| S| T| U| V| W| X| Y| Z| AA| BB| CC| DD| EE| FF| GG| HH| II| JJ| KK| LL| MM| NN| OO| PP| QQ| RR| SS| TT| UU| VV| WW| XX| YY| ZZ| TGHJ| HG| EEE| ASD| EFFDCLDT|QSAS| WWW| DATIME| JOBNM| VFDCXS| REWE| XCVVCX|ASDFF|
| QW| 8|2344|H02|1002| | 1|2019-01-20|9999-12-31| | EE|2014-01-20|2014-01-20|2014-01-20|CNB22345 | IN| 9|1234444| | | | 10| QQ| 8|BMX10290M| EWR| |.000000000| 00| M |2027-01-20|2027-01-20| | .00| .00| .00| .00|2014-01-20|1901-01-01|3423.25| | | | | RE| WW| | RQ| | | | | | | | |1901-01-01| 0|SED2233345 |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:...| | null|
+---+---+----+---+----+--------------+---+----------+----------+---+----+----------+----------+----------+--------------------+----+----+-------+----+--------+--------+----+----+----+---------+----+----+----------+----+----+----------+----------+----+----+----+----+----+----------+----------+-------+----+----------+----------+------+----+----+----+----+----+----+----+--------+-----+--------+----+----+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
build.sbt
"com.databricks" %% "spark-csv" % "1.5.0",
"org.apache.spark" %% "spark-core" % "2.2.2",
"org.apache.spark" %% "spark-sql" % "2.2.2"
Hope this helps!

I just found the exact problem.
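Of the CSV files I was trying to read, 10 were UTF-8 encoded. Those were not the source of the problem.
Out of 13 files in total, 3 were UCS-2 encoded. Those 3 files were the ones that broke the CSV read and produced the error above.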
UTF-8 ==> Unicode Transformation Format Encoding.
UCS-2 ==> Universal Coded Character Set Encoding.
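From this I learned that the databricks CSV reader handles UTF-8 encoded files but runs into problems with UCS-2 encoded ones. So I saved the files in UTF-8 format and tried reading them again. It worked like a charm.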
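The "save as UTF-8" step can also be scripted instead of done by hand in an editor. The following is a minimal sketch, not the method used above: it assumes the UCS-2 files start with a byte-order mark (as files saved by many Windows tools do) and treats them as UTF-16, which covers UCS-2. The names `EncodingFix`, `detectCharset`, `toUtf8`, and `convertFile` are hypothetical helpers introduced only for illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Path}

object EncodingFix {
  // Guess the charset from a leading byte-order mark (BOM).
  // Assumption: files without a UTF-16 BOM are already UTF-8.
  def detectCharset(bytes: Array[Byte]): Charset = {
    def b(i: Int): Int = if (i < bytes.length) bytes(i) & 0xFF else -1
    if (b(0) == 0xFF && b(1) == 0xFE) StandardCharsets.UTF_16LE
    else if (b(0) == 0xFE && b(1) == 0xFF) StandardCharsets.UTF_16BE
    else StandardCharsets.UTF_8
  }

  // Decode with the detected charset, drop the BOM character if present,
  // and re-encode as UTF-8.
  def toUtf8(bytes: Array[Byte]): Array[Byte] = {
    val text = new String(bytes, detectCharset(bytes))
    val clean =
      if (text.nonEmpty && text.charAt(0) == '\uFEFF') text.substring(1) else text
    clean.getBytes(StandardCharsets.UTF_8)
  }

  // Read the whole input file, transcode it, and write the result to `out`.
  // Reads the file into memory, so it is only suitable for small files.
  def convertFile(in: Path, out: Path): Unit = {
    Files.write(out, toUtf8(Files.readAllBytes(in)))
    ()
  }
}
```

After running `convertFile` on each of the affected files, the reader can be pointed at the converted copies with the same options as before.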
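Feel free to add more details if there are any.

Can you share a sample of the CSV file contents you are using?
@KZapagol - Added the sample data as requested.
@Dasarathy.. I am able to read the CSV file with your sample data. Please see my updated answer.
@KZapagol Could you set header to true and try again?
@Dasarathy.. I have tried it with the header and it works fine. Please see my latest update.
I have tried the same approach. The only difference is the inferSchema part. Also, the same code works flawlessly for other CSV files.
@DasarathyDR Did my answer work for you? Are you still facing any issues?
I found that there were some discrepancies in my input business data itself. Your answer is right. Reading one particular file is where it failed for me; the remaining 3 files work seamlessly.
If the answer is correct, please accept it. I hope it helps! Thanks.

You can use the charset option to read files in other encodings. For example, to read a Shift-JIS encoded file, set the charset option with .option("charset", "Shift_JIS").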
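Charset names are easy to get subtly wrong (a space instead of an underscore, for instance). As a hedged sketch, the name can be checked against the JVM's charset registry before it is handed to the reader. This assumes the reader resolves the name through `java.nio.charset.Charset`, which is not confirmed above; `CharsetCheck` and `canonicalName` are hypothetical names used only for illustration.

```scala
import java.nio.charset.Charset
import scala.util.Try

object CharsetCheck {
  // Returns the canonical JVM name for a charset label, or None when the
  // label is syntactically illegal or unknown to this JVM.
  def canonicalName(label: String): Option[String] =
    Try(Charset.forName(label).name()).toOption
}
```

A label that comes back as `None` would likely fail inside the read as well, so it is cheaper to catch it up front; a label that resolves can be passed on as `.option("charset", name)`.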