Apache Spark: error when exporting a Spark SQL DataFrame to CSV
To understand how to export a Spark SQL DataFrame from Python, I referred to the following link. My code:
df = sqlContext.createDataFrame(routeRDD, ['Consigner', 'AverageScore', 'Trips'])
df.select('Consigner', 'AverageScore', 'Trips').write.format('com.databricks.spark.csv').options(header='true').save('file:///opt/BIG-DATA/VisualCargo/output/top_consigner.csv')
I submit the job with spark-submit, passing the following JARs on the master URL:
spark-csv_2.11-1.5.0.jar, commons-csv-1.4.jar
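For reference, a minimal sketch of the spark-submit invocation (the script name and jar paths are hypothetical placeholders; adjust them to your environment):

```shell
# Hypothetical launch command; jar locations and the driver script name
# are placeholders, not taken from the original question.
spark-submit \
  --master yarn \
  --jars /opt/jars/spark-csv_2.11-1.5.0.jar,/opt/jars/commons-csv-1.4.jar \
  export_routes.py
```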
I get the following error:
df.select('Consigner', 'AverageScore', 'Trips').write.format('com.databricks.spark.csv').options(header='true').save('file:///opt/BIG-DATA/VisualCargo/output/top_consigner.csv')
File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 332, in save
File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o156.save.
: java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at com.databricks.spark.csv.util.CompressionCodecs$.<init>(CompressionCodecs.scala:29)
at com.databricks.spark.csv.util.CompressionCodecs$.<clinit>(CompressionCodecs.scala)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:198)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:170)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Spark 1.5.0-cdh5.5.1 is built with Scala 2.10, the default Scala version for Spark < 2.0. Your spark-csv, however, is built with Scala 2.11: spark-csv_2.11-1.5.0.jar.
Please either update spark-csv to a Scala 2.10 build or update Spark to Scala 2.11. You can tell the Scala version from the number after the artifactId, i.e. spark-csv_2.10-1.5.0 is built for Scala 2.10.

I was running Spark on Windows and faced a similar problem of not being able to write files (CSV or Parquet). After reading more on the Spark website, I found that the error was caused by the winutils version I was using. I changed it to the 64-bit build and it worked. Hope this helps someone.
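The artifactId convention described above can be checked mechanically. A small sketch (the helper function is mine for illustration, not part of any Spark API):

```python
import re

def scala_version_of(artifact_name):
    """Return the Scala binary version encoded in a Spark artifact name.

    By convention, the number after the underscore in the artifactId is
    the Scala version the artifact was built for, e.g.
    spark-csv_2.10-1.5.0 is built for Scala 2.10.
    """
    m = re.search(r'_(\d+\.\d+)-', artifact_name)
    return m.group(1) if m else None

print(scala_version_of('spark-csv_2.11-1.5.0.jar'))  # 2.11: mismatches Spark 1.5.0's Scala 2.10
print(scala_version_of('spark-csv_2.10-1.5.0.jar'))  # 2.10: matches
```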
Comments:
- To me this looks like a version conflict, possibly in some dependency of the CSV writer. – LiMuBei
- @LiMuBei A Scala version conflict? Spark version: 1.5.0-cdh5.5.1 – Hardik
- @Hardik Yes, so it is a conflict. Please downgrade spark-csv to a Scala 2.10 build.
- @Hardik The order has to be different: first repartition, then write, as I stated in my previous comment.
- @Hardik That is how HadoopFileFormat, which Spark uses to write files, works: there will be one file per partition inside the output folder.
- @Hardik If you write to normal storage, you can use the standard Java file API or the mv command. On HDFS you can use hdfs dfs -mv or the Hadoop file API.