Apache Spark on Windows (Spyder): how to read a CSV file with PySpark

Tags: apache-spark, pyspark, databricks

I am using the following code to read a CSV file with PySpark:

import os
import sys

os.environ["SPARK_HOME"] = "D:\ProgramFiles\spark-2.1.0-bin-hadoop2.7"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.4-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

conf = SparkConf() 
conf.setMaster('local') 
conf.setAppName('test')
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)

# customSchema is defined in the asker's follow-up comment below
df = sqlContext.read.format("com.databricks.spark.csv").schema(customSchema).option("header", "true").option("mode", "DROPMALFORMED").load("iris.csv")

df.show()
The error thrown is as follows:

文件“”,第1行,在 df=sqlContext.read.format(“com.databricks.spark.csv”).schema(customSchema).option(“header”, “true”)。选项(“mode”,“dropmorformed”)。加载(“iris.csv”)

文件 “D:\ProgramFiles\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\context.py”, 第464行,已读 返回DataFrameReader(自身)

文件 “D:\ProgramFiles\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\readwriter.py”,第70行,在init self.\u jreader=spark.\u ssql\u ctx.read()

文件 “D:\ProgramFiles\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py”, 第1133行,在调用中 回答,self.gateway\u客户端,self.target\u id,self.name)

文件 “D:\ProgramFiles\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py”, 第79行,装饰风格 引发IllegalArgumentException(s.split(“:”,1)[1],stackTrace)

IllegalArgumentException:“实例化时出错 'org.apache.spark.sql.internal.SessionState':


The way of reading a CSV shown above works for Spark versions < 2.0.0.

For Spark > 2.0.0 you need to read through a SparkSession, as shown below:

spark.read.csv("some_file.csv", header=True, mode="DROPMALFORMED", schema=schema)
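Here spark is a SparkSession, which replaces the SparkConf/SparkContext/SQLContext setup from the question. A minimal sketch of creating one, reusing the local master and app name from the question:

from pyspark.sql import SparkSession

# Build (or reuse) the session object that provides spark.read
spark = (SparkSession.builder
         .master("local")
         .appName("test")
         .getOrCreate())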


Comment from the asker: this is the missing code, and iris.csv is at an absolute path:

customSchema = StructType([
    StructField("Sepal.Length", DoubleType(), True),
    StructField("Sepal.Width", DoubleType(), True),
    StructField("Petal.Length", DoubleType(), True)])

df = sqlContext.read.format("com.databricks.spark.csv").schema(customSchema).option("header", "true").option("mode", "DROPMALFORMED").load("d:\iris.csv")

Moderator note: please do not use the comment space to add code or other details; edit and update the question instead.
Or, equivalently, with the options chained explicitly on the reader:

(spark.read
 .schema(schema)
 .option("header", "true")
 .option("mode", "DROPMALFORMED")
 .csv("some_file.csv"))