How to read a csv file in pyspark?


I am trying to read a csv file using pyspark, but it shows some errors. Can you tell me the correct way to read a csv file?

Python code:

from pyspark.sql import *
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
I also tried the following:

sqlContext = SQLContext
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
Error:

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
NameError: name 'spark' is not defined

and

Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
AttributeError: type object 'SQLContext' has no attribute 'load'

The simplest way to read a csv in pyspark is to use Databricks' spark-csv module:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
You can also read the file as plain text and split each line on the delimiter:

reader = sc.textFile("file.csv").map(lambda line: line.split(","))
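One caveat with this approach (my observation, not part of the original answer): a plain `line.split(",")` mis-parses quoted fields that contain commas. Python's standard `csv` module handles quoting correctly, as this small local sketch shows:

```python
import csv
import io

# A csv snippet with a quoted field that contains a comma
data = 'id,name,notes\n1,"Doe, John",ok\n'

# Naive split breaks the quoted field into two pieces
naive = data.splitlines()[1].split(",")
print(naive)   # ['1', '"Doe', ' John"', 'ok'] - four columns instead of three

# csv.reader respects the quoting and returns three columns
rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['1', 'Doe, John', 'ok']
```

If your data may contain quoted delimiters, prefer a real csv parser (or Spark's csv reader) over splitting strings by hand.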

First, you need to create a SparkSession as below:

from pyspark.sql import SparkSession


spark = SparkSession.builder.master("yarn").appName("MyApp").getOrCreate()
Your csv needs to be on hdfs, then you can read it with spark.read.csv:

df = spark.read.csv('/tmp/data.csv', header=True)

where /tmp/data.csv is located on hdfs. There also seems to be an import problem. Please see, I got this error - AttributeError: 'property' object has no attribute 'format'