Python: Load CSV file with Spark

I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing:

sc.textFile('file.csv') \
    .map(lambda line: (line.split(',')[0], line.split(',')[1])) \
    .collect()

I would expect this call to give me a list of the first two columns of my file, but I'm getting this error:

File "<ipython-input-60-73ea98550983>", line 1, in <lambda>
IndexError: list index out of range

although my CSV file has more than one column.

Are you sure that all the lines have at least 2 columns? Can you try something like this, just to check?

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)>1) \
    .map(lambda line: (line[0],line[1])) \
    .collect()

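Alternatively, you could print the culprit (if any):

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) <= 1) \
    .collect()

Now there is also another option for any general CSV file, pyspark-csv, used as follows:
Assume we have the following context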
sc = SparkContext
sqlCtx = SQLContext or HiveContext

First, distribute pyspark-csv.py to the executors using SparkContext
import pyspark_csv as pycsv
sc.addPyFile('pyspark_csv.py')

Read the CSV data via SparkContext and convert it to a DataFrame
plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')
dataframe = pycsv.csvToDataFrame(sqlCtx, plaintext_rdd)

Yet another option is to read the CSV file with Pandas and then import the Pandas DataFrame into Spark.
For example:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local','example')  # if using locally
sql_sc = SQLContext(sc)

pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2']) # if no header
s_df = sql_sc.createDataFrame(pandas_df)

If your CSV data happens not to contain newlines in any of the fields, you can load the data with textFile() and parse it
import csv
import io

def loadRecord(line):
    """Parse one CSV line into a dict keyed by the given field names."""
    buffer = io.StringIO(line)
    reader = csv.DictReader(buffer, fieldnames=["name1", "name2"])
    return next(reader)

# inputFile is the path to the CSV file
records = sc.textFile(inputFile).map(loadRecord)
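
The caveat about newlines can be seen without Spark. The following is a minimal plain-Python sketch with made-up sample data (the `raw` string is an assumption, not from the question): parsing line by line, which is roughly what textFile() followed by a per-line parser does, cuts a quoted field containing a newline in two, while parsing the whole stream keeps the record together.

```python
import csv
import io

# Hypothetical sample data: the second record has a newline inside a quoted field.
raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Line-by-line parsing: each physical line is parsed on its own,
# so the quoted field is split across two "records".
per_line = [next(csv.reader([line])) for line in raw.splitlines()]

# Parsing the whole stream lets the csv module track the open quote.
whole = list(csv.reader(io.StringIO(raw, newline='')))

print(len(per_line))  # 4 rows: the record with the newline was cut in two
print(len(whole))     # 3 rows: header plus two records
print(whole[1])       # ['1', 'first line\nsecond line']
```

If the data can contain such fields, a line-based approach is not safe and a reader that understands multi-line records is the better choice.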

Spark 2.0.0+
You can use the built-in csv data source directly:
spark.read.csv(
    "some_input_file.csv", 
    header=True, 
    mode="DROPMALFORMED", 
    schema=schema
)

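This works without including any external dependencies.

Spark < 2.0.0:
Instead of manual parsing, which is far from trivial in the general case, I would recommend Spark CSV:
Make sure that Spark CSV is included in the path (--packages, --jars, --driver-class-path)
and load your data as follows: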
df = (
    sqlContext
    .read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferschema", "true")
    .option("mode", "DROPMALFORMED")
    .load("some_input_file.csv")
)

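It handles loading, schema inference and dropping malformed lines, and it doesn't require passing data from Python to the JVM.
Note:
If you know the schema, it is better to avoid schema inference and pass the schema to DataFrameReader. Assuming you have three columns (integer, double and string):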
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("A", IntegerType()),
    StructField("B", DoubleType()),
    StructField("C", StringType())
])

(
    sqlContext
    .read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("some_input_file.csv")
)

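Simply splitting by comma will also split commas that are within fields (e.g. a,b,"1,2,3",c), so it's not recommended. The DataFrame-based answer is good if you want to use the DataFrames API, but if you want to stick to base Spark, you can parse CSVs in base Python with the csv module:
# works for both Python 2 and Python 3
import csv
rdd = sc.textFile("file.csv")
rdd = rdd.mapPartitions(lambda x: csv.reader(x))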
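
To make the difference concrete, here is a small sketch in plain Python, no Spark needed. The `partition` list is a hypothetical stand-in for the lines of one partition; csv.reader accepts any iterable of lines, which is why it can be handed to mapPartitions as is.

```python
import csv

# Hypothetical lines, standing in for one partition of sc.textFile("file.csv").
partition = ['a,b,"1,2,3",c', 'd,e,"4,5,6",f']

# Naive split breaks the quoted field apart.
naive = [line.split(",") for line in partition]

# csv.reader consumes an iterator of lines and respects the quoting.
parsed = list(csv.reader(iter(partition)))

print(naive[0])   # ['a', 'b', '"1', '2', '3"', 'c']
print(parsed[0])  # ['a', 'b', '1,2,3', 'c']
```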

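EDIT: As @muon mentioned in the comments, this will treat the header like any other row, so you'll need to extract it manually. For example, header = rdd.first(); rdd = rdd.filter(lambda x: x != header) (make sure not to modify header before the filter evaluates). But at this point, you're probably better off using a built-in csv parser.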
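This is in line with the suggestion to use Pandas, but with a major modification: if you read the data into Pandas in chunks, it should be more malleable. That means you can parse a much larger file than Pandas can actually handle as a single piece and pass it to Spark in smaller sizes. (This also answers the comment about why one would want to use Spark if everything can be loaded into Pandas anyway.)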
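
A minimal sketch of the chunked read, assuming pandas is installed. The in-memory `csv_text` and the chunk size of 2 are made up for illustration; with a real file you would pass its path and a much larger chunksize. The hand-over to Spark is only indicated in a comment, since no Spark session is created here.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n"

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading everything at once.
chunk_sizes = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Each chunk is a regular pandas DataFrame. Given a SparkSession named
    # `spark` (an assumption, not created here), it could be converted with
    # spark.createDataFrame(chunk) and unioned with the earlier chunks.
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [2, 2, 1]
```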
If you want to load the CSV as a DataFrame, you can do the following:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('sampleFile.csv') # this is your csv file

This worked fine for me.
If your dataset has one or more rows with fewer than 2 columns (an empty line, for example), this error may arise.
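
The failure is easy to reproduce in plain Python. This sketch uses made-up lines (`lines` and `first_two` are hypothetical names) and assumes the file has an empty line, which splits into a single empty field:

```python
# Hypothetical file contents: the last line is empty, as can happen
# with a blank row in the file.
lines = ["a,b,c", "d,e,f", ""]

def first_two(line):
    fields = line.split(",")
    return (fields[0], fields[1])  # fails when there is only one field

results, failures = [], []
for line in lines:
    try:
        results.append(first_two(line))
    except IndexError:
        failures.append(line)

print(results)   # [('a', 'b'), ('d', 'e')]
print(failures)  # ['']
```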
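I am also new to PySpark and was trying to read a CSV file. The following code worked for me:
In this code I am using a dataset from Kaggle.
1. Without mentioning the schema: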
from pyspark.sql import SparkSession  
scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example: Reading CSV file without mentioning schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",")
sdfData.show()

Now check the columns:
sdfData.columns
The output will be:
['InvoiceNo', 'StockCode','Description','Quantity', 'InvoiceDate', 'CustomerID', 'Country']

Check the data type of each column:
sdfData.schema
StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,StringType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,StringType,true),StructField(CustomerID,StringType,true),StructField(Country,StringType,true)))

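This gives a DataFrame in which every column has the data type StringType.
2. With a schema:
If you know the schema, or want to change the data type of any column in the table above, use this (say I have the following columns and want each of them to have a particular data type)
Now check the schema for the data type of each column: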
sdfData.schema

StructType(List(StructField(InvoiceNo,IntegerType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(CustomerID,DoubleType,true),StructField(Country,StringType,true)))

Edit: We can also use the following line of code, without mentioning the schema explicitly:
sdfData = scSpark.read.csv("data.csv", header=True, inferSchema=True)
sdfData.schema

The output is:
StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,DoubleType,true),StructField(CustomerID,IntegerType,true),StructField(Country,StringType,true)))

The output looks like this:
sdfData.show()

+---------+---------+--------------------+--------+--------------+----------+-------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|CustomerID|Country|
+---------+---------+--------------------+--------+--------------+----------+-------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|      2.55|  17850|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|      2.75|  17850|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|12/1/2010 8:26|      7.65|  17850|
|   536365|    21730|GLASS STAR FROSTE...|       6|12/1/2010 8:26|      4.25|  17850|
|   536366|    22633|HAND WARMER UNION...|       6|12/1/2010 8:28|      1.85|  17850|
|   536366|    22632|HAND WARMER RED P...|       6|12/1/2010 8:28|      1.85|  17850|
|   536367|    84879|ASSORTED COLOUR B...|      32|12/1/2010 8:34|      1.69|  13047|
|   536367|    22745|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22748|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22749|FELTCRAFT PRINCES...|       8|12/1/2010 8:34|      3.75|  13047|
|   536367|    22310|IVORY KNITTED MUG...|       6|12/1/2010 8:34|      1.65|  13047|
|   536367|    84969|BOX OF 6 ASSORTED...|       6|12/1/2010 8:34|      4.25|  13047|
|   536367|    22623|BOX OF VINTAGE JI...|       3|12/1/2010 8:34|      4.95|  13047|
|   536367|    22622|BOX OF VINTAGE AL...|       2|12/1/2010 8:34|      9.95|  13047|
|   536367|    21754|HOME BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21755|LOVE BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21777|RECIPE BOX WITH M...|       4|12/1/2010 8:34|      7.95|  13047|
+---------+---------+--------------------+--------+--------------+----------+-------+
only showing top 20 rows

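When using spark.read.csv, I find that the options escape='"' and multiLine=True provide the most consistent solution, and in my experience they work best with CSV files exported from Google Sheets.
That is,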
#set inferSchema=False to read everything as string
df = spark.read.csv("myData.csv", escape='"', multiLine=True,
     inferSchema=False, header=True)
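
For context, this is the kind of record those two options are meant for: a quoted field that contains a doubled quote and a line break. The sketch below only uses Python's csv module on a made-up string (`raw` is an assumption) to show how such a record is supposed to be read; it does not call Spark.

```python
import csv
import io

# Hypothetical record: the quoted field contains a doubled quote ("")
# and a line break.
raw = 'id,note\n1,"She said ""hi""\nand left"\n'

rows = list(csv.reader(io.StringIO(raw, newline='')))
print(rows[1])  # ['1', 'She said "hi"\nand left']
```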

This is in PySpark
path="Your file path with file name"

df=spark.read.format("csv").option("header","true").option("inferSchema","true").load(path)

Then you can check
df.show(5)
df.count()

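That was it, one line had just one column, thank you.
It's better to parse with the built-in csv library so that all the escaping is handled, because simply splitting by comma won't work if there are commas in the values.
There are a lot of tools to parse CSV, don't reinvent the wheel.
This code will break if there is a comma inside quotes. Parsing CSV is more complicated than just splitting at ",".
This breaks on commas. This is very bad.
Why would the OP want to do this on Spark if the data can be loaded in Pandas?
Not wanting to install or specify dependencies on every Spark cluster...
Pandas allows file chunking when reading, so there is still a use case here for letting Pandas handle the initial file parsing. See the answer with code.
Caution: Pandas also handles column schema differently from Spark, especially when blanks are involved. It is safer to just load the CSV as strings for each column.
@WoodChopper You can use Pandas as a UDF in Spark, can't you?
You don't need Hive to use DataFrames.
Regarding your solution: a) There is no need for StringIO, csv can use any iterable. b) [...] Take a look at flatMap. c) It would be much more efficient to use mapPartitions instead of initializing a reader on each line :)
Thanks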