PySpark: reading a CSV with pandas, how to keep the header
I am reading a CSV with pandas' chunking feature (chunksize). It works, except that I can't keep the header. Is there a way or an option to do this? Here is the sample code:
import pyspark
import pandas as pd
sc = pyspark.SparkContext(appName="myAppName")
spark_rdd = sc.emptyRDD()
# filename: csv file
chunks = pd.read_csv(filename, chunksize=10000)
for chunk in chunks:
    spark_rdd += sc.parallelize(chunk.values.tolist())
    #print(chunk.head())
    #print(spark_rdd.toDF().show())
    #break
spark_df = spark_rdd.toDF()
spark_df.show()
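I ended up using Databricks' spark-csv package:
import pyspark
sc = pyspark.SparkContext()
sql = pyspark.SQLContext(sc)
# header='true' takes the column names from the first line of the file
df = sql.read.load(filename,
                   format='com.databricks.spark.csv',
                   header='true',
                   inferSchema='true')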
Try this:
import pyspark
import pandas as pd
sc = pyspark.SparkContext(appName="myAppName")
spark_rdd = sc.emptyRDD()
# Read ten rows to get column names
x = pd.read_csv(filename,nrows=10)
mycolumns = list(x)
# filename: csv file
chunks = pd.read_csv(filename, chunksize=10000)
for chunk in chunks:
    spark_rdd += sc.parallelize(chunk.values.tolist())
spark_df = spark_rdd.map(lambda x: tuple(x)).toDF(mycolumns)
spark_df.show()
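For reading the header, wouldn't
x = pd.read_csv(filename, nrows=1)
be enough? I agree it is arbitrary: as long as you read at least one row, it doesn't really matter whether you read 1, 5, or 10 rows.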
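To illustrate the point about how many rows are needed, here is a minimal pandas-only sketch. It assumes a small in-memory CSV (the `csv_text` string and its column names are made up for the example and stand in for `filename`). It shows that `nrows=0` is enough to get the column names, and that each chunk already carries them in `chunk.columns`; the names only disappear at the `chunk.values.tolist()` step.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
csv_text = "id,name,score\n1,ann,3.5\n2,bob,4.0\n3,cy,2.5\n"

# nrows=0 parses only the header line and returns an empty DataFrame,
# so no data rows have to be read to learn the column names.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()

rows = []
chunk_columns = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Each chunk is a DataFrame that still has the column labels.
    chunk_columns.append(chunk.columns.tolist())
    # .values.tolist() keeps only the cell values, which is where the header is lost.
    rows.extend(chunk.values.tolist())

print(header)
print(len(rows))
```

With the names in hand, they can be passed to `toDF(...)` as in the answer above, without a separate read of ten rows.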