Python: How to elegantly create a PySpark DataFrame from a CSV file and convert it to a Pandas DataFrame?

python pandas pyspark

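I have a CSV file that I want to read into an RDD or a DataFrame. This works so far, but if I collect the data and convert it to a Pandas DataFrame for printing, the table comes out "malformed".

Here is how I read the CSV file: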

NUMERIC_DATA_FILE = os.path.join(DATA_DIR, "train_numeric.csv")
numeric_rdd = sc.textFile(NUMERIC_DATA_FILE)
numeric_rdd = numeric_rdd.mapPartitions(lambda x: csv.reader(x, delimiter=","))
numeric_df = sqlContext.createDataFrame(numeric_rdd)
numeric_df.registerTempTable("numeric")

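The result looks like this:

Is there a simple way to properly set the first row of the CSV data as the column names and the first column as the index?

The problem gets worse when I try to select data from the DataFrame: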
numeric_df.select("SELECT Id FROM numeric")

This gives me:
AnalysisException: u"cannot resolve 'SELECT Id FROM numeric' given input columns _799, _640, _963, _70, _364, _143, _167, 
_156, _553, _835, _780, _235, ...

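Setting a schema for the PySpark DataFrame
Your PySpark DataFrame does not have a schema assigned. You should replace your code with the snippet below: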
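import csv
import os
from pyspark.sql.types import StructType, StructField, FloatType

NUMERIC_DATA_FILE = sc.textFile(os.path.join(DATA_DIR, "train_numeric.csv"))

# Extract the header line (a single comma-separated string)
header = NUMERIC_DATA_FILE.first()

# Assuming that all the columns are numeric, create a StructField for each column name.
# The header must be split first; iterating over the string itself yields single characters.
fields = [StructField(field_name, FloatType(), True) for field_name in header.split(",")]

# Now we can build our schema
schema = StructType(fields)

# We have to remove the header from the text-file RDD

# Extracting the header (first line) from the RDD
dataHeader = NUMERIC_DATA_FILE.filter(lambda x: "Id" in x)

# Extract the data without headers. We can make use of the `subtract` function
dataNoHeader = NUMERIC_DATA_FILE.subtract(dataHeader)

# Parse each line, then convert the string cells to float so they match FloatType
# (empty cells are assumed to be missing values and become None)
parsed_rdd = dataNoHeader.mapPartitions(lambda x: csv.reader(x, delimiter=","))
numeric_temp_rdd = parsed_rdd.map(lambda row: [float(v) if v else None for v in row])

# The schema is passed as a parameter to the createDataFrame() function
numeric_df = sqlContext.createDataFrame(numeric_temp_rdd, schema)
numeric_df.registerTempTable("numeric")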
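
The header returned by first() is one comma-separated string, so the column names come from splitting it, and csv.reader yields string cells that need converting before they fit a FloatType column. A minimal plain-Python sketch of these two steps follows; the sample lines, the column name feature_a, and the helper to_float_row are hypothetical and only stand in for the real file:

```python
import csv

# Hypothetical sample lines standing in for the contents of train_numeric.csv
lines = ["Id,feature_a,Response", "4,0.03,0", "6,,0"]

header = lines[0]

# Iterating over the string itself would yield single characters ("I", "d", ",", ...),
# so split on the delimiter to get one name per column
field_names = header.split(",")


def to_float_row(row):
    """Convert every cell to float; empty cells become None (assumed to mean 'missing')."""
    return [float(value) if value else None for value in row]


# Drop the header line, parse the remaining lines, and convert the cells
data_lines = [line for line in lines if line != header]
rows = [to_float_row(row) for row in csv.reader(data_lines, delimiter=",")]

print(field_names)
print(rows)
```

The same conversion can be applied per row inside an RDD map() before the rows are handed to createDataFrame().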

Now, if you want to convert this DataFrame to a Pandas DataFrame, use the toPandas() function:
pandas_df = numeric_df.limit(5).toPandas()
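
The question also asks for the first column to become the index. On the Pandas side this could be done with set_index(); the sketch below uses a small hand-built DataFrame as a hypothetical stand-in for the result of toPandas(), with an invented feature_a column:

```python
import pandas as pd

# Hypothetical stand-in for the result of numeric_df.limit(5).toPandas()
pandas_df = pd.DataFrame(
    {"Id": [4.0, 6.0], "feature_a": [0.03, None], "Response": [0.0, 0.0]}
)

# Use the Id column as the row index instead of the default 0..n-1 range
indexed_df = pandas_df.set_index("Id")

print(indexed_df)
```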

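The following statement also works: numeric_df.select("Id")
If you want to use pure SQL, you need to query the table through the SQLContext, for example sqlContext.sql("SELECT Id FROM numeric")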
Comments:

Try to read your CSV file. Alternatively, you can use pd.read_csv(...) to read your CSV file directly into a Pandas DataFrame, if it fits into your RAM.

This looks like data from the Bosch Production Line Performance () kaggle competition. You will need a lot of RAM to hold the data...

@ShivamGaur 1.8 TB should do :P

Hm, I can't get it to run. I still get the error for the select() call, and numeric_df.take(5) throws a TypeError, something like "unhashable type: 'list'". I don't know where this happens.

numeric_df.select("Id") should work. Corrected the syntax. ^ This style is similar to SQLAlchemy. If you wish to use pure SQL statements, then you need to use the SQL context.
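
For data that fits into memory, the pd.read_csv(...) route from the comments could look roughly like the sketch below. The in-memory CSV text and the feature_a column are hypothetical; with the real file you would pass its path instead of the StringIO object:

```python
import io

import pandas as pd

# Hypothetical CSV content; the real data would come from train_numeric.csv
csv_text = "Id,feature_a,Response\n4,0.03,0\n6,,0\n"

# The first row is used as the column names by default;
# index_col makes the Id column the row index
df = pd.read_csv(io.StringIO(csv_text), index_col="Id")

print(df)
```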