Python: writing a "large" PySpark DataFrame to Parquet / converting it to a pandas DataFrame

python pandas pyspark

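I am trying to join many "small CSV" files (1000+ files, 6 million rows each). I am using PySpark on a fat node (memory: 128 GB, CPU: 24 cores). However, a stack overflow occurs when I try to write the resulting DataFrame to Parquet.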

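import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# conf and getfiles() are defined elsewhere and are not shown in the question
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)
bg_f = getfiles('./files')
# Name the value column after the file (base name up to the first dot)
BName = str(os.path.basename(bg_f[0]).split('.')[0])
schema = StructType([
    StructField('CataID', StringType(), True),
    StructField('Start_Block', IntegerType(), True),
    StructField('End_Block', IntegerType(), True),
    StructField(BName, IntegerType(), True)
])
temp = sqlContext.read.csv(bg_f[0], sep='\t', header=False, schema=schema)
for p in bg_f[1:]:
    BName = str(os.path.basename(p).split('.')[0])
    schema = StructType([
        StructField('CataID', StringType(), True),
        StructField('Start_Block', IntegerType(), True),
        StructField('End_Block', IntegerType(), True),
        StructField(BName, IntegerType(), True)
    ])
    cur = sqlContext.read.csv(p, sep='\t', header=False, schema=schema)
    # Outer-join each new file onto the accumulated DataFrame
    temp = temp.join(cur,
                     on=['CataID', 'Start_Block', 'End_Block'],
                     how='outer')
# Drop the key columns once all files have been joined
temp = temp.drop('CataID', 'Start_Block', 'End_Block')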

This happens because your join instruction duplicates rows and eats up memory:

temp.join(cur,
          on=['CataID', 'Start_Block', 'End_Block'],
          how='outer')
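
To illustrate the point about duplicated rows, here is a minimal plain-Python sketch (no Spark involved) of how a full outer join multiplies rows when the join keys are not unique. Whether the keys in the question's files actually repeat is not stated, so treat that as an assumption; `outer_join` and the sample data are hypothetical.

```python
from collections import defaultdict


def outer_join(left, right):
    """Full outer join of two lists of (key, value) pairs, in plain Python."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)

    rows = []
    matched = set()
    for key, lvalue in left:
        if key in right_by_key:
            matched.add(key)
            # One output row per matching pair: m left rows x n right rows.
            rows.extend((key, lvalue, rvalue) for rvalue in right_by_key[key])
        else:
            rows.append((key, lvalue, None))
    # Right-side rows whose key never appeared on the left.
    for key, rvalues in right_by_key.items():
        if key not in matched:
            rows.extend((key, None, rvalue) for rvalue in rvalues)
    return rows


# Key "a" appears twice on each side, so it alone yields 2 x 2 = 4 rows.
left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", 10), ("a", 20), ("c", 30)]
joined = outer_join(left, right)
print(len(joined))  # 6
```

Repeating such a join once per file compounds the effect, which is one way the intermediate result could grow far beyond the size of the inputs.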

If you only keep the column BName, why not select just that column right after read.csv:

temp = sqlContext.read.csv(bg_f[0], sep='\t', header=False, schema=schema).select(BName)

Then you can use:

temp = temp.union(cur)

instead of the join, and remove the duplicate rows at the end:

temp = temp.distinct()
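
Putting the steps above together, a rough sketch might look like the following. It assumes an existing SparkSession passed in as `spark`. The helper names `union_all` and `load_values` and the fixed column name `value` are illustrative and not taken from the original code. Note that `union` stacks rows rather than adding columns, so the result is a single column of values, not the wide table the join would have produced.

```python
from functools import reduce

# Assumed layout of every file: three key columns plus one integer value column.
# The column name "value" is a placeholder chosen for this sketch.
SCHEMA_DDL = "CataID STRING, Start_Block INT, End_Block INT, value INT"


def union_all(frames):
    """Fold a list of DataFrames into one by applying union() pairwise."""
    return reduce(lambda left, right: left.union(right), frames)


def load_values(spark, paths):
    """Read only the value column of each file, stack them, drop duplicates.

    `spark` is assumed to be an existing SparkSession and `paths` a
    non-empty list of tab-separated files without a header row.
    """
    frames = [
        spark.read.csv(p, sep="\t", header=False, schema=SCHEMA_DDL).select("value")
        for p in paths
    ]
    return union_all(frames).distinct()
```

With the question's file list this would be called as `result = load_values(spark, bg_f)`, after which `result.write.parquet(path)` writes the output to a location of your choice.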