Python: Merging Parquet files with different schemas using pandas and dask

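I have a Parquet directory of about 1,000 files with differing schemas. I want to merge all of these files into an optimal number of files by repartitioning. I use pandas with pyarrow to read each partition file from the directory, concatenate all the DataFrames, and write the result out as a single file.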

With this approach, I run into memory problems as the data size grows, and the process gets killed. So I chose a different approach.

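I first read one batch of files, merge them with concat, and write the result to a new Parquet directory. The second time around, I read the second batch of files, concatenate them into a single DataFrame, and take one record from this second merged DataFrame. I then read the first merged DataFrame back from disk and merge it with that one record from the second merged DataFrame. Finally, I use dask's to_parquet with the append option to add the new files to the Parquet folder.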

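Is this a valid Parquet file? When I read data from this Parquet dataset, will I get all of the columns, as with Parquet schema evolution? Is it similar to Spark's schema merging?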

Update:

sample.parquet - contains 1000 part files

import os
import shutil

import dask.dataframe as dd
import numpy as np
import pandas as pd


def read_files_from_path(inputPath):
   return {"inputPath": ["part-001","part-002",...,"part-100"]}


def mergeParquet(list_of_files, output_path):
   # Read one batch of part files and concatenate them into a single frame.
   dfs_list = []
   for f in list_of_files:
      df = pd.read_parquet(f, engine='pyarrow')
      dfs_list.append(df)
   df = pd.concat(dfs_list, axis=0, sort=True)
   # One record from this batch, used to carry its columns into the existing files.
   df_sample_record_df = df[2:3]

   if os.path.exists(output_path + '/_metadata'):
      files_in_output_path = getFiles(output_path)   # helper not shown
      for f in files_in_output_path:
         temp_df = pd.read_parquet(f, engine='pyarrow')
         temp_combine_df = pd.concat([temp_df, df_sample_record_df])
         # repartition() is a dask method, so wrap the pandas frame first.
         dd.from_pandas(temp_combine_df, npartitions=1) \
                .repartition(partition_size="128MB") \
                .to_parquet(output_path+"/tmp", engine='pyarrow',
                            ignore_divisions=True, append=True)
         os.remove(f)
   return df


def final_write_parquet(df, output_path):
   ddf = dd.from_pandas(df, npartitions=1).repartition(partition_size="128MB")
   if os.path.exists(output_path+"/tmp"):
      # The original used self.temp_dir here; "/tmp" is assumed to be that directory.
      ddf.to_parquet(output_path+"/tmp", engine='pyarrow',
                     ignore_divisions=True, append=True)
      files = os.listdir(output_path + "/tmp")
      for f in files:
         shutil.move(output_path+"/tmp"+"/"+f, output_path)
      # Remove the temporary directory only after every file has been moved.
      shutil.rmtree(output_path+"/tmp")
   else:
      ddf.to_parquet(output_path, engine='pyarrow', append=False)


if __name__ == "__main__":
   # inputPath, outputPath and root_dir are defined elsewhere.
   files_dict = read_files_from_path(inputPath)
   number_of_batches = 1000 // 500    # total files / batch size
   for sub_file_names in np.array_split(files_dict["inputPath"], number_of_batches):
      paths = [os.path.join(root_dir, file_name) for file_name in sub_file_names]
      mergedDF = mergeParquet(paths, outputPath)
      final_write_parquet(mergedDF, outputPath)

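Dask DataFrame assumes that all partitions have the same schema (column names and dtypes). If you want to mix different datasets that have nearly the same schema, you will need to handle that manually. Dask DataFrame does not currently provide automatic support for this.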
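One possible way to do that manual handling, sketched here with plain pandas, is to reindex every frame to the union of all column names before concatenating or handing the frames to Dask. The helper name `align_columns` and the sample columns are made up for illustration and are not part of pandas or Dask.

```python
import pandas as pd


def align_columns(frames):
    """Reindex every frame to the sorted union of all column names.

    Hypothetical helper: columns a frame lacks are added and filled with NaN.
    """
    all_columns = sorted(set().union(*(f.columns for f in frames)))
    return [f.reindex(columns=all_columns) for f in frames]


# Two small frames standing in for part files with different columns.
first = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2020-01-01"]), "A": [1.0], "B": [2.0]}
)
second = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2020-01-02"]), "E": [3.0], "F": [4.0]}
)

aligned = align_columns([first, second])
merged = pd.concat(aligned, ignore_index=True)

print(list(merged.columns))
print(merged.isna().sum().to_dict())
```

After this step every frame has the same column names in the same order. The added columns are entirely null, so their dtypes may still need an explicit cast before the frames count as having one schema.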

For the memory issue: use a pyarrow Table instead of a pandas DataFrame.

For the schema issue: you can create your own custom pyarrow schema and cast each pyarrow table to that schema.

    import pyarrow as pa
    import pyarrow.parquet as pq

    def merge_small_parquet_files(small_files, result_file):
        pyarrow_schema = get_pyarrow_schema()
        pqwriter = None
        for small_file in small_files:
            table = pq.read_table(small_file)
            if pqwriter is None:
                # Open the writer once, when the first file is processed.
                pqwriter = pq.ParquetWriter(result_file,
                                            schema=pyarrow_schema,
                                            compression='GZIP',
                                            coerce_timestamps='ms',
                                            allow_truncated_timestamps=True)
            # Cast every table to the common schema before writing it.
            table = table.cast(pyarrow_schema)
            pqwriter.write_table(table)
            del table  # release the table before reading the next file
        # Close the writer only after all files have been written.
        if pqwriter is not None:
            pqwriter.close()

    def get_pyarrow_schema():
        fields = []
        fields.append(pa.field('first_name', pa.string()))
        fields.append(pa.field('last_name', pa.string()))
        fields.append(pa.field('Id', pa.float64()))
        fields.append(pa.field('Salary', pa.float64()))
        fields.append(pa.field('Time', pa.timestamp('ms')))
        pyarrow_schema = pa.schema(fields)
        return pyarrow_schema

    if __name__ == '__main__':
        small_files = ['file1.parquet', 'file2.parquet', 'file3.parquet', 'file4.parquet']
        result_file = 'large.parquet'
        merge_small_parquet_files(small_files, result_file)

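Would you mind providing a minimal example showing what you have done so far? In particular, in what ways do the schemas differ? Are some files missing columns? Are the data types different?

Yes, the column names differ. One set of files has a timestamp plus columns A to D, a second set has a timestamp plus columns E to H, and so on.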