
Python: storing Parquet file column partitions in different files

Tags: python, pandas, parquet, pyarrow, apache-arrow

I want to store a tabular dataset in Parquet format, using different files for different groups of columns. Is it possible to partition a Parquet file by column? If so, can it be done with Python (pyarrow)?

I have a large dataset that collects attributes/features (columns) of many objects (rows). The number of rows is around 100k-1M (and will grow over time). The columns, by contrast, are logically divided into 200 groups of 200-1000 columns each. The total number of columns is fixed, but their data arrives sequentially, starting with column group 1, then column group 2, and so on. However, the column names, types, and counts are unknown until the first batch of data for a given column group has been received.

Data will be collected over time, and as it arrives I want to store this growing set of columns in Parquet. Eventually all column groups will be filled with data. Over time, new objects (rows) will also arrive; their data always starts with column group 1 and gradually fills in the other groups.

Is it possible (or advisable) to store this data as a single logical Parquet file that is split across multiple files on the filesystem, where each file contains one column group (200-1000 columns)? Can anyone provide an example of storing such a file using python/pandas/pyarrow?


Alternatively, each column group could be stored as a separate logical Parquet file. In that case all files would have an object_id index column, but each Parquet file (one per column group) would contain a different subset of the objects. Any ideas or suggestions would be much appreciated.
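
For illustration, a minimal sketch of that alternative layout (the file names and column groups here are hypothetical): each column group is written to its own Parquet file keyed by object_id, and the groups are reassembled with an outer join so that files covering different object subsets still line up.

import pandas as pd

# Hypothetical column groups, each keyed by an object_id index.
group1 = pd.DataFrame({"object_id": [1, 2, 3],
                       "feat_a": [0.1, 0.2, 0.3]}).set_index("object_id")
group2 = pd.DataFrame({"object_id": [1, 3],          # different object subset
                       "feat_b": ["x", "z"]}).set_index("object_id")

# One logical Parquet file per column group.
group1.to_parquet("colgroup1.parquet")
group2.to_parquet("colgroup2.parquet")

# Reassemble on read; the outer join tolerates missing objects per group.
joined = pd.read_parquet("colgroup1.parquet").join(
    pd.read_parquet("colgroup2.parquet"), how="outer")
print(joined)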

A Parquet file has a single schema. Even with multiple partitions, every partition has the same schema, which lets tools read the files as if they were a single file.
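
For example (a minimal sketch, assuming the partitioned dataset 'output.parquet' written in the code further below already exists), you can confirm that all pieces share one schema and read them back as a single table:

import pyarrow.parquet as pq

# All partition files share the dataset-level schema ...
dataset = pq.ParquetDataset('output.parquet')
print(dataset.schema)

# ... so the whole directory reads back as one table.
table = dataset.read()
print(table.num_rows)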

If the data coming in from the pandas side keeps changing, writing to the Parquet dataset will not work, because its schema will differ from the schema the existing files already have.

For a data pipeline to make this work, you need to consider at least the following:

1. Collect all columns, together with their data types and the column order (a sketch of this step follows this list).

2. Format the dataframe so that it contains all of those columns, with the specified data types and in the specified order.

3. Write to Parquet.
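
A minimal sketch of step 1 (the running registry all_dtypes and the helper register_columns are hypothetical names, introduced only for illustration):

import pandas as pd

all_dtypes = {}  # column name -> dtype, recorded in arrival order

def register_columns(df):
    # Record each column's dtype the first time it is seen,
    # so the full schema accumulates as new column groups arrive.
    for col in df.columns:
        all_dtypes.setdefault(col, df[col].dtype)

register_columns(pd.DataFrame({"object_id": [1], "feat_a": [0.1]}))
print(all_dtypes)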

See the code below for more detail on the possible failures.

df = pd.DataFrame({"Date":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000},"Day":{"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8},"Year":{"0":2018,"1":2018,"2":2018,"3":2018,"4":2018,"5":2018,"6":2018,"7":2018},"Month":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":1,"7":1},"randNumCol":{"0":2,"1":5,"2":4,"3":3,"4":3,"5":5,"6":4,"7":3},"uuid":{"0":"456578af-8953-4cf7-ac27-70309353b72c","1":"df6a30da-619e-4594-a051-4fdb3572eb49","2":"7cfe724a-a827-47b1-a691-c741f4f1101d","3":"f1796ed1-f7ce-4b49-ba64-6aacdca02c0a","4":"827e4aae-1214-4c0f-ac7f-9439e8a577af","5":"08dc3c2b-b75c-4ac6-8a38-0a44007fdeaf","6":"54f4e7bb-6fd8-4913-a2c3-69ebc13dc9a2","7":"eda1dbfe-ad08-4067-b064-bcc689fa0225"},"NEWCOLUMN":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000}})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table,root_path='output.parquet',partition_cols=['Year','Month','Day'])
# Read table: OK
pandas_df=pd.read_parquet('output.parquet')
print(pandas_df)

# Second table: same exact columns in the same order -> OK
df = pd.DataFrame({"Date":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000},"Day":{"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8},"Year":{"0":2018,"1":2018,"2":2018,"3":2018,"4":2018,"5":2018,"6":2018,"7":2018},"Month":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":1,"7":1},"randNumCol":{"0":2,"1":5,"2":4,"3":3,"4":3,"5":5,"6":4,"7":3},"uuid":{"0":"456578af-8953-4cf7-ac27-70309353b72c","1":"df6a30da-619e-4594-a051-4fdb3572eb49","2":"7cfe724a-a827-47b1-a691-c741f4f1101d","3":"f1796ed1-f7ce-4b49-ba64-6aacdca02c0a","4":"827e4aae-1214-4c0f-ac7f-9439e8a577af","5":"08dc3c2b-b75c-4ac6-8a38-0a44007fdeaf","6":"54f4e7bb-6fd8-4913-a2c3-69ebc13dc9a2","7":"eda1dbfe-ad08-4067-b064-bcc689fa0225"},"NEWCOLUMN":{"0":1514764800000,"1":1514764800000,"2":1514764800000,"3":1514764800000,"4":1514764800000,"5":1514764800000,"6":1514764800000,"7":1514764800000}})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table,root_path='output.parquet',partition_cols=['Year','Month','Day'])
# Read table: OK
pandas_df=pd.read_parquet('output.parquet')
print(pandas_df)

# Third table: same exact columns but in the wrong order -> fails
df = pd.DataFrame({"NEWCOLUMN":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000},"Day":{"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8},"Year":{"0":2018,"1":2018,"2":2018,"3":2018,"4":2018,"5":2018,"6":2018,"7":2018},"Month":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":1,"7":1},"randNumCol":{"0":2,"1":5,"2":4,"3":3,"4":3,"5":5,"6":4,"7":3},"uuid":{"0":"456578af-8953-4cf7-ac27-70309353b72c","1":"df6a30da-619e-4594-a051-4fdb3572eb49","2":"7cfe724a-a827-47b1-a691-c741f4f1101d","3":"f1796ed1-f7ce-4b49-ba64-6aacdca02c0a","4":"827e4aae-1214-4c0f-ac7f-9439e8a577af","5":"08dc3c2b-b75c-4ac6-8a38-0a44007fdeaf","6":"54f4e7bb-6fd8-4913-a2c3-69ebc13dc9a2","7":"eda1dbfe-ad08-4067-b064-bcc689fa0225"},"Date":{"0":1514764800000,"1":1514764800000,"2":1514764800000,"3":1514764800000,"4":1514764800000,"5":1514764800000,"6":1514764800000,"7":1514764800000}})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table,root_path='output.parquet',partition_cols=['Year','Month','Day'])
pandas_df=pd.read_parquet('output.parquet')
print(pandas_df)

# Fourth table: "NEWCOLUMN" left out -> fails
df = pd.DataFrame({"Date":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000},"Day":{"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8},"Year":{"0":2018,"1":2018,"2":2018,"3":2018,"4":2018,"5":2018,"6":2018,"7":2018},"Month":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":1,"7":1},"randNumCol":{"0":2,"1":5,"2":4,"3":3,"4":3,"5":5,"6":4,"7":3},"uuid":{"0":"456578af-8953-4cf7-ac27-70309353b72c","1":"df6a30da-619e-4594-a051-4fdb3572eb49","2":"7cfe724a-a827-47b1-a691-c741f4f1101d","3":"f1796ed1-f7ce-4b49-ba64-6aacdca02c0a","4":"827e4aae-1214-4c0f-ac7f-9439e8a577af","5":"08dc3c2b-b75c-4ac6-8a38-0a44007fdeaf","6":"54f4e7bb-6fd8-4913-a2c3-69ebc13dc9a2","7":"eda1dbfe-ad08-4067-b064-bcc689fa0225"}})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table,root_path='output.parquet',partition_cols=['Year','Month','Day'])
pandas_df=pd.read_parquet('output.parquet')
print(pandas_df)
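
To make the failing cases work, each dataframe would have to be normalized to the full schema before writing. A minimal sketch (the ALL_COLUMNS / DTYPES master schema and the normalize helper are hypothetical, hardcoded here for illustration; in practice the schema would be accumulated as in the register_columns sketch above):

# Hypothetical master schema for the example tables above.
ALL_COLUMNS = ["Date", "Day", "Year", "Month", "randNumCol", "uuid", "NEWCOLUMN"]
DTYPES = {"Date": "int64", "Day": "int64", "Year": "int64", "Month": "int64",
          "randNumCol": "int64", "uuid": "object",
          "NEWCOLUMN": "float64"}  # float64 so a missing column can hold NaN

def normalize(df):
    # Add missing columns (filled with NaN), drop extras, and enforce
    # a fixed column order and fixed dtypes before every write.
    return df.reindex(columns=ALL_COLUMNS).astype(DTYPES)

table = pa.Table.from_pandas(normalize(df))
pq.write_to_dataset(table, root_path='output.parquet',
                    partition_cols=['Year', 'Month', 'Day'])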

Can you add an example?