Python: merging multiple DataFrame objects

python pandas dataframe

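I have a list of pandas DataFrame objects, df_list, which I loaded from CSV and Parquet files and which have Timestamp objects as their index. I want to merge them into a single DataFrame object as follows:

The index of each DataFrame object is unique. The index of the merged DataFrame should also be unique, and the result should be sorted by the index column. If an index value exists in two or more of the original DataFrame objects, the column values for that index must be the same in all of them. Otherwise, the script should report an error, for example by raising an exception. Some columns have floating-point values, and some of those values may have lost precision when they were originally saved as CSV, because that requires converting the values to text, which involves rounding. If that is the case, the value comparison needs to take it into account. Each original DataFrame object is considered to cover the time period between the index of its first row and the index of its last row. I would like to know which time periods the merged DataFrame object covers. This is the union of the time periods of the original DataFrame objects, and it may contain gaps. How can I do this efficiently using pandas commands?

I tried the following: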

intervals = pd.DataFrame(columns=column_name_list).set_index(index_name)
for current_df in df_list:
    for index in current_df.index:
        if index in intervals.index:
            if current_df.loc[index] != intervals.loc[index]:
                raise RuntimeError("Entries for {} do not match: {} and {}, respectively".format(repr(index), repr(current_df.loc[index]), repr(intervals.loc[index])))
        intervals.loc[index] = current_df.loc[index]

But this is very slow, and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-a3867a7c8aa1> in <module>
     78             for index in current_df.index:
     79                 if index in intervals.index:
---> 80                     if current_df.loc[index] != intervals.loc[index]:
     81                         raise RuntimeError("Entries for {} do not match: {} and {}, respectively".format(repr(index), repr(current_df.loc[index]), repr(intervals.loc[index])))
     82                 intervals.loc[index] = current_df.loc[index]

D:\ProgramData\Miniconda3\envs\stocks\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1327 
   1328     def __nonzero__(self):
-> 1329         raise ValueError(
   1330             f"The truth value of a {type(self).__name__} is ambiguous. "
   1331             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

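So it seems I cannot use the != operator here. In addition, my code does not yet account for possible rounding errors, nor does it determine the covered time periods.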

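The loop in your attempt iterates over every element of the existing DataFrame for every element of the new DataFrame. That is O(n^2), which is slow. If you can sort the data and process it in a single pass, it will be much faster.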
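The same sort-then-single-pass idea can also be applied to the covered time periods asked about in the question. The following is only a minimal sketch: the function name covered_periods is made up for illustration, and it assumes that each frame covers the closed span from its smallest to its largest index value and that touching or overlapping spans should be fused.

```python
import pandas as pd


def covered_periods(df_list):
    """Return the union of the frames' [first, last] index spans as sorted (start, end) pairs."""
    # One (start, end) pair per non-empty frame, sorted by start time.
    spans = sorted((df.index.min(), df.index.max()) for df in df_list if len(df))
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous period: extend that period.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # A gap precedes this span: start a new period.
            merged.append((start, end))
    return merged
```

Sorting the spans costs O(k log k) for k frames, and the fusing loop is a single linear pass, so no pairwise comparison of frames is needed.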

"How can I do this efficiently using pandas commands?"

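You can use concat() for this, then sort, and then use duplicated() with keep=False on the index to get a boolean index into the DataFrame.

Then use that boolean index with .loc[] to get the subset of the DataFrame that has duplicate index values. Call it the dupe DataFrame. Then remove the rows that are exactly identical. If the dupe DataFrame still has a nonzero number of rows at this point, raise an exception, because there are duplicate index values with different column values. The dupe DataFrame then contains the problematic data.

This explanation is a bit vague, sorry. You did not include a sample dataset, so I have nothing to test against.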
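The steps above could be sketched roughly as follows. This is an untested-against-real-data illustration, not a definitive implementation: the function name merge_frames is hypothetical, and the reset_index() step is an addition of this sketch so that the index value takes part in the "exactly identical" comparison and two different timestamps with the same column values are not collapsed into one.

```python
import pandas as pd


def merge_frames(df_list):
    """Concatenate frames, tolerate exact duplicate rows, and fail on conflicting rows."""
    combined = pd.concat(df_list).sort_index()
    # Boolean mask marking every row whose index value occurs more than once.
    dupe_mask = combined.index.duplicated(keep=False)
    # Move the index into a column so index and values are compared together,
    # then drop rows that are exact copies of an earlier row.
    dupes = combined[dupe_mask].reset_index().drop_duplicates()
    # An index value that still repeats now has differing column values.
    index_col = dupes.columns[0]
    conflicts = dupes[dupes[index_col].duplicated(keep=False)]
    if not conflicts.empty:
        raise ValueError("Conflicting rows for the same index:\n{}".format(conflicts))
    # The remaining duplicates are exact copies, so keep one row per index value.
    return combined[~combined.index.duplicated(keep="first")]
```

The comparison here is exact, so float values that were rounded on their way through a CSV file would be reported as conflicts.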

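"Some columns have floating-point values, and some of those values may have lost precision when they were originally saved as CSV, because that requires converting the values to text, which involves rounding."

There is a simpler way here: you should read the CSV in the same way that you wrote it. Look into making the CSV round trip lossless (read_csv has a float_precision="round_trip" option).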
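As a small illustration of the round-trip idea, the sketch below writes a frame to an in-memory CSV and reads it back with float_precision="round_trip". The column name value and the index name timestamp are made up for the example, and it assumes the file was written without a rounding float_format.

```python
import io

import pandas as pd

original = pd.DataFrame(
    {"value": [0.1 + 0.2, 1.0 / 3.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02"]),
)
original.index.name = "timestamp"

buffer = io.StringIO()
original.to_csv(buffer)  # no float_format, so values are not deliberately rounded
buffer.seek(0)

# Ask the parser for the round-trip float converter instead of the faster default.
restored = pd.read_csv(
    buffer,
    index_col="timestamp",
    parse_dates=True,
    float_precision="round_trip",
)

# Compare the float values read back with the ones that were written.
print(restored["value"].tolist() == original["value"].tolist())
```

This only helps for files you can write (or rewrite) yourself; values that were already rounded when an existing CSV was created cannot be recovered by the reader.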

If that is not feasible, see
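When an exact round trip is not an option, one possible approach (an assumption of this sketch, not necessarily what the answer had in mind) is to compare numeric columns with a tolerance using numpy.isclose. The function name conflicting_index and the default tolerances are hypothetical; the sketch assumes both frames have unique indexes and the same columns.

```python
import numpy as np
import pandas as pd


def conflicting_index(left, right, rtol=1e-9, atol=0.0):
    """Return the shared index values whose rows differ beyond the given tolerance."""
    shared = left.index.intersection(right.index)
    a = left.loc[shared]
    b = right.loc[shared, left.columns]
    num_cols = a.select_dtypes(include="number").columns
    other_cols = a.columns.difference(num_cols)
    # Numeric columns: equal within tolerance, with NaN treated as equal to NaN.
    close = np.isclose(
        a[num_cols].to_numpy(dtype=float),
        b[num_cols].to_numpy(dtype=float),
        rtol=rtol,
        atol=atol,
        equal_nan=True,
    ).all(axis=1)
    # Non-numeric columns: exact equality.
    same = (a[other_cols].to_numpy() == b[other_cols].to_numpy()).all(axis=1)
    return shared[~(close & same)]
```

An empty result means the two frames agree on every shared timestamp; the tolerance would need to be chosen to match the number of digits the CSV files actually kept.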

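As for the ValueError: if you call df.loc[foo], it looks up foo in the index. This gives you a Series holding the values of that row. If you compare two Series with == or !=, you get another Series of True/False values. To treat that Series as a single boolean, you need to decide whether you care about any value being True, or whether you want the result to be True only when all values are True.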

For example:

if (current_df.loc[index] != intervals.loc[index]).any():
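To make the difference between the element-wise result and a single boolean concrete, here is a tiny self-contained sketch. The two rows and their column labels are invented for the example.

```python
import pandas as pd

row_a = pd.Series({"open": 1.0, "close": 2.0})
row_b = pd.Series({"open": 1.0, "close": 2.5})

# Element-wise comparison: the result is a boolean Series, not a single bool,
# which is why using it directly in an `if` raises the ValueError.
differs = row_a != row_b

any_mismatch = bool(differs.any())  # True if at least one column differs
all_mismatch = bool(differs.all())  # True only if every column differs

# Series.equals returns a single bool and treats NaNs in the same position as equal.
identical = row_a.equals(row_b)
```

For the check in the question, .any() is the variant that matches "raise if the rows differ in at least one column".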

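OK, I am not sure whether a DataFrame stores its index in some kind of tree structure, since I found very little information about the memory layout of DataFrames; in that case, checking whether an index value exists might take only O(log n) time. On the other hand, I guess inserting an element at the right position would still take O(n) time, so the code would still run in O(n^2) overall. For information on the pandas memory layout, click here. For information on the time complexity of index lookups, click here.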