Python AssertionError: invalid dtype determination in get_concat_dtype when concatenating a list of DataFrames

I have a list of DataFrames that I am trying to combine using the concat function:

dataframe_lists = [df1, df2, df3]

result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
The full traceback is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
      2 check(dataframe_lists)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    753                        verify_integrity=verify_integrity,
    754                        copy=copy)
--> 755     return op.get_result()
    756 
    757 

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
    924 
    925             new_data = concatenate_block_managers(
--> 926                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
    927             if not self.copy:
    928                 new_data._consolidate_inplace()

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
   4150         raise AssertionError("Concatenating join units along axis0")
   4151 
-> 4152     empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
   4153 
   4154     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
   4139         return np.dtype('m8[ns]'), tslib.iNaT
   4140     else:  # pragma
-> 4141         raise AssertionError("invalid dtype determination in get_concat_dtype")
   4142 
   4143 

AssertionError: invalid dtype determination in get_concat_dtype
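I would like to know whether, when one of the DataFrames is empty, this function can still return the empty DataFrame's headers and append them to the concatenated DataFrame. The output would be a single row of headers (and, for a repeated column name, only one instance of that header, as the concat function already does). I have two sample non-empty data sets and one empty one.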

I would like the concatenated result to have the column headers

 'AT','AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country','DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT','PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'
with the empty DataFrame's headers appended to this row if they are new.

I welcome feedback on the best way to do this.

As the answer below details, this is a rather unexpected result:

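Unfortunately, because the material is sensitive, I cannot share the actual data. What leads up to the content shown in the gist is the following: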

# Read the CSV in chunks of 1000 rows, then stitch the chunks into one DataFrame
data = pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines=False, iterator=True, chunksize=1000)
data = pd.concat([chunk for chunk in data], ignore_index=True)
A = data[data['RRT'] == 'A']  # select just the rows of "data" where RRT is 'A'
B = data[data['RRT'] == 'B']
C = data[data['RRT'] == 'C']
D = data[data['RRT'] == 'D']
For each new DataFrame I apply the following logic:

for column_name, column in A.transpose().iterrows():
    AColumns= A[['ANum','RTID', 'Description','Type','Status', 'AD', 'CD', 'OD', 'RCD']]  #get select columns indexed with dataframe, "A"
When I look at the bound method on the empty DataFrame A:

AColumns.count
this is the output:

<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>
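I am not sure what else I can provide. Concatenation works for all of the other DataFrames I need to meet the requirement. I have also looked at pandas' internals.py and the full trace. Either I have too many columns containing NaN, or duplicate column names, or mixed dtypes (the last being the least likely culprit).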


Thanks again for your guidance.

I cannot reproduce your error; it works fine for me:

df1 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/42708e6a3ca0aed9b79b/raw/f37738994c3285e1b670d3926e716ae027dc30bc/sample_data.csv')
df2 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/26eb4ce1578e0844eb82/raw/23d9063dad7793d87a2fed2275857c85b59d56bb/sample2.csv')
df3 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/0721bd8b71416b54eccd/raw/b7ecae63beff88bd076a93d83500eb5fa67e1278/empty_df.csv')
pd.concat([df1,df2,df3], keys = ['one', 'two','three'], ignore_index=True).head()

Out[68]: 
   'B'  'C'  'D'  'E'  'F'  'G'  'A'  AT  AccountNum  AcctType ...
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
2  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
3  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    

   ToAccountNum  ToAccountT  TransferAmount  TransferMade  TransferTimestamp
0           NaN         NaN               4          True      1/7/2000 0:00
1           NaN         NaN               4          True      1/8/2000 0:00   
2           NaN         NaN               6          True      1/9/2000 0:00   
3           NaN         NaN               6          True     1/10/2000 0:00   
4           NaN         NaN               0         False     1/11/2000 0:00   

   Ttype  Unnamed: 0  WA   WC  Zip  
0      D           4 NaN  NaN  NaN  
1      D           5 NaN  NaN  NaN  
2      D          13 NaN  NaN  NaN  
3      D          14 NaN  NaN  NaN  
4      T          25 NaN  NaN  NaN  

[5 rows x 41 columns]

I have noticed that this can happen when concatenating or appending an empty DataFrame. Try the following example:

    my_headers = ['A', 'B', 'C']
I have a DataFrame with values, df_input, whose headers are not necessarily the same as my_headers.

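    import pandas as pd

    # Build a one-row placeholder frame whose columns are my_headers, all values None
    dictionary = {element: None for element in my_headers}
    df = pd.DataFrame(dictionary, index=[0])
    # Append the two DataFrames (DataFrame.append was removed in pandas 2.0; use pd.concat there)
    df_final = df_input.append(df)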
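
If the goal is only to make sure the result carries every expected header, a different technique may be simpler than appending a placeholder row. The sketch below uses reindex instead of append, so it adds columns only and no extra row. The helper name ensure_headers and the sample data are illustrative assumptions, and it assumes df_input's column names are already unique:

```python
import pandas as pd


def ensure_headers(df_input, my_headers):
    """Return a copy of df_input that also has every column in my_headers.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Keep the existing column order, then add the headers not already present.
    missing = [h for h in my_headers if h not in df_input.columns]
    # reindex creates the missing columns and fills them with NaN.
    return df_input.reindex(columns=list(df_input.columns) + missing)


# Made-up sample data: 'A' overlaps with my_headers, 'X' does not.
df_input = pd.DataFrame({"A": [1, 2], "X": ["p", "q"]})
df_final = ensure_headers(df_input, ["A", "B", "C"])
print(list(df_final.columns))
```

The new columns are filled with NaN and the number of rows is unchanged, so nothing has to be removed afterwards.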

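We ran into the same error in one of our projects. After debugging, we found the cause: one of our DataFrames had two columns with the same name. Renaming one of those columns fixed the problem.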
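
A minimal sketch of that kind of renaming, assuming a numeric suffix is an acceptable way to tell the repeated columns apart. The helper make_columns_unique, the suffix scheme, and the sample frames are illustrative assumptions rather than what that project actually did, and the sketch does not guard against a suffixed name colliding with an existing column:

```python
import pandas as pd


def make_columns_unique(df):
    """Return a copy of df whose repeated column names get a numeric suffix.

    Hypothetical helper for illustration; not part of pandas.
    """
    seen = {}
    new_names = []
    for name in df.columns:
        count = seen.get(name, 0)
        # The first occurrence keeps its name; later ones become name_1, name_2, ...
        new_names.append(name if count == 0 else "{}_{}".format(name, count))
        seen[name] = count + 1
    out = df.copy()
    out.columns = new_names
    return out


# Made-up frames: df_dup has the column name 'RTID' twice.
df_dup = pd.DataFrame([[1, 2, 3]], columns=["RTID", "RTID", "Zip"])
df_ok = pd.DataFrame([[4, 5]], columns=["RTID", "Zip"])

result = pd.concat([make_columns_unique(df_dup), df_ok], ignore_index=True)
print(list(result.columns))
```

Whether to rename or to drop the repeated column depends on whether the two columns really hold different data.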

This usually means that one of the DataFrames has two columns with the same name.

You can check this by evaluating the following for each DataFrame df that you are trying to concatenate:

import numpy as np
len(df.columns) > len(np.unique(df.columns))
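
To run that check over a whole list of frames and see which one is at fault, something like the following may help. It is a sketch under assumptions: the helper duplicated_columns and the sample frames are made up for illustration, and it uses pandas' Index.duplicated() rather than the np.unique comparison:

```python
import pandas as pd


def duplicated_columns(frames):
    """Map each key to the column names that occur more than once in that frame.

    Hypothetical helper for illustration; not part of pandas.
    """
    report = {}
    for key, df in frames.items():
        # Index.duplicated() flags the second and later occurrences of a name.
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        if dupes:
            report[key] = dupes
    return report


# Made-up frames keyed like the concat call in the question.
frames = {
    "one": pd.DataFrame([[1, 2]], columns=["RTID", "Zip"]),
    "two": pd.DataFrame([[1, 2, 3]], columns=["RTID", "RTID", "Zip"]),
    "three": pd.DataFrame(columns=["RTID", "Zip"]),
}
print(duplicated_columns(frames))  # {'two': ['RTID']}
```

An empty report means every frame has unique column names, in which case the cause probably lies elsewhere.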

You can use Counter to identify the culprit columns, for example:

from collections import Counter
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]

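It would be useful to have a minimal example that illustrates the problem, i.e. several very small DataFrames, one of which is empty. One approach, though, is to make the empty one non-empty by giving it a single row of values, which can be removed from the concatenated result afterwards.

@TrisNefzger I added an empty DataFrame and the desired output. How would I do this? Fill it with a dummy variable or with the built-in fillna method? Also, how would I remove that row afterwards?

Which version of pandas are you using?

@joris I am using 0.16.2, with Python 3.4.3 64-bit and Jupyter Notebook as my IDE. I edited my original question, starting from "I welcome feedback on the best way to do this".

Confirmed, exactly the same problem. The error message could be better.

This is what did it for me.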