Python Pandas - ValueError: cannot reindex from a duplicate axis

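I'm working on a data pipeline in Airflow and keep running into this error:

ValueError: cannot reindex from a duplicate axis

I've been banging my head against it for days. Here is the function that causes the trouble: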
def fill_missing_dates(df):
    df['TUNING_EVNT_START_DT'] = pd.to_datetime(df['TUNING_EVNT_START_DT'])
    dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
    masdiv = df['MASDIV'].unique()
    station = df['STATION'].unique()
    idx = pd.MultiIndex.from_product((dates, masdiv, station), names=['TUNING_EVNT_START_DT', 'MASDIV', 'STATION'])
    df = df.set_index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION']).reindex(idx, fill_value=0).reset_index()

    return df

Here is the error output from the AWS CloudWatch logs:
    dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 821, in asfreq
    return self._upsample("asfreq", fill_value=fill_value)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 1125, in _upsample
    res_index, method=method, limit=limit, fill_value=fill_value
  File "/usr/local/lib64/python3.7/site-packages/pandas/util/_decorators.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3976, in reindex
    return super().reindex(**kwargs)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4514, in reindex
    axes, level, limit, tolerance, method, fill_value, copy
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3864, in _reindex_axes
    index, method, copy, level, fill_value, limit, tolerance
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3886, in _reindex_index
    allow_dups=False,
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4577, in _reindex_with_indexers
    copy=copy,
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/internals/managers.py", line 1251, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/base.py", line 3362, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/scripts/anomaly_detection_model.py", line 275, in <module>
    runner(path_prefix, model_name, execution_id, table)
  File "/tmp/scripts/anomaly_detection_model.py", line 230, in runner
    df = multiprocessing(PROCESSORS, df)
  File "/tmp/scripts/anomaly_detection_model.py", line 121, in multiprocessing
    x = pool.map(iforest, (df.loc[df['MASDIV'] == masdiv] for masdiv in args))
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: cannot reindex from a duplicate axis

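I tried everything suggested in several related posts, but nothing worked.

I also don't fully understand why this is happening. Any advice would be greatly appreciated.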
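Without sample data I can't reproduce your error. However, based on the function name fill_missing_dates, I think this alternative solution may achieve what you are after: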
import pandas as pd

df = pd.DataFrame({
    'date': ["2020-01-01 00:01:00", "2020-01-01 00:02:00", "2020-01-01 01:00:00", "2020-01-01 02:00:00",
             "2020-01-01 00:04:00", "2020-01-01 00:05:00",
             "2020-01-03 00:01:00", "2020-01-03 00:02:00", "2020-01-03 01:00:00", "2020-01-03 02:00:00",
             "2020-01-03 00:04:00", "2020-01-03 00:05:00",
            ],
    'station': ["a","a","a","a","b", "b", "a", "a", "a", "a", "b", "b"],
    'data': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

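def resampler(x):
    # Sum each station's rows into daily bins; days with no rows sum to 0.
    return x.set_index('date').resample('D').sum()

df['date'] = pd.to_datetime(df['date'])
multipass = pd.MultiIndex.from_frame(df[["date", "station"]])
df = df.set_index(["date", "station"])
df = df.reindex(multipass)
# Group by station, then resample each group to daily frequency.
result = df.reset_index(level=0).groupby(level=0).apply(resampler)
print(result)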

The result fills the missing dates with 0:
                    data
station date
a       2020-01-01    10
        2020-01-02     0
        2020-01-03    34
b       2020-01-01    11
        2020-01-02     0
        2020-01-03    23

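
As for why the error occurs: pandas refuses to reindex when the index being reindexed contains repeated labels, and the traceback shows that resample('D').asfreq() reindexes internally. So the most likely cause is that TUNING_EVNT_START_DT holds the same timestamp on more than one row, which is expected if there is one row per date, MASDIV and STATION. The sketch below is one possible rework of fill_missing_dates, not a tested fix for your data. It builds the date range with pd.date_range instead of resampling, and collapses repeated key combinations before reindexing. The VIEWS column, the sample rows and the choice to sum duplicates are assumptions made for illustration, and it also assumes the timestamps are already at daily granularity.

```python
import pandas as pd

KEYS = ['TUNING_EVNT_START_DT', 'MASDIV', 'STATION']


def fill_missing_dates(df):
    df = df.copy()
    df['TUNING_EVNT_START_DT'] = pd.to_datetime(df['TUNING_EVNT_START_DT'])

    # Assumption: rows sharing the same (date, MASDIV, STATION) can be summed.
    # After this step every key combination appears exactly once.
    df = df.groupby(KEYS, as_index=False).sum(numeric_only=True)

    # Build the daily range directly; no resample, so no hidden reindex.
    start = df['TUNING_EVNT_START_DT'].min().normalize()
    end = df['TUNING_EVNT_START_DT'].max().normalize()
    dates = pd.date_range(start, end, freq='D')

    idx = pd.MultiIndex.from_product(
        (dates, df['MASDIV'].unique(), df['STATION'].unique()),
        names=KEYS,
    )
    return df.set_index(KEYS).reindex(idx, fill_value=0).reset_index()


# Hypothetical sample data: 2020-01-01 appears on several rows.
raw = pd.DataFrame({
    'TUNING_EVNT_START_DT': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-03'],
    'MASDIV': ['m1', 'm1', 'm1', 'm1'],
    'STATION': ['a', 'a', 'b', 'b'],
    'VIEWS': [1, 2, 4, 5],
})

# Diagnose: a non-unique datetime index is what makes asfreq() fail.
indexed = raw.assign(
    TUNING_EVNT_START_DT=pd.to_datetime(raw['TUNING_EVNT_START_DT'])
).set_index('TUNING_EVNT_START_DT')
print('index is unique:', indexed.index.is_unique)
try:
    indexed.resample('D').asfreq()
except ValueError as exc:
    print('asfreq failed:', exc)

filled = fill_missing_dates(raw)
print(filled)
```

If summing is not the right way to combine repeated rows in your data, swap the aggregation for whatever fits (for example mean or first). To check your real frame before changing anything, df.duplicated(subset=KEYS, keep=False) flags every row whose key combination is repeated.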