Filling in missing rows in a CSV file with Python

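I have a CSV file that contains one row per minute, every day, over many days. It is generated by a data acquisition system, which sometimes misses a few rows.

The data looks like this: a datetime field followed by some integers.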

"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"
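Rows are missing from the (real data) example above. Since the data changes little between samples, I just want to copy the last valid data into the missing rows. The problem I am having is detecting which rows are missing.

I am processing the CSV with a Python program I cobbled together (I am very new to Python). This works for processing my data: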

import csv
import datetime

# 'rb' is the Python 2 idiom for csv; on Python 3 use open("minutedata.csv", newline='')
with open("minutedata.csv", 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        date = datetime.datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
        v1 = int(row[1])
        v2 = int(row[2])
        v3 = int(row[3])
        v4 = int(row[4])
        v5 = int(row[5])
        ...(process values)...

...(save data)...
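What I am not sure about is how to check whether the current row is the next one in the sequence or comes after some missing rows.

Edited to add:

Thanks to jeremycg's pointer, I am now trying to use pandas.

I added a header row to the CSV, so it now looks like this: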

time,v1,v2,v3,v4,v5
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"
The processing code is now:

import pandas as pd
import io
z = pd.read_csv('minutedata.csv')
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']),freq="1min")).ffill()
for row in z:
    date = datetime.datetime.strptime (row [0],"%Y-%m-%d %H:%M:%S")
    v1 = int(row[1])
    v2 = int(row[2])
    v3 = int(row[3])
    v4 = int(row[4])
    v5 = int(row[5])
    ...(process values)...

...(save data)...
But this fails with the following error:

Traceback (most recent call last):
File "process_day.py", line 14, in <module>
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2821, in reindex
**kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2259, in reindex fill_value, copy).__finalize__(self)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2767, in _reindex_axes
fill_value, limit, tolerance)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2778, in _reindex_index allow_dups=False)
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2371, in _reindex_with_indexers copy=copy)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3839, in reindex_indexer self.axes[axis]._can_reindex(indexer)
File "/usr/local/lib/python2.7/site-packages/pandas/indexes/base.py", line 2494, in _can_reindex raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis

My sincere thanks to everyone who has helped me. David

You should probably use pandas for this, since it is made for this kind of thing.

First, read the CSV:

import pandas as pd
import io
x = '''
time,a,b,c,d,e
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"''' #your data, with added headers
z = pd.read_csv(io.StringIO(x)) #you can use your file name here
Now z is a DataFrame:

z.head()

time    a   b   c   d   e
0   2017-01-07 03:00:02 7   3   2   13  0
1   2017-01-07 03:01:02 7   3   2   13  0
2   2017-01-07 03:02:02 7   3   2   12  0
3   2017-01-07 03:07:02 7   3   2   12  0
4   2017-01-07 03:08:02 6   3   2   12  1
Next, convert the "time" column to a pandas datetime:

z['time'] = pd.to_datetime(z['time'])
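Set the DataFrame's index to the time column, then reindex over the full one-minute range between the first and last timestamps. Rows that were missing appear as NaN: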

z = z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min"))
z

a   b   c   d   e
2017-01-07 03:00:02 7.0 3.0 2.0 13.0    0.0
2017-01-07 03:01:02 7.0 3.0 2.0 13.0    0.0
2017-01-07 03:02:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:03:02 NaN NaN NaN NaN NaN
2017-01-07 03:04:02 NaN NaN NaN NaN NaN
2017-01-07 03:05:02 NaN NaN NaN NaN NaN
2017-01-07 03:06:02 NaN NaN NaN NaN NaN
2017-01-07 03:07:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:08:02 6.0 3.0 2.0 12.0    1.0
2017-01-07 03:09:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:10:02 6.0 3.0 2.0 11.0    1.0
Then use .ffill() to fill each gap forward from the last valid value:

z.ffill()

a   b   c   d   e
2017-01-07 03:00:02 7.0 3.0 2.0 13.0    0.0
2017-01-07 03:01:02 7.0 3.0 2.0 13.0    0.0
2017-01-07 03:02:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:03:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:04:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:05:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:06:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:07:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:08:02 6.0 3.0 2.0 12.0    1.0
2017-01-07 03:09:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:10:02 6.0 3.0 2.0 11.0    1.0
Or, all together:

z = pd.read_csv(io.StringIO(x))
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
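The question's loop `for row in z:` iterates over column labels rather than rows, and the chained call above only has an effect if its result is kept. The following is a minimal sketch of one way to keep the filled frame and then walk its rows; the names `csv_text`, `full_range`, `filled`, and `processed` are illustrative assumptions, and the cast back to integers assumes the first row has no missing values.

```python
import io

import pandas as pd

# First rows of the sample from the question, with the v1..v5 header.
csv_text = '''time,v1,v2,v3,v4,v5
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
'''

z = pd.read_csv(io.StringIO(csv_text))
z['time'] = pd.to_datetime(z['time'])

full_range = pd.date_range(z['time'].min(), z['time'].max(), freq="1min")
# reindex() and ffill() return new objects, so keep the result.
# reindex() introduces NaN (floats); after ffill() we can cast back to int.
filled = z.set_index('time').reindex(full_range).ffill().astype(int)

processed = []
for row in filled.itertuples():
    # row.Index is the timestamp; row.v1 .. row.v5 hold integer values.
    processed.append((row.Index, row.v1, row.v2, row.v3, row.v4, row.v5))

print(len(processed))
```

`itertuples()` is used here because it yields one named tuple per row, which maps closely onto the per-row processing in the original script.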

I would recommend using pandas, as jeremycg suggests. However, if you are looking for a solution without pandas, here it is:

import csv
import datetime

data = []

with open("minutedata.csv", newline='') as f:
    reader = csv.reader(f, delimiter=',')

    prev_date = None

    for row in reader:

        date = datetime.datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")

        if prev_date:
            diff = date - prev_date

            if diff > datetime.timedelta(minutes=1):

                for i in range((int(diff.total_seconds() / 60) - 1)):
                    new_date = prev_date + datetime.timedelta(minutes=i + 1)
                    new_row = [str(new_date)] + row[1:]

                    data.append(",".join(new_row))

        prev_date = date

        data.append(",".join(row))

print(data)
Explanation: we iterate over every row and compare the current row's date with the previous row's date.

diff = date - prev_date
If the difference is greater than 1 minute, we enter a loop that runs over the range of missing rows.

if diff > datetime.timedelta(minutes=1):

    for i in range((int(diff.total_seconds() / 60) - 1)):
        ...
We build each missing row by adding the number of minutes to the previous date.

new_date = prev_date + datetime.timedelta(minutes=i + 1)
new_row = [str(new_date)] + row[1:]

And you are done.
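Note that the code above fills each gap with `row[1:]`, which holds the values of the row after the gap, whereas the question asks for the last valid data before it. A hedged variant that carries the previous row's values forward might look like the sketch below; `fill_missing_minutes`, `FMT`, and `sample` are hypothetical names introduced for illustration, and the gap arithmetic is the same as in the answer.

```python
import csv
import datetime
import io

FMT = "%Y-%m-%d %H:%M:%S"


def fill_missing_minutes(rows):
    """Yield every input row; before a row that follows a gap, first yield
    one row per missing minute, copying the previous row's values."""
    prev_date = None
    prev_values = None
    for row in rows:
        date = datetime.datetime.strptime(row[0], FMT)
        if prev_date is not None:
            # Number of whole minutes skipped between the two rows.
            missing = int((date - prev_date).total_seconds() / 60) - 1
            for i in range(missing):
                new_date = prev_date + datetime.timedelta(minutes=i + 1)
                yield [new_date.strftime(FMT)] + prev_values
        yield row
        prev_date, prev_values = date, row[1:]


# Three rows from the question's sample; 03:03 to 03:06 are missing.
sample = '''"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
'''

filled = list(fill_missing_minutes(csv.reader(io.StringIO(sample))))

# Write the result back out with every field quoted, like the input.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL).writerows(filled)
print(out.getvalue())
```

Writing through `csv.writer` rather than `",".join(...)` keeps the quoting of the original file and would also cope with fields that contain commas.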

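While iterating, keep track of the timestamp from the previous row by storing it in a variable and updating it at the end of each iteration.

I just tried this (see my latest edit), but it falls over in a messy way. Any ideas? Thanks for your help; pandas looks very useful.

It looks like your file has some duplicated timestamps. You could try adding the line
z[~z.time.duplicated()]
before
z['time']=pd.to_datetime(z['time'])

That is the same thing, isn't it? Thanks for your help with this. I will look at it with fresh eyes tomorrow.

Ah, it should be
z=z[~z.time.duplicated()]
Before, we did the filter but not the assignment.

Thank you very much, that fixed it. I completely missed the assignment. Thanks. I will try to get to grips with pandas, as I have another use for it and it looks very useful.

I like your solution. I can see why Python is so popular: it is runnable pseudocode.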
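To tie the comment thread together, here is a minimal sketch of the pandas approach with the duplicate-timestamp filter applied and assigned back before reindexing. The repeated 03:01:02 row in `csv_text` is a hypothetical duplicate added only to illustrate the problem; `n_duplicates` and `full_range` are likewise illustrative names.

```python
import io

import pandas as pd

# Sample rows from the question plus one hypothetical repeated timestamp.
csv_text = '''time,v1,v2,v3,v4,v5
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
'''

z = pd.read_csv(io.StringIO(csv_text))
z['time'] = pd.to_datetime(z['time'])

# Count repeated timestamps; any non-zero value makes reindex() fail.
n_duplicates = int(z['time'].duplicated().sum())
print(n_duplicates)

# Keep the first row for each timestamp, and assign the result back.
z = z[~z['time'].duplicated()]

full_range = pd.date_range(z['time'].min(), z['time'].max(), freq="1min")
z = z.set_index('time').reindex(full_range).ffill()
print(z)
```

Keeping the first occurrence is only one possible policy; if the duplicated rows carry different values, a different choice may be more appropriate for the data.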