Python: How to use TimeGrouper to iterate over multiple files that cover different ranges


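I have a set of files. Each file contains data at 1-second resolution. The files are not periodic, i.e. they are not daily files: one file may contain a day and a half of data while the next contains three and a half days, and there may be gaps both between files and within them. A further constraint is that loading all the files into memory at once is impractical.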

Here is a concrete example that illustrates the problem. The following DataFrame holds a day and a half of 1-second data:

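import numpy as np
import pandas as pd

# periods must be a whole number, so cast the float result of 24h * 1.5
index = pd.date_range('now', periods=int(60*60*24*1.5), freq='1S')
data_a = pd.DataFrame(np.random.rand(len(index)), index=index, columns=['data'])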

The next DataFrame starts where the previous one left off and holds two days of data:

index = pd.date_range(data_a.index[-1] + pd.Timedelta('1S'), periods=60*60*24*2, freq='1S')
data_b = pd.DataFrame(np.random.rand(len(index)), index=index, columns=['data'])

Let's create a 10-minute group iterator over each DataFrame and then combine them as follows:
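The snippet that builds `iaib` does not appear in the text. The following is only a sketch of what it might look like, assuming `iaib` is the two per-DataFrame group iterators joined with `itertools.chain`. The names `ia` and `ib`, the fixed start time, and the shortened 15-minute frames are assumptions made for illustration, and `pd.Grouper(freq=...)` is used in place of the older `TimeGrouper`.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Small stand-ins for data_a and data_b (assumed sizes): 15 minutes of
# 1-second data each, with data_b starting one second after data_a ends,
# i.e. in the middle of a 10-minute bin.
index_a = pd.date_range('2017-06-01 00:00:00', periods=15 * 60, freq='s')
data_a = pd.DataFrame(np.random.rand(len(index_a)), index=index_a, columns=['data'])

index_b = pd.date_range(index_a[-1] + pd.Timedelta(seconds=1), periods=15 * 60, freq='s')
data_b = pd.DataFrame(np.random.rand(len(index_b)), index=index_b, columns=['data'])

# One iterator of (key, group) pairs per DataFrame, then chained together.
ia = iter(data_a.groupby(pd.Grouper(freq='10min')))
ib = iter(data_b.groupby(pd.Grouper(freq='10min')))
iaib = chain(ia, ib)
```

With these stand-in frames, the 00:10 bin is covered partly by `data_a` and partly by `data_b`, so its key is produced once by each iterator.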

If we iterate over iaib, the expected behavior is to see each group key (and its data) only once, but that is not what happens:

seen = {}
for name, group in iaib:
    count = seen.get(name, 0)
    seen[name] = count + 1

seen_twice = {key: value for key, value in seen.items() if value > 1}

The content of seen_twice is:
{Timestamp('2017-06-02 08:50:00', freq='10T'): 2}

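In this example, 2017-06-02 08:50:00 is the key of both the last group of data_a and the first group of data_b.

How can I iterate over all the files in 10-minute groups without repeating groups at file boundaries?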
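The solution has two parts: the first is to treat all the files as a single dataset; the second is to account for the fact that a 10-minute group can be split between the end of one file and the start of the next.
These are the required imports (TimeGrouper was deprecated in pandas 0.21 and removed in later releases, where pd.Grouper(freq='10Min') is the equivalent):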
from itertools import chain

import pandas as pd
from pandas.tseries.resample import TimeGrouper

Treating all files as a single dataset
This function returns an iterator over the 10-minute groups of a given file:
def make_iterator(file):
    df = pd.read_csv(file, index_col='timestamp', parse_dates=['timestamp'])
    return iter(df.groupby(TimeGrouper('10Min')))

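The function above is used to build an iterator of iterators. Given a list of files (assumed here to be in chronological order), an iterator over all the 10-minute groups of the whole collection can be created as follows: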
files = ... # list obtained by os.listdir() or glob.glob()    
iterator_of_single_file_group_iterators = map(make_iterator, files)
chained_file_group_iterator = chain.from_iterable(iterator_of_single_file_group_iterators)
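As a rough illustration of the chaining step, the sketch below runs the same pattern on two tiny in-memory CSVs. The CSV contents and the use of `io.StringIO` in place of real file paths are assumptions for the sake of a self-contained example, and `pd.Grouper(freq='10min')` stands in for `TimeGrouper('10Min')`.

```python
import io
from itertools import chain

import pandas as pd

# Two tiny in-memory "files" standing in for CSVs on disk (hypothetical data).
csv_a = ("timestamp,data\n"
         "2017-06-01 00:00:00,1\n"
         "2017-06-01 00:09:59,2\n"
         "2017-06-01 00:10:00,3\n")
csv_b = ("timestamp,data\n"
         "2017-06-01 00:10:01,4\n"
         "2017-06-01 00:20:00,5\n")


def make_iterator(file):
    df = pd.read_csv(file, index_col='timestamp', parse_dates=['timestamp'])
    # pd.Grouper(freq=...) is the replacement for TimeGrouper.
    return iter(df.groupby(pd.Grouper(freq='10min')))


files = [io.StringIO(csv_a), io.StringIO(csv_b)]

# map() is lazy in Python 3 and chain.from_iterable() pulls from it on demand,
# so a file is only read once the chain reaches it.
chained = chain.from_iterable(map(make_iterator, files))

keys = [key for key, _ in chained]
print(keys)
```

Both files contribute a group keyed 00:10:00 here, which is the duplication across a file boundary that the next step addresses.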

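Accounting for a group split between the end of one file and the start of the next
However, the iterator above is unaware of 10-minute groups that span two files. The following class solves this problem: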
class TimeGrouperChainDecorator(object):

    def __init__(self, iterator):
        self.iterator = iterator
        self._has_more = True
        self._last_item = next(self.iterator)

    def __iter__(self):
        return self

    def __next__(self):
        if not self._has_more:
            raise StopIteration
        try:
            return self._next()
        except StopIteration:
            self._has_more = False
            if self._last_item is not None:
                return self._last_item
            raise StopIteration

    def _next(self):
        new_key, new_data = next(self.iterator)

        last_key, last_data = self._last_item
        if new_key == last_key:
            data = pd.concat([last_data, new_data])
            try:
                self._last_item = next(self.iterator)
            except StopIteration:
                self._has_more = False
            return new_key, data
        else:
            self._last_item = new_key, new_data
            return last_key, last_data

Note that the implementation relies entirely on the pandas groupby API. To use it, create an instance of the class from the chained iterator above:
iterator = TimeGrouperChainDecorator(chained_file_group_iterator)

for name, group in iterator:
    # do something with each 10 minute group
    pass

My implementation may not be perfect, so any feedback is welcome. I have posted one. Thanks for the feedback; I hope I have improved the question.
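By way of feedback, the same merging idea can also be sketched as a plain generator. This is an alternative under stated assumptions, not the class above: the helper names `merge_adjacent_groups` and `frame` are hypothetical, the groups are assumed to arrive in chronological order, and `pd.Grouper` is used instead of `TimeGrouper`. Unlike a two-way merge, it concatenates any number of consecutive groups that share a key, which matters when a short file falls entirely inside one 10-minute bin.

```python
from itertools import chain

import pandas as pd


def merge_adjacent_groups(groups):
    """Yield (key, data) pairs, concatenating consecutive pairs with equal keys."""
    pending_key, pending_parts = None, []
    for key, data in groups:
        if pending_parts and key != pending_key:
            # The key changed, so the pending group is complete.
            yield pending_key, pd.concat(pending_parts)
            pending_parts = []
        pending_key = key
        pending_parts.append(data)
    if pending_parts:
        # Flush the final group once the input is exhausted.
        yield pending_key, pd.concat(pending_parts)


def frame(start, periods):
    """Hypothetical helper: a 1-second DataFrame standing in for one file."""
    index = pd.date_range(start, periods=periods, freq='s')
    return pd.DataFrame({'data': range(periods)}, index=index)


# Three contiguous "files"; the middle one lies entirely inside the 00:10 bin.
frames = [
    frame('2017-06-01 00:00:00', 720),  # 00:00:00 to 00:11:59
    frame('2017-06-01 00:12:00', 120),  # 00:12:00 to 00:13:59
    frame('2017-06-01 00:14:00', 720),  # 00:14:00 to 00:25:59
]
chained = chain.from_iterable(
    iter(df.groupby(pd.Grouper(freq='10min'))) for df in frames
)

merged = list(merge_adjacent_groups(chained))
for key, data in merged:
    print(key, len(data))
```

Because the generator only holds the parts of the group currently being assembled, it keeps the one-file-at-a-time behavior of the chained iterator.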