
Binning samples into 10-minute intervals (Python, Pandas)


I have a pandas DataFrame consisting of the following columns:

col1, col2, _time

The _time column is a datetime object giving when the row occurred.

I would like to resample my DataFrame by the two columns into 10-minute bins and aggregate, for each group, the number of rows that occurred within each 10-minute bin. I would like the resulting DataFrame to have the following columns:

col1 col2 since until count

where since is the start of each 10-minute bin, until is the end of each 10-minute bin, and count is the number of rows found in the initial DataFrame, e.g.

col1  col2          since                  until         count
1      1       08/12/2017 12:00      08/12/2017 12:10       10
1      2       08/12/2017 12:00      08/12/2017 12:10        5
1      1       08/12/2017 12:10      08/12/2017 12:20        3

Is this possible with the DataFrame resample method?

I had also been looking into the resample method, but nothing worked. Luckily, I found a solution using:

  • Series.dt.floor to align the timestamps to 10-minute intervals
  • the resulting object used in the groupby (or, optionally, assigned to a column in the source data and that column used instead)
  • pd.to_timedelta to compute the until column from since
  • For example,

    import pandas as pd
    
    interval = '10min'  # 10 minutes intervals, please
    
    # Dummy data with 3-minute intervals
    data = pd.DataFrame({
        'col1': [0, 0, 1, 0, 0, 0, 1, 0, 1, 1], 
        'col2': [4, 4, 4, 3, 4, 4, 3, 3, 4, 4], 
        '_time': pd.date_range(start='2010-01-01 00:01:00', freq='3min', periods=10),
    })
    
    # Floor the timestamps to your desired interval
    since = data['_time'].dt.floor(interval).rename('since')
    
    # Get the size of each group - groups are in the index of `agg`
    agg = data.groupby(['col1', 'col2', since]).size()
    agg = agg.rename('count')
    
    # Back to dataframe
    agg = agg.reset_index()
    
    # Simply add your interval to `since`
    agg['until'] = agg['since'] + pd.to_timedelta(interval)
    
    print(agg)
    
       col1  col2               since  count               until
    0     0     3 2010-01-01 00:10:00      1 2010-01-01 00:20:00
    1     0     3 2010-01-01 00:20:00      1 2010-01-01 00:30:00
    2     0     4 2010-01-01 00:00:00      2 2010-01-01 00:10:00
    3     0     4 2010-01-01 00:10:00      2 2010-01-01 00:20:00
    4     1     3 2010-01-01 00:10:00      1 2010-01-01 00:20:00
    5     1     4 2010-01-01 00:00:00      1 2010-01-01 00:10:00
    6     1     4 2010-01-01 00:20:00      2 2010-01-01 00:30:00
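
As a side note, an equivalent result can often be obtained by passing a pd.Grouper for the time column directly to groupby instead of flooring the timestamps first; a minimal sketch under the same assumptions (10-minute bins, a _time datetime column, made-up sample values):

    import pandas as pd
    
    # Hypothetical frame shaped like the question's data
    df = pd.DataFrame({
        'col1': [1, 1, 1, 1, 2],
        'col2': [1, 2, 1, 1, 1],
        '_time': pd.to_datetime([
            '2017-12-08 12:01', '2017-12-08 12:04', '2017-12-08 12:07',
            '2017-12-08 12:12', '2017-12-08 12:15',
        ]),
    })
    
    # pd.Grouper bins the _time column into 10-minute intervals inside the groupby
    out = (df.groupby(['col1', 'col2', pd.Grouper(key='_time', freq='10min')])
             .size()
             .reset_index(name='count')
             .rename(columns={'_time': 'since'}))
    out['until'] = out['since'] + pd.Timedelta('10min')
    print(out)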
    

    If you are still looking for an answer, this example may help you in some respects:

    import pandas as pd
    import numpy as np
    import datetime
    
    # create some random data
    df = pd.DataFrame({
        "col1": np.random.randint(100, size=10),
        "col2": np.random.randint(100, size=10),
        "timestamp": [datetime.datetime(2000, 1, 1) + datetime.timedelta(hours=int(i))
                      for i in np.random.randint(100, size=10)],
    })
    
    # sort data by timestamp and reset index
    df = df.sort_values(by="timestamp").reset_index(drop=True)
    
    # create the bins from the first to the last timestamp with freq 6h
    bins = pd.date_range(start=df.timestamp.values[0],end=df.timestamp.values[-1], freq="6h") # change to reasonable freq (d, h, m, s) 
    # zip them to pairs
    startend =  list(zip(bins, bins.shift(1)))
    
    # define a function that finds bin index
    def time_in_range(x):
        """Return the bin index if x is in the range [start, end]"""
        for ind,(start,end) in enumerate(startend):
            if start <= x <= end:
                return ind
    
    
    # Add bin index to column named index
    df['index'] = df.timestamp.apply(time_in_range)
    # groupby index to find sum and count of col1 and col2 per bin
    df = df.groupby('index')[["col1", "col2"]].agg(['sum', 'count'])
    # flatten the MultiIndex columns so the merge below works on plain labels
    df.columns = ['_'.join(col) for col in df.columns]
    df = df.reset_index()
    
    
    # Create output df2 (with bins)        
    df2 = pd.DataFrame(startend, columns=["start","end"]).reset_index()
    
    # Join the two dataframes with column index
    df3 = pd.merge(df2, df, how='outer', on='index').fillna(0)
    
    # Final adjustments
    df3.columns = ["index","start","end","col1","delete","col2","count"]
    df3.drop(['delete','index'], axis=1, inplace=True)
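
    As a side note, the per-row time_in_range lookup can usually be replaced by a vectorised bin assignment; a minimal sketch with pd.cut, assuming 6-hour edges and made-up timestamps:
    
    import pandas as pd
    
    # Hypothetical timestamps and 6-hour bin edges
    ts = pd.Series(pd.date_range("2000-01-01", periods=10, freq="90min"))
    edges = pd.date_range(ts.min(), ts.max() + pd.Timedelta("6h"), freq="6h")
    
    # pd.cut assigns each timestamp to one of the len(edges) - 1 intervals;
    # labels=False returns the integer bin index, include_lowest keeps the first edge
    bin_index = pd.cut(ts, bins=edges, labels=False, include_lowest=True)
    print(bin_index)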
    
    Can you provide the initial sample data?