Python: the fastest way to iterate over a dataframe and insert rows

Tags: python, pandas, performance, dataframe

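I am building a tool to help automate the weekly review of data from several laboratory setups. A tab-delimited text file is generated each day. Each row holds data sampled every 2 seconds, so there are 43,200 rows and many columns (each file is 75 MB).

I load the seven text files with pandas.read_csv and extract only the three columns I need into a pandas dataframe. This is slower than I would like, but acceptable. I then plot the data with Plotly offline to get interactive plots. This is a scheduled task that runs once a week.

The data is plotted against date and time. The test setups are often offline for a while, which leaves gaps in the data. Unfortunately, when this is plotted, all the points are connected by lines, even if the test was offline for hours or days.

The only way to prevent this is to insert a row whose date falls between the two dates that have real data, with NaN for all the missing values. I have implemented this easily enough for a missing data file, but I want to generalize it to any gap in the data longer than a given period. I came up with a solution that seems to work, but it is really slow: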

# alldata is a pandas dataframe with 302,000 rows and 4 columns
# one datetime column and three float32 columns

alldata_gaps  = pandas.DataFrame() #new dataframe with gaps in it

#iterate over all rows. If the datetime difference between 
#two consecutive rows is more than one minute, insert a gap row.

for i in range(0, len(alldata)):
    alldata_gaps = alldata_gaps.append(alldata.iloc[i])
    if alldata.iloc[i+1, 0]-alldata.iloc[i,0] > datetime.timedelta(minutes=1):
        Series = pandas.Series({'datetime' : alldata.iloc[i,0]
        +datetime.timedelta(seconds=3)})
        alldata_gaps = alldata_gaps.append(Series)
        print(Series)
Does anyone have a suggestion for how I could speed up this operation so that it does not take so long?


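Your bottleneck is almost certainly the repeated DataFrame.append calls inside the loop.

As an aside, you have named a variable Series, the same name as the pandas object pd.Series. It is good practice to avoid that kind of ambiguity.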
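
To make the cost of appending concrete, here is a minimal sketch of the usual workaround when a loop cannot be avoided: collect the pieces in a plain Python list and concatenate once at the end. The sample data and the names pieces and threshold are assumptions made for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical stand-in for alldata: one reading every 2 seconds,
# with a 10-minute outage after the third reading.
stamps = pd.to_datetime(['2018-11-06 00:00:00', '2018-11-06 00:00:02',
                         '2018-11-06 00:00:04', '2018-11-06 00:10:04',
                         '2018-11-06 00:10:06'])
alldata = pd.DataFrame({'datetime': stamps, 'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

threshold = pd.Timedelta(minutes=1)
pieces = []  # a plain list: appending to it does not copy the data collected so far

for i in range(len(alldata)):
    pieces.append(alldata.iloc[[i]])  # keep the row as a one-row DataFrame
    if i + 1 < len(alldata):          # the last row has no successor to compare with
        gap = alldata['datetime'].iloc[i + 1] - alldata['datetime'].iloc[i]
        if gap > threshold:
            gap_time = alldata['datetime'].iloc[i] + pd.Timedelta(seconds=3)
            # only the time column is given; concat fills the rest with NaN
            pieces.append(pd.DataFrame({'datetime': [gap_time]}))

# a single concatenation at the end instead of one copy per row
alldata_gaps = pd.concat(pieces, ignore_index=True)
print(alldata_gaps)
```

This removes the quadratic copying, but it still visits every row in Python, so it is only a partial fix.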

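A more efficient solution is to:

  • Identify the times after which gaps occur.
  • Create a single dataframe with rows for those times + 3 seconds.
  • Append it to the existing dataframe and sort by time.

Let's try this with an example dataframe: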

    import numpy as np
    import pandas as pd

    # example dataframe setup
    df = pd.DataFrame({'Date': ['00:10:15', '00:15:20', '00:15:40', '00:16:50', '00:17:55',
                                '00:19:00', '00:19:10', '00:19:15', '00:19:55', '00:20:58'],
                       'Value': list(range(10))})

    df['Date'] = pd.to_datetime('2018-11-06 ' + df['Date'])

    # flag each row that is followed by a gap of more than 1 minute;
    # comparing Timedeltas directly also handles gaps longer than a day,
    # and the last row compares against NaT, which evaluates to False
    bools = df['Date'].diff().shift(-1) > pd.Timedelta(minutes=1)
    idx = bools[bools].index
    # index labels: [0, 2, 3, 4, 8]

    # construct dataframe to append
    df_extra = df.loc[idx].copy().assign(Value=np.nan)

    # add 3 seconds
    df_extra['Date'] = df_extra['Date'] + pd.to_timedelta('3 seconds')

    # combine with the original (DataFrame.append was removed in pandas 2.0,
    # so pd.concat is used instead) and sort by time
    res = pd.concat([df, df_extra]).sort_values('Date')

Result:

    print(res)
    
                     Date  Value
    0 2018-11-06 00:10:15    0.0
    0 2018-11-06 00:10:18    NaN
    1 2018-11-06 00:15:20    1.0
    2 2018-11-06 00:15:40    2.0
    2 2018-11-06 00:15:43    NaN
    3 2018-11-06 00:16:50    3.0
    3 2018-11-06 00:16:53    NaN
    4 2018-11-06 00:17:55    4.0
    4 2018-11-06 00:17:58    NaN
    5 2018-11-06 00:19:00    5.0
    6 2018-11-06 00:19:10    6.0
    7 2018-11-06 00:19:15    7.0
    8 2018-11-06 00:19:55    8.0
    8 2018-11-06 00:19:58    NaN
    9 2018-11-06 00:20:58    9.0
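
The same steps can be wrapped in a small helper for data shaped like the question's (one datetime column plus several float columns). This is only a sketch: the function name insert_gap_rows, its parameters, and the column names temp, pressure and flow are made up for illustration. The sample has a gap of one day and 10 seconds, which a test on the seconds component alone would miss, since that component is only 10.

```python
import numpy as np
import pandas as pd

def insert_gap_rows(df, time_col, threshold, offset):
    """Return df with one all-NaN row added after each gap.

    A gap is a pair of consecutive rows whose time_col values differ by more
    than threshold. The new row is stamped offset after the row that precedes
    the gap. Assumes df is already sorted by time_col.
    """
    followed_by_gap = df[time_col].diff().shift(-1) > threshold
    extra = df.loc[followed_by_gap, [time_col]].copy()  # keep only the time column
    extra[time_col] = extra[time_col] + offset
    # columns that are missing from extra are filled with NaN by concat
    return (pd.concat([df, extra], ignore_index=True)
              .sort_values(time_col, kind='stable')
              .reset_index(drop=True))

# hypothetical data: one datetime column and three float32 columns
alldata = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-11-06 00:00:00', '2018-11-06 00:00:02',
                                '2018-11-07 00:00:12', '2018-11-07 00:00:14']),
    'temp': np.array([20.0, 20.5, 21.0, 21.5], dtype='float32'),
    'pressure': np.array([1.0, 1.1, 1.2, 1.3], dtype='float32'),
    'flow': np.array([5.0, 5.5, 6.0, 6.5], dtype='float32'),
})

result = insert_gap_rows(alldata, 'datetime',
                         threshold=pd.Timedelta(minutes=1),
                         offset=pd.Timedelta(seconds=3))
print(result)
```

Because only boolean masks and one concatenation are involved, the cost grows with the number of rows rather than with its square.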
    

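My general idea is the same as jpp's answer: instead of iterating over the dataframe (which is slow for the amount of data you have), just identify the rows of interest and work with those. The main differences are 1) setting multiple columns to NA and 2) placing the NA row's timestamp halfway between the surrounding times.

I have added explanations as comments.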

    import numpy as np
    import pandas as pd

    # after you read in your data, make sure the time column is actually a datetime
    df['datetime'] = pd.to_datetime(df['datetime'])

    # calculate the (time) difference between a row and the previous row
    df['time_diff'] = df['datetime'].diff()

    # create a subset of your df where the time difference is greater than
    # some threshold. This will be a dataframe of your empty/NA rows.
    # I've set a 2 second threshold here because of the sample data you provided,
    # but could be any number of seconds
    empty = df[df['time_diff'].dt.total_seconds() > 2].copy()

    # calculate the timestamp for the NA rows: halfway across each gap.
    # time_diff is already the gap that ends at this row, so no shift is needed
    # (shifting inside the filtered subset misaligns the gaps and leaves NaT
    # in the last row)
    empty['datetime'] = empty['datetime'] - (empty['time_diff'] / 2)

    # set all the columns to NA apart from the datetime column
    empty.loc[:, ~empty.columns.isin(['datetime'])] = np.nan

    # add this NA/empty dataframe to your original data and sort by time
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    df = pd.concat([df, empty], ignore_index=True)
    df = df.sort_values('datetime').reset_index(drop=True)

    # optionally, remove the time_diff column we created at the beginning
    df = df.drop(columns='time_diff')
    
This gives you the original rows plus one all-NA row placed halfway across each gap.
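
As a variation, the time differences can be kept in a local variable rather than in a temporary column, so nothing has to be dropped afterwards. The following is a sketch under assumptions: the sample data and the column names a and b are invented, and the 2-second threshold is the one used above.

```python
import numpy as np
import pandas as pd

# hypothetical sample: 2-second readings with one 10-second gap
df = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-11-06 00:00:00', '2018-11-06 00:00:02',
                                '2018-11-06 00:00:12', '2018-11-06 00:00:14']),
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [10.0, 20.0, 30.0, 40.0],
})

time_diff = df['datetime'].diff()                # gap between each row and the previous one
after_gap = time_diff > pd.Timedelta(seconds=2)  # rows that come right after a gap

# one placeholder row per gap, stamped halfway across it
midpoints = df.loc[after_gap, 'datetime'] - time_diff[after_gap] / 2
empty = pd.DataFrame({'datetime': midpoints})

# concat fills the columns that are missing from empty with NaN
result = (pd.concat([df, empty], ignore_index=True)
            .sort_values('datetime')
            .reset_index(drop=True))
print(result)
```

Building the placeholder frame from the time column alone also avoids assigning NaN into existing columns, which matters if some of them are not floats.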


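Could you provide a simple example of what your data looks like, so that others can work out a solution for it?

DataFrames are not extendable by row: appending a row takes linear time and space. So if you append n rows in a loop, the loop takes O(n^2) time, which blows up quickly.

Nice edit, +1 :). The only change I would make is that you don't need to explicitly add a series df['time_diff'] and then delete it later. You can store it in a variable, i.e. time_diff = df['datetime'].diff(), and use the Boolean series for comparisons and indexing.

I tried this method and could not quite get it to work. The code was copied and pasted exactly. Generating time_diff works fine and gives the right time differences, but the line empty['datetime'] = empty['datetime'] - (empty['time_diff'].shift(-1) / 2) produces NaT for some reason when I print the resulting series. How does shift know what the periods are?

Never mind, I figured it out. (empty['time_diff'].shift(-1) / 2) just needed to be changed to (empty['time_diff'] / 2). I tested it and it works well, placing the NaN row halfway between the neighbouring values. Thanks for the help. My only other experience is with C, and it seems that working with Python and its libraries is conceptually more different than I thought! Looks like I have some learning to do.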