Python: fastest way to iterate over a dataframe and insert rows

python pandas performance dataframe

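I am building a tool to help automate the weekly review of data from several laboratory setups. A tab-delimited text file is generated each day. Each row represents data taken every 2 seconds, so there are 43,200 rows and many columns (each file is 75 MB).

I am loading the seven text files with pandas.read_csv and extracting only the three columns I need into a pandas dataframe. This is slower than I would like, but acceptable. Then I plot the data with Plotly offline to view an interactive plot. This is a scheduled task set up to run once a week.

The data is plotted against date and time. Often the test setups are temporarily offline, and there are gaps in the data. Unfortunately, when this is plotted, all the data points are connected by lines, even if the test was offline for hours or days.

The only way to prevent this is to insert a row whose date falls between the two dates that have actual data, with NaN for all the missing values. I have easily implemented this for missing data files, but I want to generalize it to any gap in the data longer than a certain time period. I came up with a solution that seems to work, but it is very slow: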

# alldata is a pandas dataframe with 302,000 rows and 4 columns
# one datetime column and three float32 columns

alldata_gaps  = pandas.DataFrame() #new dataframe with gaps in it

#iterate over all rows. If the datetime difference between 
#two consecutive rows is more than one minute, insert a gap row.

for i in range(0, len(alldata)):
    alldata_gaps = alldata_gaps.append(alldata.iloc[i])
    if alldata.iloc[i+1, 0]-alldata.iloc[i,0] > datetime.timedelta(minutes=1):
        Series = pandas.Series({'datetime' : alldata.iloc[i,0]
        +datetime.timedelta(seconds=3)})
        alldata_gaps = alldata_gaps.append(Series)
        print(Series)

Does anyone have a suggestion for how I can speed this operation up so it doesn't take so long?

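Almost certainly your bottleneck is the repeated call to DataFrame.append inside the loop, which copies the whole dataframe every time a row is added.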

As an aside, you have named a variable the same as a Pandas object, pd.Series. It is good practice to avoid such ambiguity.
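To make the cost concrete, here is a minimal sketch of the usual workaround when a loop is kept: collect the new rows in a plain Python list and build one dataframe at the end, instead of growing a dataframe inside the loop. The sample `alldata` frame and its column names are made up for illustration; they are not the asker's real data.

```python
import datetime

import pandas as pd

# Made-up stand-in for `alldata`: a datetime column plus one float column
alldata = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-11-06 00:00:00', '2018-11-06 00:00:02',
                                '2018-11-06 00:05:00', '2018-11-06 00:05:02']),
    'value': [1.0, 2.0, 3.0, 4.0],
})

gap_times = []  # plain Python list: appending to it is cheap
times = alldata['datetime'].tolist()
for i in range(len(times) - 1):  # stop one short so i + 1 stays in range
    if times[i + 1] - times[i] > datetime.timedelta(minutes=1):
        gap_times.append(times[i] + datetime.timedelta(seconds=3))

# Build all gap rows at once; columns missing from `gaps` are filled with NaN by concat
gaps = pd.DataFrame({'datetime': pd.to_datetime(gap_times)})
alldata_gaps = (pd.concat([alldata, gaps], ignore_index=True)
                  .sort_values('datetime')
                  .reset_index(drop=True))
print(alldata_gaps)
```

This still loops in Python, so it is only meant to show that the quadratic cost comes from growing the dataframe, not from the loop itself. A fully vectorised version avoids the loop as well.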

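A more efficient solution is to:

1. Identify the times after which gaps occur.
2. Create a single dataframe with data for these times + 3 seconds.
3. Append it to the existing dataframe and sort by time.

Let's have a go with a sample dataframe: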

# example dataframe setup
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['00:10:15', '00:15:20', '00:15:40', '00:16:50', '00:17:55',
                            '00:19:00', '00:19:10', '00:19:15', '00:19:55', '00:20:58'],
                   'Value': list(range(10))})

df['Date'] = pd.to_datetime('2018-11-06 ' + df['Date'])

# find rows that are followed by a gap greater than 1 minute
# (comparing the Timedelta directly also handles gaps longer than a day,
# which .dt.seconds would not, since it ignores the days component)
bools = (df['Date'].diff() > pd.Timedelta(minutes=1)).shift(-1, fill_value=False)
idx = bools[bools].index
# index labels: 0, 2, 3, 4, 8

# construct dataframe to append
df_extra = df.loc[idx].copy().assign(Value=np.nan)

# add 3 seconds
df_extra['Date'] = df_extra['Date'] + pd.to_timedelta('3 seconds')

# append to original (DataFrame.append was removed in pandas 2.0, so use pd.concat)
res = pd.concat([df, df_extra]).sort_values('Date')

Result:

print(res)

                 Date  Value
0 2018-11-06 00:10:15    0.0
0 2018-11-06 00:10:18    NaN
1 2018-11-06 00:15:20    1.0
2 2018-11-06 00:15:40    2.0
2 2018-11-06 00:15:43    NaN
3 2018-11-06 00:16:50    3.0
3 2018-11-06 00:16:53    NaN
4 2018-11-06 00:17:55    4.0
4 2018-11-06 00:17:58    NaN
5 2018-11-06 00:19:00    5.0
6 2018-11-06 00:19:10    6.0
7 2018-11-06 00:19:15    7.0
8 2018-11-06 00:19:55    8.0
8 2018-11-06 00:19:58    NaN
9 2018-11-06 00:20:58    9.0

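My general idea is the same as jpp's answer: instead of iterating over the dataframe (which is slow for the amount of data you have), identify only the rows of interest and work with those. The main differences are 1) turning multiple columns to NA, and 2) adjusting the NA row timestamp to be halfway between the surrounding times.

I've added explanations as comments.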

import numpy as np
import pandas as pd

# after you read in your data, make sure the time column is actually a datetime
df['datetime'] = pd.to_datetime(df['datetime'])

# calculate the (time) difference between a row and the previous row
df['time_diff'] = df['datetime'].diff()

# create a subset of your df where the time difference is greater than
# some threshold. This will be a dataframe of your empty/NA rows.
# I've set a 2 second threshold here because of the sample data you provided,
# but could be any number of seconds
empty = df[df['time_diff'].dt.total_seconds() > 2].copy()

# calculate the timestamp for the NA rows: halfway between each row and the one before it.
# Use each row's own time_diff; a .shift(-1) here would pick up the size of the
# next gap instead and leave NaT for the last gap.
empty['datetime'] = empty['datetime'] - (empty['time_diff'] / 2)

# set all the columns to NA apart from the datetime column
empty.loc[:, ~empty.columns.isin(['datetime'])] = np.nan

# add this NA/empty dataframe to your original data, and sort by time
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df = pd.concat([df, empty], ignore_index=True)
df = df.sort_values('datetime').reset_index(drop=True)

# optionally, remove the time_diff column we created at the beginning
df = df.drop(columns='time_diff')

This will give you something like:
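The output from the original answer is not reproduced here. As a stand-in, the following is a minimal sketch that runs the same steps on a small made-up dataframe; the `temp` and `pressure` columns and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up sample data: one reading every 2 seconds, with a 10-minute outage
df = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2018-11-06 00:00:00', '2018-11-06 00:00:02', '2018-11-06 00:00:04',
        '2018-11-06 00:10:04', '2018-11-06 00:10:06',
    ]),
    'temp': [20.0, 20.1, 20.2, 21.0, 21.1],
    'pressure': [1.0, 1.1, 1.2, 1.3, 1.4],
})

# time since the previous row
df['time_diff'] = df['datetime'].diff()

# rows that come right after a gap become the template for the NA rows
empty = df[df['time_diff'].dt.total_seconds() > 2].copy()

# move each template row back to the middle of its gap and blank its values
empty['datetime'] = empty['datetime'] - empty['time_diff'] / 2
empty[['temp', 'pressure']] = np.nan

# combine, sort by time, and drop the helper column
result = (pd.concat([df, empty], ignore_index=True)
            .sort_values('datetime')
            .reset_index(drop=True)
            .drop(columns='time_diff'))
print(result)
```

With this sample, the single inserted row lands at 00:05:04, halfway through the outage, and carries NaN in both value columns, which is what makes a plotting library break the line there.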

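Comments:

Can you provide a simple example of what your data looks like, so that others can work out a solution for your data?

DFs are not row-extensible: appending a row takes linear time and space. So if you append n rows in a loop, the loop takes O(n^2) time, which blows up quickly.

Nice edit, +1 :). The only change I would make is that you don't need to explicitly add a series df['time_diff'] and then remove it afterwards. You can store it in a variable, i.e. time_diff = df['datetime'].diff(), and use the Boolean series for comparison/indexing.

I tried this approach and couldn't quite get it to work. The code was copied and pasted exactly. Generating time_diff works fine and produces the correct time differences, but the line empty['datetime'] = empty['datetime'] - (empty['time_diff'].shift(-1) / 2) produces NaT for some reason when I print the resulting series. How does shift know what the period is?

Never mind, I figured it out. (empty['time_diff'].shift(-1) / 2) just needs to be changed to (empty['time_diff'] / 2). I tested it and it works well, placing the NaN rows halfway between the neighboring values. Thank you for your help. My only other experience is with C, and it seems that working with Python and its libraries is conceptually more different than I thought! Looks like I'll have to learn by doing.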
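The suggestion from the comments (keep the time differences in a variable rather than adding and dropping a column) could look roughly like the sketch below. The function name `insert_gap_rows`, its parameters, and the sample data are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def insert_gap_rows(df, time_col='datetime', threshold='1min'):
    """Return a copy of df with one all-NaN row in the middle of every gap
    in time_col that is longer than threshold."""
    time_diff = df[time_col].diff()           # kept in a variable, not as a column
    is_gap = time_diff > pd.Timedelta(threshold)

    # rows right after a gap serve as templates for the NaN rows
    empty = df[is_gap].copy()
    empty[time_col] = empty[time_col] - time_diff[is_gap] / 2

    # blank every column except the time column
    value_cols = [c for c in df.columns if c != time_col]
    empty[value_cols] = np.nan

    return (pd.concat([df, empty], ignore_index=True)
              .sort_values(time_col)
              .reset_index(drop=True))


# Made-up sample data: 2-second readings with a 10-minute outage
sample = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2018-11-06 00:00:00', '2018-11-06 00:00:02',
        '2018-11-06 00:10:02', '2018-11-06 00:10:04',
    ]),
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [0.1, 0.2, 0.3, 0.4],
})
with_gaps = insert_gap_rows(sample, threshold='1min')
print(with_gaps)
```

Passing the threshold as an argument keeps the "one minute" from the question and the "2 seconds" from the answer above as a single tunable value.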
