Python: getting rows from a DataFrame using a list of slices

python pandas performance dataframe indexing

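I have a DataFrame with a few million rows, and a list of interesting sections I need to select from it. I'm looking for the most efficient (read: fastest possible) way of doing this.

I know I can do: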

slices = [slice(0,10), slice(20,50), slice(1000,5000)]
for s in slices:  # avoid shadowing the built-in `slice`
    df.loc[s, 'somecolumn'] = True

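...but that seems like an inefficient way of getting the job done. It is really slow.

This seems faster than the for loop above, but I'm not sure it is the best approach: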

from itertools import chain
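# slice objects are not iterable, so expand each one into a range first
# (note: a range excludes the stop label, unlike a .loc slice)
ranges = list(chain.from_iterable(range(s.start, s.stop) for s in slices))
df.loc[ranges, 'somecolumn'] = True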

This doesn't work either, although it seems like it should:

df.loc[slices, 'somecolumn'] = True

TypeError: unhashable type: 'slice'

My main concern is performance. I need the best I can get out of this, because of the size of the DataFrames I'm dealing with.

IIUC, you want to slice on axis=0 (the row index). Instead of slices, I would use NumPy's arange and index with df.ix.
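The code for this answer is missing from the page. What follows is only a minimal sketch of the idea, not the answerer's original code. It assumes the slice bounds are positions on a default RangeIndex, and it uses loc in place of the deprecated ix. The frame, the column name and the `bounds` list are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a default RangeIndex (labels equal positions).
df = pd.DataFrame({"somecolumn": np.zeros(6000, dtype=bool)})

# Hypothetical (start, stop) bounds standing in for the slices in the question.
bounds = [(0, 10), (20, 50), (1000, 5000)]

# Expand every half-open range with np.arange and join them into one array.
rows = np.concatenate([np.arange(start, stop) for start, stop in bounds])

# A single label-based assignment replaces the Python-level loop.
df.loc[rows, "somecolumn"] = True
```

Because np.arange excludes the stop value, row 10 stays False here, whereas a label-based slice such as df.loc[0:10] would include it.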

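pandas: you can try a couple of tricks.

First, use np.r_ to concatenate the slice objects into a single NumPy array. Indexing with NumPy arrays is usually efficient, since these arrays are used internally by pandas. Second, use positional integer indexing via iloc rather than the primarily label-based loc. The former is more restrictive and more closely aligned with NumPy indexing. Here is a demo: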

import numpy as np
import pandas as pd

# some example dataframe
df = pd.DataFrame(dict(zip('ABCD', np.arange(100).reshape((4, 25)))))

# concatenate multiple slices
slices = np.r_[slice(0, 3), slice(6, 10), slice(15, 20)]

# use integer indexing
df.iloc[slices, df.columns.get_loc('C')] = 0
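To make the effect of the demo concrete, the sketch below repeats it and checks which rows were changed. The names `rows` and `touched` are introduced only for illustration. Column C originally holds 50 to 74, so any zero in it must come from the assignment.

```python
import numpy as np
import pandas as pd

# Same example frame as in the demo: four columns, 25 rows each.
df = pd.DataFrame(dict(zip('ABCD', np.arange(100).reshape((4, 25)))))

# np.r_ expands the slices into one flat array of integer positions.
rows = np.r_[slice(0, 3), slice(6, 10), slice(15, 20)]
print(rows.tolist())  # [0, 1, 2, 6, 7, 8, 9, 15, 16, 17, 18, 19]

# Positional assignment into column C only.
df.iloc[rows, df.columns.get_loc('C')] = 0

# Positions where C is now zero; these match `rows` exactly.
touched = np.flatnonzero(df['C'].to_numpy() == 0)
print(touched.tolist() == rows.tolist())  # True
```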

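NumPy: if the series is held in a contiguous memory block (usually the case for numeric or Boolean arrays), you can try updating the underlying NumPy array in place. First define slices via np.r_ as above, then use: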

df['C'].values[slices] = 0

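This bypasses the pandas interface and any associated checks performed by the regular indexing methods. It relies on .values returning a writable view of the column. With Copy-on-Write enabled (the default from pandas 3.0), that array is read-only and the assignment raises an error.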

You could try building the full row indexer first, then doing the assignment:

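# pd.concat does not accept Index objects, so join the labels with NumPy;
# `column` stands for the target column label
row_indexer = np.concatenate([df.index[sub_slice] for sub_slice in slices])
df.loc[row_indexer, column] = True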

Note: ix has been deprecated since v0.20.0. Use loc instead.

What if you have already created a list of slices and want to use it with pandas and np.r_? Using np.r_[slices] directly does not work unless the slices are first converted to arrays. I could do something like np.concatenate([np.r_[s] for s in slices]) and then use the result as the index into the DataFrame. Is there a better way?
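One possible answer to the last comment, offered as a sketch rather than something stated in the thread: indexing np.r_ with a tuple of slices is the same as writing the slices out by hand, so converting the list to a tuple should be enough. This assumes every slice has an explicit start and stop. The frame and column name are illustrative.

```python
import numpy as np
import pandas as pd

# The pre-built list of slices from the question.
slices = [slice(0, 10), slice(20, 50), slice(1000, 5000)]

# np.r_[tuple(slices)] is equivalent to np.r_[0:10, 20:50, 1000:5000].
rows = np.r_[tuple(slices)]

# Illustrative frame; the positions in `rows` are used with iloc.
df = pd.DataFrame({"somecolumn": np.zeros(6000, dtype=bool)})
df.iloc[rows, df.columns.get_loc("somecolumn")] = True
```

This avoids calling np.r_ once per slice and concatenating the pieces afterwards.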