Quickly sampling a large number of rows from a large DataFrame in Python

python pandas dataframe

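I have a very large DataFrame (about 1.1 million rows) and I am trying to sample it.

I have a list of indices (about 70,000 of them) that I want to select from the whole DataFrame.

This is what I have tried so far, but all of these approaches take too much time:

Method 1 - using pandas: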

import pandas

# Load the whole file, then keep only the rows whose Id is in the sample list
sample = pandas.read_csv("data.csv", index_col=0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]

Method 2:

I tried writing all of the sampled rows to another CSV file:

f = open("data.csv", 'r')

out = open("sampled_date.csv", 'w')
out.write(f.readline())  # copy the header row

total = 0  # number of lines visited
while 1:
    total += 1
    line = f.readline().strip()

    if line == '':
        break
    arr = line.split(",")

    # the first field of each row is its Id
    if (int(arr[0]) in sample_index_array):
        out.write(line + "\n")

f.close()
out.close()

Can anyone suggest a better approach, or how I could modify these to make them faster?

Thanks

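We don't have your data, so here is an example with two options:

After reading: use an Index object via .iloc
While reading: with a predicate that filters rows as the file is parsed

Given

A collection of indices and a (large) sample DataFrame written to test.csv: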
import pandas as pd
import numpy as np


indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)
df.info()

Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
A    1000000 non-null int32
B    1000000 non-null int32
C    1000000 non-null int32
D    1000000 non-null int32
dtypes: int32(4)
memory usage: 15.3 MB

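Option 1 - after reading
Convert the list of indices to an Index object and slice the loaded DataFrame with .iloc. Scalar accessors such as .at and .iat are faster than .iloc, but they only accept scalar indices.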
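
A minimal sketch of the after-reading option, assuming the indices are row positions in the sample frame. The names idxs, subset and value are illustrative, and the frame is rebuilt here only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]

# Same shape as the sample frame; the values are random.
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))

# Wrap the positions in an Index and select them in one vectorized call.
idxs = pd.Index(indices)
subset = df.iloc[idxs, :]
print(subset)

# Scalar accessor: fetches a single cell by row label and column name.
value = df.at[10, "A"]
```

This still loads the full file into memory first, so it mainly helps when the DataFrame is needed for other work anyway.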

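Option 2 - while reading (recommended)
We can write a predicate that keeps only the selected indices while the file is being read, which is more efficient: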
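
One way to express such a predicate is the callable form of the skiprows parameter of pandas.read_csv, sketched below. The choice of skiprows, the set named keep, and the placeholder column name "row" are assumptions made for illustration; the sample file is rewritten first so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]

# Recreate the headerless sample file so the sketch runs on its own.
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)

# A set gives constant-time membership tests; the predicate runs once per line.
keep = set(indices)

# skiprows is called with each 0-based line number and returns True to skip it.
subset = pd.read_csv(
    "test.csv",
    skiprows=lambda x: x not in keep,
    header=None,
    names=["row"] + list("ABCD"),  # "row" is a placeholder name for the saved index
    index_col="row",
)
subset.index.name = None  # drop the placeholder name so the frame prints like df
print(subset)
```

Because test.csv has no header row, line number and row index coincide here. With a header row the line numbers would be shifted by one, and the predicate would have to keep line 0 as well.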

Results
The latter option produces the same output:
        A   B   C   D
1      74  95  28   4
2      87   3  49  94
3      53  54  34  97
10     58  41  48  15
20     86  20  92  11
30     36  59  22   5
67     49  23  86  63
78     98  63  60  75
900    26  11  71  85
2176   12  73  58  91
78776  42  30  97  96
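
If the values to keep live in a column rather than being row positions, as with the Id column in the question, a similar filter-while-reading effect can be approximated by reading in chunks. This is only a sketch under assumptions: the file data_sketch.csv, its contents, the chunk size and the sample_ids list are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data.csv: an "Id" column plus one value column.
rng = np.random.default_rng(0)
data = pd.DataFrame({"Id": np.arange(100000) * 3, "A": rng.integers(0, 100, size=100000)})
data.to_csv("data_sketch.csv", index=False)

sample_ids = [3, 30, 299997, 7]  # 7 does not occur in the file and is simply not matched

# Read the file piece by piece and keep only the rows whose Id is wanted,
# so the full file never has to sit in memory at once.
chunks = pd.read_csv("data_sketch.csv", chunksize=20000)
sample = pd.concat([chunk[chunk["Id"].isin(sample_ids)] for chunk in chunks])
print(sample)
```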

If I understand correctly, you can probably turn your indices into a pandas Index object, then feed that object to the DataFrame to slice it directly.

Thanks! This should work! Out of curiosity, is there a way to slice these rows while reading the data?

OK, I'll give it a try. Thanks.