Python pandas: selecting multiple rows based on column pairs

python pandas


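I'm currently adapting my data analysis pipeline from wide format to tidy/long format, but I've run into a filtering problem that I just can't get my head around.

My data (simplified) looks like this (microscopy intensity data): in each measurement of a group, I have several regions of interest (= roi), for which I observe an intensity (= value) at several timepoints.

An roi is basically a single cell in a microscope image. I'm tracking the intensity (= value) over time (= timepoint). I repeat this experiment several times, looking at several cells each time.

My goal is to filter out, across all timepoints, the ROIs whose intensity value at timepoint 0 is above a threshold I set (I consider these ROIs pre-activated).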

  timepoint     measurement     roi     value   group
0   0                 1          1       0.1    control
1   1                 1          1       0.2    control
2   2                 1          1       0.3    control
3   3                 1          1       0.4    control
4   0                 1          2       0.1    control
5   1                 1          2       0.2    control
6   2                 1          2       0.3    control
7   3                 1          2       0.4    control
8   0                 1          3       0.5    control
9   1                 1          3       0.6    control
10  2                 1          3       0.8    control
11  3                 1          3       0.9    control
12  0                 2          1       0.1    control
13  1                 2          1       0.2    control
14  2                 2          1       0.3    control
15  3                 2          1       0.4    control
16  0                 3          1       0.5    control
17  1                 3          1       0.6    control
18  2                 3          1       0.8    control
19  3                 3          1       0.9    control
20  0                 3          2       0.1    control
21  1                 3          2       0.2    control
22  2                 3          2       0.3    control
23  3                 3          2       0.4    control

Now I can select the rows containing the ROIs whose value at timepoint 0 is above my threshold:

    threshold = 0.4
    pre_activated = df.loc[(df['timepoint'] == 0) & (df['value'] > threshold)]
    pre_activated

timepoint   measurement     roi     value   group
8   0            1           3       0.5    control
16  0            3           1       0.5    control

Now I want to filter out of the original dataframe df all timepoints 0 to 3 of those cells (e.g. measurement 1, roi 3), and this is where I'm stuck.

If I use .isin:

df.loc[~(df['measurement'].isin(pre_activated["measurement"]) & df['roi'].isin(pre_activated["roi"]))]

I get close, but everything for the measurement 1 and roi 1 pair is lost as well (so I think the problem is in my conditional expression).
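The over-matching described above can be reproduced on a reduced table. The following is a minimal sketch, assuming a stand-in dataframe with one row per (measurement, roi) pair rather than the full data; it contrasts the column-wise .isin test with a test on whole (measurement, roi) tuples via pandas MultiIndex.isin.

```python
import pandas as pd

# Stand-in for the question's data: one row per (measurement, roi) pair.
df = pd.DataFrame({
    "measurement": [1, 1, 1, 2, 3, 3],
    "roi":         [1, 2, 3, 1, 1, 2],
})
# Pairs flagged as pre-activated in the question: (1, 3) and (3, 1).
pre_activated = pd.DataFrame({"measurement": [1, 3], "roi": [3, 1]})

# Column-wise isin: each column is tested on its own, so (1, 1) matches too,
# because measurement 1 and roi 1 each appear somewhere in pre_activated.
columnwise = (df["measurement"].isin(pre_activated["measurement"])
              & df["roi"].isin(pre_activated["roi"]))

# Pair-wise isin: compare whole (measurement, roi) tuples instead.
pairs = list(pre_activated[["measurement", "roi"]].itertuples(index=False, name=None))
pairwise = pd.MultiIndex.from_frame(df[["measurement", "roi"]]).isin(pairs)

kept_columnwise = df.loc[~columnwise]  # (1, 1) is wrongly dropped here
kept_pairwise = df.loc[~pairwise]      # only (1, 3) and (3, 1) are dropped
```

With the pair-wise mask, the (1, 1) rows survive, which is the behavior the question asks for.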

I know I can at least use .query for a single measurement/roi pair:

df[~df.isin(df.query('measurement == 1 & roi == 3'))]

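This gets close, although all integers are converted to floats. In addition, the "group" column is now NaN for the removed rows, which makes things difficult when a dataframe holds several groups, each with several measurements and ROIs.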

   timepoint    measurement          roi     value  group
    0   0.0                   1.0        1.0     0.1    control
    1   1.0                   1.0        1.0     0.2    control
    2   2.0                   1.0        1.0     0.3    control
    3   3.0                   1.0        1.0     0.4    control
    4   0.0                   1.0        2.0     0.1    control
    5   1.0                   1.0        2.0     0.2    control
    6   2.0                   1.0        2.0     0.3    control
    7   3.0                   1.0        2.0     0.4    control
    8   NaN                   NaN        NaN     NaN    NaN
    9   NaN                   NaN        NaN     NaN    NaN
    10  NaN                   NaN        NaN     NaN    NaN
    11  NaN                   NaN        NaN     NaN    NaN
    12  0.0                   2.0        1.0     0.1    control
    13  1.0                   2.0        1.0     0.2    control
    14  2.0                   2.0        1.0     0.3    control
    15  3.0                   2.0        1.0     0.4    control
    16  0.0                   3.0        1.0     0.5    control
    17  1.0                   3.0        1.0     0.6    control
    18  2.0                   3.0        1.0     0.8    control
    19  3.0                   3.0        1.0     0.9    control
    20  0.0                   3.0        2.0     0.1    control
    21  1.0                   3.0        2.0     0.2    control
    22  2.0                   3.0        2.0     0.3    control
    23  3.0                   3.0        2.0     0.4    control

I tried using a dict to store the measurement:roi pairs to avoid any mix-up, but I don't know whether that is useful:

msmt_list = pre_activated["measurement"].values
roi_list = pre_activated["roi"].values

mydict={}
for i in range(len(msmt_list)):
    mydict[msmt_list[i]]=roi_list[i]

Output:

   mydict
    {1: 3, 3: 1}
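One caveat with the dict above: it is keyed by measurement, so it can hold only one roi per measurement. The sketch below uses a hypothetical pre_activated table (not the question's data) in which measurement 1 has two flagged ROIs, and shows that a set of (measurement, roi) tuples keeps every pair.

```python
import pandas as pd

# Hypothetical table: measurement 1 has two pre-activated ROIs (3 and 4).
pre_activated = pd.DataFrame({"measurement": [1, 1, 3], "roi": [3, 4, 1]})

# dict keyed by measurement: the later roi overwrites the earlier one.
as_dict = {}
for m, r in zip(pre_activated["measurement"], pre_activated["roi"]):
    as_dict[int(m)] = int(r)

# A set of (measurement, roi) tuples keeps every pair.
as_pairs = {(int(m), int(r))
            for m, r in zip(pre_activated["measurement"], pre_activated["roi"])}
```

Here as_dict loses the pair (1, 3), while as_pairs retains all three pairs.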

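What is the best way to achieve what I want to do? I would be very grateful for any input, including on efficiency, as I usually deal with 3-4 groups, each with 4-8 measurements and up to 200 ROIs, usually with 360 timepoints.

Thanks!

Edit: just to clarify what my desired output dataframes should look like:

'df_pre_activated' (these are the ROIs whose value at timepoint 0 is above my threshold)

'df_filtered' (this is basically the initial 'df' without the data in 'df_pre_activated' shown above)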

The solution is as follows.

First, we compute df_pre_activated_t0 by filtering df with the following condition:

threshold = 0.4
df_pre_activated_t0 = df[(df['timepoint'] == 0) & (df['value'] > threshold)]

df_pre_activated_t0 looks like this:

    timepoint  measurement  roi  value    group
8           0            1    3    0.5  control
16          0            3    1    0.5  control

We compute df_pre_activated by merging df and df_pre_activated_t0 (inner merge):

df_pre_activated = df.merge(
    df_pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"]
)

df_pre_activated looks like this:

   timepoint  measurement  roi  value    group
0          0            1    3    0.5  control
1          1            1    3    0.6  control
2          2            1    3    0.8  control
3          3            1    3    0.9  control
4          0            3    1    0.5  control
5          1            3    1    0.6  control
6          2            3    1    0.8  control
7          3            3    1    0.9  control

To compute df_filtered (df without the rows of df_pre_activated), we do a left merge between df and df_pre_activated and keep the rows that have no match in df_pre_activated:

df_filtered = df.merge(
    df_pre_activated,
    how="left",
    on=["timepoint", "measurement", "roi", "value"]
)
df_filtered = df_filtered[pd.isna(df_filtered["group_y"])]

df_filtered looks like this:
    timepoint  measurement  roi  value  group_x group_y
0           0            1    1    0.1  control     NaN
1           1            1    1    0.2  control     NaN
2           2            1    1    0.3  control     NaN
3           3            1    1    0.4  control     NaN
4           0            1    2    0.1  control     NaN
5           1            1    2    0.2  control     NaN
6           2            1    2    0.3  control     NaN
7           3            1    2    0.4  control     NaN
12          0            2    1    0.1  control     NaN
13          1            2    1    0.2  control     NaN
14          2            2    1    0.3  control     NaN
15          3            2    1    0.4  control     NaN
20          0            3    2    0.1  control     NaN
21          1            3    2    0.2  control     NaN
22          2            3    2    0.3  control     NaN
23          3            3    2    0.4  control     NaN

Finally, we drop the group_y column and set the column names back to their original values:

df_filtered.drop("group_y", axis=1, inplace=True)
df_filtered.columns = list(df.columns)

df_filtered looks like this:
    timepoint  measurement  roi  value    group
0           0            1    1    0.1  control
1           1            1    1    0.2  control
2           2            1    1    0.3  control
3           3            1    1    0.4  control
4           0            1    2    0.1  control
5           1            1    2    0.2  control
6           2            1    2    0.3  control
7           3            1    2    0.4  control
12          0            2    1    0.1  control
13          1            2    1    0.2  control
14          2            2    1    0.3  control
15          3            2    1    0.4  control
20          0            3    2    0.1  control
21          1            3    2    0.2  control
22          2            3    2    0.3  control
23          3            3    2    0.4  control
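As a possible variant of the merge steps above, the sketch below uses the indicator parameter of DataFrame.merge to perform the split in one pass, so no group_x/group_y suffix columns need cleaning up afterwards. It is only an illustration under stated assumptions: the small dataframe is made up, and the name flagged is introduced here and is not part of the original answer.

```python
import pandas as pd

# Reduced example data: two timepoints for three (measurement, roi) pairs.
df = pd.DataFrame({
    "timepoint":   [0, 1, 0, 1, 0, 1],
    "measurement": [1, 1, 1, 1, 3, 3],
    "roi":         [1, 1, 3, 3, 1, 1],
    "value":       [0.1, 0.2, 0.5, 0.6, 0.5, 0.6],
    "group":       ["control"] * 6,
})
threshold = 0.4

# Unique (measurement, roi) pairs that exceed the threshold at timepoint 0.
flagged = df.loc[
    (df["timepoint"] == 0) & (df["value"] > threshold), ["measurement", "roi"]
].drop_duplicates()

# Left merge on the pair only; "_merge" records whether each row found a match.
merged = df.merge(flagged, how="left", on=["measurement", "roi"], indicator=True)

df_pre_activated = merged.loc[merged["_merge"] == "both"].drop(columns="_merge")
df_filtered = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
```

Because the merge keys are only the integer columns measurement and roi, the float column value is not used for matching. If the same measurement and roi numbers recur in several groups, "group" would presumably need to be added to the key columns as well.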

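Simply put:
In:
Out:
This comes down to Boolean logic. You are thinking "show the rows where a is not 1 and b is not 3", which is the same as "show the rows where NOT (a is 1 or b is 3)", and that removes every row containing a 1 and every row containing a 3 from the dataframe.
You have to use "a is not 1 or b is not 3", which is the same as "NOT (a is 1 and b is 3)".
Hope this helps. In one line
Edit: to remove both 1:3 and 3:1, combine the AND and OR conditions: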
df[((df["measurement"] != 1) | (df["roi"] != 3)) & ((df["measurement"] != 3) | (df["roi"] != 1))]
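Writing one clause per pair by hand does not scale to many pairs. The following sketch, assuming a hypothetical list pairs_to_drop and a reduced stand-in dataframe, builds the same kind of mask programmatically: one OR clause per pair (by De Morgan's law), all combined with AND.

```python
from functools import reduce
import operator

import pandas as pd

# Stand-in data: one row per (measurement, roi) pair.
df = pd.DataFrame({
    "measurement": [1, 1, 1, 2, 3, 3],
    "roi":         [1, 2, 3, 1, 1, 2],
})
pairs_to_drop = [(1, 3), (3, 1)]  # hypothetical list of pairs to exclude

# De Morgan: not (m == a and r == b) is the same as (m != a) or (r != b).
keep_masks = [(df["measurement"] != m) | (df["roi"] != r) for m, r in pairs_to_drop]

# A row is kept only if it differs from every pair in the list.
keep = reduce(operator.and_, keep_masks)
result = df[keep]
```

For the two pairs above, this produces the same mask as the hand-written expression.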

Edit2: to drop the filtered rows directly, you can use the reverse approach: filter first, then drop.
In:
Edit3:
Multiple conditions:
threshold = 0.4
full_activated = df5[((df5['timepoint'] != 0) | (df5['value'] < threshold)) & ((df5["measurement"] != 1) | (df5["roi"] != 3)) & ((df5["measurement"] != 3) | (df5["roi"] != 1)) & ((df5["measurement"] != 1) | (df5["roi"] != 1)) ]
full_activated

Thanks to @Jose A. Jimenez and @Vioxini for your answers. I went with Jose's suggestion, and it gave me the result I wanted. I used dask.

inputdf.shape
(73124, 5)

Using pandas only:
import pandas as pd
threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
    
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])
filtereddf = inputdf.merge(
    pre_activated,
    how="left",
    on=["timepoint", "measurement", "roi", "value"],  
    )
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)

This takes 2 min 9 s.
Now using dask:
import pandas as pd
import dask.dataframe as dd
threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]   
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])

input_dd = dd.from_pandas(inputdf, npartitions=3)
pre_dd = dd.from_pandas(pre_activated, npartitions=3)

merger = dd.merge(input_dd,pre_dd, how="left", on=["timepoint", "measurement", "roi", "value"])
filtereddf = merger.compute()
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])] 
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)

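Now it takes only 42.6 s :-)
This is my first time using dask, so there may be options I'm not aware of that could speed things up further, but it is fine for now.
Thanks again for your help.
Edit:
I played with the npartitions option when converting the pandas dataframe to a dask dataframe and increased it from 3 to npartitions=30, which improved performance further: it now takes only 9.87 s.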
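Since efficiency was part of the question, here is a merge-free sketch in plain pandas that may be worth timing against the versions above. It was not benchmarked on the original data, and the example dataframe is made up. It flags the timepoint-0 rows above the threshold, then broadcasts that flag to every row of the same (measurement, roi) pair with groupby(...).transform("any").

```python
import pandas as pd

# Reduced example data: two timepoints for three (measurement, roi) pairs.
df = pd.DataFrame({
    "timepoint":   [0, 1, 0, 1, 0, 1],
    "measurement": [1, 1, 1, 1, 3, 3],
    "roi":         [1, 1, 3, 3, 1, 1],
    "value":       [0.1, 0.2, 0.5, 0.6, 0.5, 0.6],
    "group":       ["control"] * 6,
})
threshold = 0.4

# True only on timepoint-0 rows whose value exceeds the threshold.
above_at_t0 = (df["timepoint"] == 0) & (df["value"] > threshold)

# Broadcast the flag to all timepoints of the same (measurement, roi) pair.
is_pre_activated = above_at_t0.groupby([df["measurement"], df["roi"]]).transform("any")

df_pre_activated = df[is_pre_activated]
df_filtered = df[~is_pre_activated]
```

This keeps the original index and dtypes, since no merge is involved. With several groups in one dataframe, df["group"] would presumably have to be added to the grouping keys.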
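Hi, thanks for your answer. I didn't know about the .merge function in this context - very useful! But how do I get the filtered dataframe (i.e. the original df without everything in df2)? I tried df[~df.isin(df2)].dropna(), but that doesn't work.
You're welcome, glad to help. Do you want to remove exact whole rows, or only the rows in df whose "measurement" and "roi" are in df2?
I want to remove all rows in 'df' that have the measurement and roi pairs of 'df2'. So basically 'df' without 'df2' (see 'df_filtered' at the end of my question).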