Python: iterating over the rows of a DataFrame

I have a DataFrame in which X and Y are cell coordinates and mRNA is the number of mRNA molecules in each cell:

        ID        X        Y  mRNA
0        0  149.492  189.153     0
1        1  115.084  194.082     2
2        2  135.331  194.831     7
3        3  136.965  184.493     2
4        4  124.025  190.069     1
...    ...      ...      ...   ...
2410  2410  452.596  256.313     0
2411  2411  196.448  333.959    46
2412  2412  190.779  318.418    71
2413  2413  202.941  335.446    37
2414  2414  254.967  369.431    13

I am currently trying to apply the formula below, but I can't get it to work. Ideally, I would like to compute:

For ID 0: sqrt[((X0-X1)^2)+((Y0-Y1)^2)]
          sqrt[((X0-X2)^2)+((Y0-Y2)^2)]
          ............
          sqrt[((X0-Xn)^2)+((Y0-Yn)^2)]

(where n is the last cell ID in my CSV file, 2414)

The same calculation then has to be done for ID 1 against all cells, then for ID 2, and so on.

import pandas as pd
import numpy as np

df=pd.read_csv('Detailed2.csv', sep=',')
print(df)

df1 = np.sqrt(((df['X'].sub(df['X']))^2).add((df['Y'].sub(df['Y']))^2)).to_frame('col')
print(df1)

This code does not work.
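Two things in the attempt above are likely to be the cause. In Python, `^` is bitwise XOR rather than exponentiation, so on floating-point columns it generally raises an error instead of squaring. And `df['X'].sub(df['X'])` subtracts a column from itself element by element, which is zero everywhere. A minimal sketch of the intended calculation for a single reference cell, using a made-up three-row frame (the values are illustrative, not from `Detailed2.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV
df = pd.DataFrame({"ID": [0, 1, 2], "X": [0.0, 3.0, 0.0], "Y": [0.0, 4.0, 1.0]})

# ^ is bitwise XOR in Python: 3 ^ 2 is 1, whereas 3 ** 2 is 9
xor_example = 3 ^ 2
power_example = 3 ** 2

# A column minus itself is zero in every row
self_diff = df["X"].sub(df["X"])

# Distance from cell 0 to every cell: subtract the scalar coordinates of cell 0
x0, y0 = df.loc[0, ["X", "Y"]]
dist_from_0 = np.sqrt((df["X"] - x0) ** 2 + (df["Y"] - y0) ** 2)
print(dist_from_0.tolist())
```

Repeating the last step for every cell gives the full set of pairwise distances.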

Use:

for Id in df['ID']:
    df[f'new_col_{Id}']=( df[['X','Y']].sub(df.loc[df['ID'].eq(Id),['X','Y']].values)
                                     .pow(2)
                                     .sum(axis=1)
                                     .pow(1/2) )

print(df)

Output:

          ID        X        Y  mRNA   new_col_0   new_col_1   new_col_2  \
0        0  149.492  189.153     0    0.000000   34.759251   15.256920   
1        1  115.084  194.082     2   34.759251    0.000000   20.260849   
2        2  135.331  194.831     7   15.256920   20.260849    0.000000   
3        3  136.965  184.493     2   13.365677   23.889895   10.466337   
4        4  124.025  190.069     1   25.483468    9.800288   12.267937   
2410  2410  452.596  256.313     0  310.455311  343.201176  323.167320   
2411  2411  196.448  333.959    46  152.228918  161.819886  151.960153   
2412  2412  190.779  318.418    71  135.698403  145.565016  135.455628   
2413  2413  202.941  335.446    37  155.751204  166.441079  156.024647   
2414  2414  254.967  369.431    13  208.866304  224.308996  211.655221   

       new_col_3   new_col_4  new_col_2410  new_col_2411  new_col_2412  \
0      13.365677   25.483468    310.455311    152.228918    135.698403   
1      23.889895    9.800288    343.201176    161.819886    145.565016   
2      10.466337   12.267937    323.167320    151.960153    135.455628   
3       0.000000   14.090258    323.698997    160.867375    144.332436   
4      14.090258    0.000000    335.182293    161.088246    144.670530   
2410  323.698997  335.182293      0.000000    267.657802    269.082093   
2411  160.867375  161.088246    267.657802      0.000000     16.542679   
2412  144.332436  144.670530    269.082093     16.542679      0.000000   
2413  164.741133  165.415257    261.896259      6.661097     20.925272   
2414  219.377610  222.073264    227.712326     68.430521     81.990399   

      new_col_2413  new_col_2414  
0       155.751204    208.866304  
1       166.441079    224.308996  
2       156.024647    211.655221  
3       164.741133    219.377610  
4       165.415257    222.073264  
2410    261.896259    227.712326  
2411      6.661097     68.430521  
2412     20.925272     81.990399  
2413      0.000000     62.142457  
2414     62.142457      0.000000

Alternatively, use the itertuples-based solutions by @Trenton McKinney and @Alexander Cécile (recommended).
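The itertuples answers referred to above are not reproduced on this page. As a rough sketch of what an `itertuples`-based version might look like (this is an assumption, not necessarily what those answers do), each row is visited once and its distance to all cells is computed in a vectorized step. The toy frame and the `new_cols` dictionary are illustrative names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the question
df = pd.DataFrame({"ID": [0, 1, 2], "X": [0.0, 3.0, 0.0], "Y": [0.0, 4.0, 1.0]})

new_cols = {}
for row in df.itertuples(index=False):
    # row.X and row.Y are scalars, so the subtraction broadcasts over the whole column
    new_cols[f"new_col_{row.ID}"] = np.sqrt((df["X"] - row.X) ** 2 + (df["Y"] - row.Y) ** 2)

# Build all distance columns at once and attach them to the original frame
result = df.join(pd.DataFrame(new_cols))
print(result)
```

Collecting the columns in a dictionary and joining once avoids inserting thousands of columns into `df` one at a time.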

Solution with apply:

df.join(
df['ID'].apply(lambda x:
              df[['X','Y']].sub(df.loc[df['ID'].eq(x),['X','Y']].values)
                                  .pow(2)
                                  .sum(axis=1)
                                  .pow(1/2))
        .add_prefix('new_col_')
)

Keep in mind that the IDs must not contain duplicates.
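
I would suggest working with the underlying NumPy arrays and SciPy:

from scipy.spatial import distance_matrix

# Coordinates as a plain (n, 2) NumPy array
arr = df[["X", "Y"]].to_numpy()
# (n, n) matrix of pairwise Euclidean distances
dists = distance_matrix(arr, arr)
dist_col_names = "dist_to_" + df["ID"].astype("str")
# The matrix is symmetric, so each row doubles as the column for that ID
for col_name, col in zip(dist_col_names, dists):
    df[col_name] = col

This is probably much faster than looping over the rows.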
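For readers without SciPy, the same pairwise matrix can be sketched with plain NumPy broadcasting. This is an illustrative alternative rather than the method used in the answer above, and the three points are made up:

```python
import numpy as np

# Hypothetical coordinates, one row per cell
coords = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])

# Shape (n, 1, 2) minus shape (1, n, 2) broadcasts to (n, n, 2):
# diff[i, j] holds the coordinate difference between cell i and cell j
diff = coords[:, None, :] - coords[None, :, :]

# Sum the squared differences over the coordinate axis and take the root
dists = np.sqrt((diff ** 2).sum(axis=-1))
print(dists)
```

The intermediate `diff` array has n × n × 2 entries, so for the 2415 cells in the question the result alone holds 2415 × 2415 = 5,832,225 distances. That is worth keeping in mind before attaching the matrix to a DataFrame.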

PMende posted a NumPy solution while I was working on mine, and it is even better. Credit to him.

I like his answer because it doesn't use any explicit loops.

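from io import StringIO

import pandas as pd
import scipy.spatial as spsp

raw_str = \
'''
        ID        X        Y  mRNA
0        0  149.492  189.153     0
1        1  115.084  194.082     2
2        2  135.331  194.831     7
3        3  136.965  184.493     2
4        4  124.025  190.069     1
2410  2410  452.596  256.313     0
2411  2411  196.448  333.959    46
2412  2412  190.779  318.418    71
2413  2413  202.941  335.446    37
2414  2414  254.967  369.431    13
'''

# Parse the whitespace-delimited sample
df_1 = pd.read_csv(StringIO(raw_str), header=0, sep=r'\s+', usecols=[1, 2, 3, 4])
coords = df_1[['X', 'Y']].to_numpy()
# Pairwise Euclidean distances between all cells
distances = spsp.distance_matrix(coords, coords)
# One distance column per cell ID
col_names = df_1['ID'].map(lambda x: f'col_ID{x}').rename()
df_2 = pd.DataFrame(data=distances, columns=col_names)
# Append the distance columns to the original frame
df_3 = pd.concat((df_1, df_2), axis=1)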

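These extra variables obviously cost some performance; they are only there for clarity.

Creating thousands of columns is a bit crazy. Here is a more reasonable solution that stores the distances as a list in each row:
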
from io import StringIO

import pandas as pd
import scipy.spatial as spsp

raw_str = \
'''
        ID        X        Y  mRNA
0        0  149.492  189.153     0
1        1  115.084  194.082     2
2        2  135.331  194.831     7
3        3  136.965  184.493     2
4        4  124.025  190.069     1
2410  2410  452.596  256.313     0
2411  2411  196.448  333.959    46
2412  2412  190.779  318.418    71
2413  2413  202.941  335.446    37
2414  2414  254.967  369.431    13
'''

# Parse the whitespace-delimited sample
df_1 = pd.read_csv(StringIO(raw_str), header=0, sep=r'\s+', usecols=[1, 2, 3, 4])
coords = df_1[['X', 'Y']].to_numpy()
# Pairwise Euclidean distances between all cells
distances = spsp.distance_matrix(coords, coords)
# Store each row of the matrix as a list in a single column
df_1['dist'] = distances.tolist()

df_1:

     ID        X  ...  mRNA                                               dist
0     0  149.492  ...     0  [0.0, 34.759250639218344, 15.256919905406859, ...
1     1  115.084  ...     2  [34.759250639218344, 0.0, 20.26084919246971, 2...
2     2  135.331  ...     7  [15.256919905406859, 20.26084919246971, 0.0, 1...
3     3  136.965  ...     2  [13.36567727427235, 23.889894976746966, 10.466...
4     4  124.025  ...     1  [25.483468072458283, 9.800288261066603, 12.267...
5  2410  452.596  ...     0  [310.45531146366295, 343.201176433007, 323.167...
6  2411  196.448  ...    46  [152.2289183171187, 161.81988637061886, 151.96...
7  2412  190.779  ...    71  [135.69840306355857, 145.56501613025023, 135.4...
8  2413  202.941  ...    37  [155.75120368716253, 166.4410794996235, 156.02...
9  2414  254.967  ...    13  [208.86630390994137, 224.30899556192568, 211.6...
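
With the list-per-row layout, an individual distance is looked up by position in the list rather than by column name. A small sketch of how that might be used, on made-up data and with the distances computed by NumPy broadcasting instead of SciPy so the block has fewer dependencies (both choices are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the answer above
df_1 = pd.DataFrame({"ID": [0, 1, 2], "X": [0.0, 3.0, 0.0], "Y": [0.0, 4.0, 1.0]})
coords = df_1[["X", "Y"]].to_numpy()

# Pairwise Euclidean distances via broadcasting, one list per row
distances = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
df_1["dist"] = distances.tolist()

# Position k in each list is the distance to the cell in row k
d_0_1 = df_1.loc[0, "dist"][1]

# The full matrix can be rebuilt from the column when it is needed again
matrix = np.array(df_1["dist"].tolist())
print(d_0_1, matrix.shape)
```

The list positions follow row order, not the ID values, so a lookup by ID first has to be translated into a row position.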