Python: check whether the same pattern is present in two DataFrames, using groupby on the pattern

python pandas

Hello, I have a DataFrame df1 such as:

Acc_number
ACC1.1_CP_Sp1_1
ACC2.1_CP_Sp1_1
ACC3.1_CP_Sp1_1
ACC4.1_CP_Sp1_1

and another one, df2, such as:

Cluster_nb SeqName
Cluster1    YP_009216714
Cluster1    YP_002051918
Cluster1    JZSA01005235.1:37071-37973(-):Sp1_1
Cluster1    NW_014464344.1:68901-69716(-):Sp2_3
Cluster1    YP_001956729
Cluster1    ACC1.1_CP_Sp1_1
Cluster1    YP_009213712
Cluster2    ACC2.1_CP_Sp1_1
Cluster2    NR_014464231.1:35866-36717(-):Sp1_1
Cluster2    NR_014464232.1:35889-36788(-):Sp1_1
Cluster2    YP_009213728
Cluster3    ACC3.1_CP_Sp1_1
Cluster3    NK_014464231.1:35772-38898(-):Sp1_2
Cluster3    NZ_014464232.1:3533-78787(+):Sp1_2
Cluster3    YP_009213723
Cluster3    YP_009213739

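For each Acc_number in df1, I would like to check whether the Cluster_nb group (groupby) that contains Acc_number[i] also contains another sequence whose part after (+ or -): is the same as the extension of the accession (the part after _CP_ in Acc_number).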
For example, for ACC1.1_CP_Sp1_1 as i, I get its cluster by doing:
df=df2.loc[df2['SeqName']==i]
Cluster_number=df['Cluster_nb'].iloc[0]
df3=df2.loc[df2['Cluster_nb']==Cluster_number]
print(df3)

Cluster_nb SeqName
Cluster1    YP_009216714
Cluster1    YP_002051918
Cluster1    JZSA01005235.1:37071-37973(-):Sp1_1
Cluster1    NW_014464344.1:68901-69716(-):Sp2_3
Cluster1    YP_001956729

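The sequence in row 3, JZSA01005235.1:37071-37973(-):Sp1_1, has the same Sp1_1 pattern at its end.
So here the answer is yes: ACC1.1_CP_Sp1_1 is in the same cluster as another sequence with the same ending (one that has (- or +): in its name).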
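
The check described above can be sketched for a single accession as follows. This is only an illustration under assumptions: the helper names acc_ending, strand_ending and cluster_has_match are hypothetical, and the regular expressions assume that the accession ending always follows _CP_ and that strand-annotated names always end with (+): or (-): followed by the ending.

```python
import re

def acc_ending(acc_number):
    """Return the part of an accession after '_CP_', e.g. 'Sp1_1'."""
    return re.sub(r'.*_CP_', '', acc_number)

def strand_ending(seq_name):
    """Return the part after '(+):' or '(-):', or None if the name has no strand."""
    m = re.search(r'\([+-]\):(.+)$', seq_name)
    return m.group(1) if m else None

def cluster_has_match(acc_number, seq_names):
    """True if any strand-annotated name in seq_names shares the accession's ending."""
    target = acc_ending(acc_number)
    return any(strand_ending(s) == target for s in seq_names)

# SeqName values of Cluster1 and Cluster3 from the sample data
cluster1 = ['YP_009216714', 'YP_002051918',
            'JZSA01005235.1:37071-37973(-):Sp1_1',
            'NW_014464344.1:68901-69716(-):Sp2_3',
            'YP_001956729', 'ACC1.1_CP_Sp1_1', 'YP_009213712']
cluster3 = ['ACC3.1_CP_Sp1_1',
            'NK_014464231.1:35772-38898(-):Sp1_2',
            'NZ_014464232.1:3533-78787(+):Sp1_2',
            'YP_009213723', 'YP_009213739']

print(cluster_has_match('ACC1.1_CP_Sp1_1', cluster1))
print(cluster_has_match('ACC3.1_CP_Sp1_1', cluster3))
```

Comparing the extracted endings for equality, rather than testing for a substring, avoids treating Sp1_1 as a match for a longer ending such as Sp1_10.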
For ACC3.1_CP_Sp1_1, by doing the same:
df=df2.loc[df2['SeqName']==i]
Cluster_number=df['Cluster_nb'].iloc[0]
df3=df2.loc[df2['Cluster_nb']==Cluster_number]
print(df3)

Cluster3    ACC3.1_CP_Sp1_1
Cluster3    NK_014464231.1:35772-38898(-):Sp1_2
Cluster3    NZ_014464232.1:3533-78787(+):Sp1_2
Cluster3    YP_009213723
Cluster3    YP_009213739

I see that no other sequence in the cluster has the same ending as ACC3.1_CP_Sp1_1, so the answer is no.
The results should be summarized in a df3:
Acc_number present cluster
ACC1.1_CP_Sp1_1 Yes Cluster1
ACC2.1_CP_Sp1_1 Yes Cluster2
ACC3.1_CP_Sp1_1 No NaN
ACC4.1_CP_Sp1_1 No NaN

Thank you very much for your help.
I tried:
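import re
import pandas as pd

rows = []  # collected matches; DataFrame.append was removed in pandas 2.0
for CP in df1['Acc_number']:
  df = df2.loc[df2['SeqName'] == CP]
  if df.empty:
    # this accession is not in any cluster
    continue
  Cluster_number = df['Cluster_nb'].iloc[0]
  df3 = df2.loc[df2['Cluster_nb'] == Cluster_number]
  # ending of the accession, i.e. the part after '_CP_'
  ending = re.sub('.*_CP_', '', CP)
  for a in df3['SeqName']:
    if ('(+)' in a or '(-)' in a) and ending in a:
      rows.append({"Cluster": Cluster_number, "Acc_nb": CP, "present": 'yes'})
      print(CP, 'yes')
      break  # one match per accession is enough
new_df = pd.DataFrame(rows)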

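I put comments in the code itself; the overview is to build a unique identifier for each row, merge the DataFrames, and keep only the columns you are interested in: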
import numpy as np

# Note: in this answer df1 is the cluster table (Cluster_nb, SeqName)
# and df is the accession table (Acc_number).

# create an 'ending' column holding the part of SeqName after the last ':'
# (only for names that contain ':'; the others stay NaN)
df1['ending'] = df1.loc[df1.SeqName.str.contains(':'), 'SeqName']
df1['ending'] = df1['ending'].str.split(':').str[-1]
# prepend the last character of the cluster name, e.g. '1_Sp1_1';
# this serves as the identifier to merge on
df1['ending'] = df1.Cluster_nb.str[-1].str.cat(df1['ending'], sep='_')
# get rid of nulls and duplicates; keep only relevant columns
df1 = df1.dropna().drop('SeqName', axis=1).drop_duplicates('ending')

# create the ending column here as well: the digit after 'ACC' ...
df['ending'] = df['Acc_number'].str.extract(r'((?<=ACC)\d)', expand=False)
# ... joined with the part after '_CP_', e.g. '1_Sp1_1'
# (this relies on ACCn sitting in Clustern, as in the sample data)
df['ending'] = df['ending'].str.cat(df['Acc_number'].str.extract(r'((?<=P_).*)', expand=False), sep='_')

# merge both dataframes
(df
 .merge(df1, on='ending', how='left')
 # keep only relevant columns
 .filter(['Acc_number', 'Cluster_nb'])
 # create present column
 .assign(present=lambda x: np.where(x.Cluster_nb.isna(), 'no', 'yes'))
 .rename(columns={'Cluster_nb': 'cluster'})
)

     Acc_number     cluster     present
0   ACC1.1_CP_Sp1_1 Cluster1    yes
1   ACC2.1_CP_Sp1_1 Cluster2    yes
2   ACC3.1_CP_Sp1_1 NaN         no
3   ACC4.1_CP_Sp1_1 NaN         no
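
As a possible alternative, here is a minimal sketch that follows the question's own naming (df1 holds Acc_number, df2 holds Cluster_nb and SeqName) and looks up each accession's cluster directly instead of relying on the accession digit matching the cluster digit. It is not the answer's method. The regular expressions and the intermediate names acc and strand are assumptions made for illustration, and the sample frames are rebuilt from the data in the question so that the block runs on its own.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Acc_number': ['ACC1.1_CP_Sp1_1', 'ACC2.1_CP_Sp1_1',
                                   'ACC3.1_CP_Sp1_1', 'ACC4.1_CP_Sp1_1']})
df2 = pd.DataFrame(
    [('Cluster1', 'YP_009216714'), ('Cluster1', 'YP_002051918'),
     ('Cluster1', 'JZSA01005235.1:37071-37973(-):Sp1_1'),
     ('Cluster1', 'NW_014464344.1:68901-69716(-):Sp2_3'),
     ('Cluster1', 'YP_001956729'), ('Cluster1', 'ACC1.1_CP_Sp1_1'),
     ('Cluster1', 'YP_009213712'), ('Cluster2', 'ACC2.1_CP_Sp1_1'),
     ('Cluster2', 'NR_014464231.1:35866-36717(-):Sp1_1'),
     ('Cluster2', 'NR_014464232.1:35889-36788(-):Sp1_1'),
     ('Cluster2', 'YP_009213728'), ('Cluster3', 'ACC3.1_CP_Sp1_1'),
     ('Cluster3', 'NK_014464231.1:35772-38898(-):Sp1_2'),
     ('Cluster3', 'NZ_014464232.1:3533-78787(+):Sp1_2'),
     ('Cluster3', 'YP_009213723'), ('Cluster3', 'YP_009213739')],
    columns=['Cluster_nb', 'SeqName'])

# ending of each accession: the part after '_CP_'
acc = df1.assign(ending=df1['Acc_number'].str.extract(r'_CP_(.+)$', expand=False))
# cluster of each accession (NaN if the accession is in no cluster)
acc = acc.merge(df2.rename(columns={'SeqName': 'Acc_number'}),
                on='Acc_number', how='left')

# endings of the strand-annotated sequences, one row per (cluster, ending)
strand = (df2.assign(ending=df2['SeqName'].str.extract(r'\([+-]\):(.+)$', expand=False))
             .dropna(subset=['ending'])[['Cluster_nb', 'ending']]
             .drop_duplicates())

# an accession is "present" if its (cluster, ending) pair exists in strand
result = acc.merge(strand, on=['Cluster_nb', 'ending'], how='left', indicator=True)
result['present'] = np.where(result['_merge'] == 'both', 'Yes', 'No')
result['cluster'] = result['Cluster_nb'].where(result['present'] == 'Yes')
result = result[['Acc_number', 'present', 'cluster']]
print(result)
```

Merging on the pair (Cluster_nb, ending) with indicator=True marks which accessions found a partner in their own cluster, so no loop over df1 is needed.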
