Python: listing relationships between columns
Tags: python, pandas, pandas-groupby
I have the DataFrame below:
import pandas as pd

df = pd.DataFrame({
    'cnpj': [410000132, 410000132, 4830624000197, 4830624000197, 4830624000197],
    'Nome Pessoa': ['EUGENIO LUPORINI NETO', 'JUAN MATIAS SERAGOPIAN',
                    'EUGENIO LUPORINI NETO', 'SIMONE FANKHAUSER', 'ALEX SOUZA'],
})
print(df)
cnpj Nome Pessoa
0 410000132 EUGENIO LUPORINI NETO
1 410000132 JUAN MATIAS SERAGOPIAN
2 4830624000197 EUGENIO LUPORINI NETO
3 4830624000197 SIMONE FANKHAUSER
4 4830624000197 ALEX SOUZA
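Each cnpj is a company and each Nome Pessoa is a person. For every Nome Pessoa, I want to list the other people who appear under the same cnpj as that person (preferably without duplicates). In other words, I want to list the relationships between people, using cnpj as the key. The resulting df would look like this (or at least close to it):
For example, df['Relations'][0] = ['JUAN MATIAS SERAGOPIAN', 'SIMONE FANKHAUSER', 'ALEX SOUZA'], because JUAN MATIAS SERAGOPIAN appears under the same cnpj as EUGENIO LUPORINI NETO (410000132), and SIMONE FANKHAUSER and ALEX SOUZA appear together with EUGENIO under another cnpj (4830624000197).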
I suspect this is a job for groupby, but I am not sure how to implement it.

The following approach works:
In[0]:
def add_relations(row):
    current_name = row['Nome Pessoa']
    # Every cnpj (company) the current person appears under
    cnpjs = df.loc[df['Nome Pessoa'] == current_name, 'cnpj']
    # Rows belonging to any of those companies, excluding the person themselves
    related = df['cnpj'].isin(cnpjs) & (df['Nome Pessoa'] != current_name)
    return list(df.loc[related, 'Nome Pessoa'])

df['Relations'] = df.apply(add_relations, axis=1)
df
Out[0]:
cnpj Nome Pessoa \
0 410000132 EUGENIO LUPORINI NETO
1 410000132 JUAN MATIAS SERAGOPIAN
2 4830624000197 EUGENIO LUPORINI NETO
3 4830624000197 SIMONE FANKHAUSER
4 4830624000197 ALEX SOUZA
Relations
0 [JUAN MATIAS SERAGOPIAN, SIMONE FANKHAUSER, AL...
1 [EUGENIO LUPORINI NETO]
2 [JUAN MATIAS SERAGOPIAN, SIMONE FANKHAUSER, AL...
3 [EUGENIO LUPORINI NETO, ALEX SOUZA]
4 [EUGENIO LUPORINI NETO, SIMONE FANKHAUSER]
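This uses apply, which scans the whole DataFrame once per row, so it is not the fastest option, but it may be perfectly fine depending on how much data you have.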
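Update: I also tried building something with groupby and came up with the approach below. It works too, but it feels less than ideal because it calls groupby twice and relies on a rather ugly list comprehension. I suspect there is a better answer, but it escapes me.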
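# cnpj -> list of names, and name -> list of cnpjs
num_to_name = df.groupby('cnpj')['Nome Pessoa'].apply(list)
name_to_num = df.groupby('Nome Pessoa')['cnpj'].apply(list)
# Temporarily store each person's cnpjs, then replace them with the names found under those cnpjs
df['Relations'] = df['Nome Pessoa'].map(name_to_num)
# .values.sum() concatenates the per-cnpj name lists into one list
df['Relations'] = [
    [x for x in num_to_name.loc[df.loc[i, 'Relations']].values.sum()
     if x != df.loc[i, 'Nome Pessoa']]
    for i in df.index
]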
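
One possible way to avoid both the row-wise apply and the nested list comprehension is a self-merge on cnpj. This is only a sketch under the assumptions of the example above: the column name `Nome Pessoa_other` and the variable names `pairs` and `relations` are made up for illustration, and a person who shares no company with anyone would end up with NaN rather than an empty list. Note also that the self-join creates one row per pair of people within a company, so it can get large when companies have many members.

```python
import pandas as pd

df = pd.DataFrame({
    'cnpj': [410000132, 410000132, 4830624000197, 4830624000197, 4830624000197],
    'Nome Pessoa': ['EUGENIO LUPORINI NETO', 'JUAN MATIAS SERAGOPIAN',
                    'EUGENIO LUPORINI NETO', 'SIMONE FANKHAUSER', 'ALEX SOUZA'],
})

# Self-join on cnpj: one row for every pair of people sharing a company.
# 'Nome Pessoa_other' is a hypothetical column name produced by the suffix.
pairs = df.merge(df, on='cnpj', suffixes=('', '_other'))

# Drop pairs of a person with themselves, then repeated (person, other) pairs.
pairs = pairs[pairs['Nome Pessoa'] != pairs['Nome Pessoa_other']]
pairs = pairs.drop_duplicates(subset=['Nome Pessoa', 'Nome Pessoa_other'])

# Collect one list of related people per person and map it back onto the rows.
relations = pairs.groupby('Nome Pessoa')['Nome Pessoa_other'].apply(list)
df['Relations'] = df['Nome Pessoa'].map(relations)

print(df)
```

Because the duplicates are dropped before grouping, each related name should appear at most once per person, which matches the "preferably without duplicates" requirement in the question.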
Thanks for the help with the above.

You can use apply with a query inside it, and then attach the result to the DataFrame:
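def get_relations(row, df):
    row_cnpj = row['cnpj']
    row_name = row['Nome Pessoa']
    # Backticks quote a column name containing a space (needs pandas 0.25 or later)
    query = df.query('cnpj == @row_cnpj and `Nome Pessoa` != @row_name')
    # Only people under this row's cnpj are considered here
    row['Relations'] = query['Nome Pessoa'].tolist()
    return row

df = df.apply(lambda x: get_relations(x, df), axis=1)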