Python: Loop through all values (strings) in a column and, if not unique, append the values to another column (text processing)


I'm looking for a solution to the following problem:

import pandas as pd

rows = {'Id': ['xb01', 'nt02', 'tw02', 'dt92', 'tw03', 'we04', 'er04', 'ew06', 're07', 'ti92'],
        'DatasetName': ['first label', 'second label', 'third     label', 'fourth label',
                        'third label', 'third label', 'third label', 'fourth label',
                        'first  label', 'last label'],
        'Target': ['first label', 'second label', 'the    third labels', 'fourth label set',
                   'third    label', 'third label', 'third label  sets',
                   'fourth label    sets', 'first label', 'last labels']
        }

df = pd.DataFrame(rows, columns=['Id', 'DatasetName', 'Target'])

print(df)
The DataFrame looks like this:

     Id      DatasetName                      Target

   xb01         first label              first label
   nt02        second label             second label
   tw02     third     label      the    third labels
   dt92        fourth label         fourth label set
   tw03         third label           third    label
   we04         third label              third label
   er04         third label        third label  sets
   ew06        fourth label     fourth label    sets
   re07        first  label              first label
   ti92          last label              last labels
Pseudocode:

   for i in range(len(df)):
      if DatasetName[i] is unique:
         if DatasetName[i] != Target[i]:
            Target[i] = DatasetName[i] + '|' + Target[i]
      else:
         loop through the dataframe, find all labels that belong to the same DatasetName,
         and append all those Target names together. (Note: if the DatasetName is not the
         same as any of the Target names, the DatasetName should also be appended to the Target.)
Here we can see:

   DatasetName    Appeared   Target

   first label    2          first label
   second label   1          second label
   third label    4          the third labels | third label | third label sets
   fourth label   2          fourth label set | fourth label sets | fourth label
   last label     1          last labels | last label
Expected output:

   Id    DatasetName                                           Target

 xb01    first label                                      first label
 nt02   second label                                     second label
 tw02    third label    the third labels|third label|third label sets
 dt92   fourth label  fourth label set|fourth label sets|fourth label
 tw03    third label    the third labels|third label|third label sets
 we04    third label    the third labels|third label|third label sets
 er04    third label    the third labels|third label|third label sets
 ew06   fourth label  fourth label set|fourth label sets|fourth label
 re07    first label                                      first label
 ti92     last label                           last labels|last label
            
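Note: the actual DataFrame has 100,000 rows. The strings may still contain extra spaces (I have already lowercased the DataFrame, removed all the extra tokens, and so on). There may be some mistakes (typos) in this question, since I copied and pasted it several times, but hopefully you get the idea of the solution I'm looking for. Thanks, everyone!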

Let's try groupby with apply, then merge the result back:

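import pandas as pd

rows = {'Id': ['xb01', 'nt02', 'tw02', 'dt92', 'tw03', 'we04',
               'er04', 'ew06', 're07', 'ti92'],
        'DatasetName': ['first label', 'second label', 'third     label',
                        'fourth label', 'third label', 'third label',
                        'third label', 'fourth label',
                        'first  label', 'last label'],
        'Target': ['first label', 'second label', 'the    third labels',
                   'fourth label set', 'third    label',
                   'third label', 'third label  sets',
                   'fourth label    sets', 'first label', 'last labels']
        }

df = pd.DataFrame(rows, columns=['Id', 'DatasetName', 'Target'])

# Fix spacing in columns (collapse each run of whitespace to one space)
df = df.replace({r'\s+': ' '}, regex=True)

# Get unique matches: the dataset name (s.name is the group key), then its targets.
# pd.concat is used because Series.append was removed in pandas 2.0.
matches = df.groupby('DatasetName')['Target'] \
    .apply(lambda s: '|'.join(pd.concat([pd.Series([s.name]), s]).unique())) \
    .rename('Target')

# Merge back to the original DataFrame
merged = df.drop(columns=['Target']).merge(matches, on='DatasetName', how='left')

# Display
print(merged.to_string())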
Output:

     Id   DatasetName                                           Target
0  xb01   first label                                      first label
1  nt02  second label                                     second label
2  tw02   third label    third label|the third labels|third label sets
3  dt92  fourth label  fourth label|fourth label set|fourth label sets
4  tw03   third label    third label|the third labels|third label sets
5  we04   third label    third label|the third labels|third label sets
6  er04   third label    third label|the third labels|third label sets
7  ew06  fourth label  fourth label|fourth label set|fourth label sets
8  re07   first label                                      first label
9  ti92    last label                           last label|last labels
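If the label order from the question matters (targets in order of first appearance, with the dataset name appended only when it is not already among them), a variation along the following lines might work. This is only a sketch under a few assumptions: `combine_labels` is a hypothetical helper name, whitespace is normalized with `str.split()` / `str.join(' ')`, and the result is mapped back onto the rows instead of merged.

```python
import pandas as pd

df = pd.DataFrame({
    'Id': ['xb01', 'nt02', 'tw02', 'dt92', 'tw03',
           'we04', 'er04', 'ew06', 're07', 'ti92'],
    'DatasetName': ['first label', 'second label', 'third     label',
                    'fourth label', 'third label', 'third label',
                    'third label', 'fourth label',
                    'first  label', 'last label'],
    'Target': ['first label', 'second label', 'the    third labels',
               'fourth label set', 'third    label', 'third label',
               'third label  sets', 'fourth label    sets',
               'first label', 'last labels'],
})

# Collapse repeated whitespace (and trim the ends) in both text columns
for col in ['DatasetName', 'Target']:
    df[col] = df[col].str.split().str.join(' ')


def combine_labels(frame):
    """Hypothetical helper: one '|'-joined label string per row."""
    # Targets come first, dataset names last, so that dropping duplicate
    # (DatasetName, label) pairs keeps the targets' order of first appearance
    # and only adds the dataset name when no target already equals it.
    targets = frame[['DatasetName', 'Target']]
    names = frame[['DatasetName']].assign(Target=frame['DatasetName'])
    labels = pd.concat([targets, names], ignore_index=True).drop_duplicates()
    joined = labels.groupby('DatasetName')['Target'].agg('|'.join)
    # Look up each row's dataset name in the per-group result
    return frame['DatasetName'].map(joined)


df['Target'] = combine_labels(df)
print(df.to_string())
```

Because the work is done with `concat`, `drop_duplicates`, and a single `groupby`, there is no Python-level loop over rows, which may help with a DataFrame of 100,000 rows.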
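Hi, I understand what you're looking for. I ran into the same problem in the past and solved it with networkx.
Thanks @manuzambo, I'll give it a try. But if you could put together a demo, I'd appreciate it :)
Take a look at this question: there is a demo in the answer.
Thanks @manuzambo, I think I need to dig into it a bit more. Thanks :)
Thank you @Henry Ecker. It's perfect now. I wish I could upvote twice. Thanks for the great help!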