Searching for keywords between two dataframes using Python
Hi, I have two dataframes, as shown below:
DF1
Alpha | Numeric | Special
and, or | 1,2,3,4,5| @,$,&
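and DF2, which has a Content column.
I want to check whether any keyword from any column of DF1 appears in the Content column of DF2, and the output should go into a new DF: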
output_DF
output_column|
Alpha |
Special |
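Could someone help me solve this?

Answer: The problem is a bit complicated, because when a row has multiple matches, as row 2 does, only the match from the first column of df1 should be kept.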
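Comments:

Please format the dataframes properly, because it is not clear what the columns actually contain. It is also not clear what the data is.

Hi Jezrael, I am actually reading df1 and df2 from csv files. Your solution works fine if we create the dictionaries manually, but when I build the dataframes from the csv files it does not work as expected. I get a KeyError on the column name.

Is it possible to send some sample csv data together with your code to my email? Without data this is a hard problem to track down.

OK, I will send it.

How can I get the column names of a dataframe as a list? I tried list(my_dataframe), which works when the dataframe has multiple columns. But in my code, when I call list(my_dataframe) on a dataframe with only one column, I get the values of that column as a list. Is there another way to find the column name when the dataframe has only one column?

@jzrael do you have any solution for this?

Hi @Jezrael, this solution is case-sensitive; it does not work when the same keyword appears in a different case. How can I make it ignore case? Please help me.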
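The two CSV-related questions in the comments can be illustrated with a small sketch. The thread never establishes the real cause of the KeyError, so the padded header used here is only an assumption, and `csv_text` is made-up data standing in for the real file. The second half shows that `list()` on a one-column DataFrame still returns the column labels, whereas `list()` on a Series (what single-bracket selection returns) yields the values, which may be what happened in the comment above.

```python
import io

import pandas as pd

# Hypothetical CSV content; the header is padded with spaces on purpose.
csv_text = " Content \nboy or girl\nschool @ morn\n"

df = pd.read_csv(io.StringIO(csv_text))
raw_names = df.columns.tolist()   # labels exactly as read from the file

# Assumption: the KeyError came from stray whitespace in the header.
# Stripping the labels makes df['Content'] work again.
df.columns = df.columns.str.strip()
column_names = df.columns.tolist()

# list() on a DataFrame gives column labels, even with a single column.
names_via_list = list(df)

# Single-bracket selection returns a Series; list() on it gives the values,
# and the column label is stored in .name instead.
series = df['Content']
series_name = series.name
series_values = list(series)

print(raw_names, column_names, series_name)
```

If the values come back instead of the names, checking `type(my_dataframe)` should show whether the object is a Series rather than a DataFrame.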
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Alpha': ['and', 'or', None, None, None],
                    'Numeric': ['1', '2', '3', '4', '5'],
                    'Special': ['@', '$', '&', None, None]})
print(df1)
  Alpha Numeric Special
0   and       1       @
1    or       2       $
2  None       3       &
3  None       4    None
4  None       5    None
df2 = pd.DataFrame({'Content': ['boy or girl', 'school @ morn',
                                '1 school @ morn', 'Pechi']})
print(df2)
           Content
0      boy or girl
1    school @ morn
2  1 school @ morn
3            Pechi
# reshape df1 to long format: one row per keyword, remembering the column
# position (col_order) so that earlier columns can win later
df1.columns = [np.arange(len(df1.columns)), df1.columns]
df11 = (df1.unstack()
           .reset_index(level=2, drop=True)
           .rename_axis(('col_order', 'col_name'))
           .dropna()
           .reset_index(name='val'))
print(df11)
   col_order col_name  val
0          0    Alpha  and
1          0    Alpha   or
2          1  Numeric    1
3          1  Numeric    2
4          1  Numeric    3
5          1  Numeric    4
6          1  Numeric    5
7          2  Special    @
8          2  Special    $
9          2  Special    &
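The same long table can probably be built with `melt`, which avoids putting a temporary MultiIndex on the columns of df1. This is a sketch of an alternative, not the code from the answer; the name `df11_alt` is made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'Alpha': ['and', 'or', None, None, None],
                    'Numeric': ['1', '2', '3', '4', '5'],
                    'Special': ['@', '$', '&', None, None]})

# Position of each column in df1; lower numbers take priority later on.
col_position = {name: pos for pos, name in enumerate(df1.columns)}

# melt stacks the columns on top of each other in their original order.
df11_alt = (df1.melt(var_name='col_name', value_name='val')
               .dropna(subset=['val'])
               .reset_index(drop=True))
df11_alt['col_order'] = df11_alt['col_name'].map(col_position)

print(df11_alt)
```

Unlike the reshape above, this leaves the columns of df1 unchanged, so df1 can be reused afterwards.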
# split the Content column on whitespace and reshape to one token per row;
# idx keeps the original row number of df2
df22 = (df2['Content'].str.split(expand=True)
                      .stack()
                      .rename('val')
                      .reset_index(level=1, drop=True)
                      .rename_axis('idx')
                      .reset_index())
print(df22)
    idx     val
0     0     boy
1     0      or
2     0    girl
3     1  school
4     1       @
5     1    morn
6     2       1
7     2  school
8     2       @
9     2    morn
10    3   Pechi
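On pandas 0.25 or newer, the split-and-stack step can likely be written more briefly with `Series.explode`. The sketch below is an alternative under that version assumption, not the code from the answer; `tokens` is a name chosen only for this example.

```python
import pandas as pd

df2 = pd.DataFrame({'Content': ['boy or girl', 'school @ morn',
                                '1 school @ morn', 'Pechi']})

# str.split() without expand gives one list of tokens per row;
# explode turns each list element into its own row and repeats the index.
tokens = (df2['Content'].str.split()
                        .explode()
                        .rename('val')
                        .rename_axis('idx')
                        .reset_index())

print(tokens)
```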
# left join the token table with the keyword table, then drop tokens that
# matched nothing; for multiple matches in one row always keep the first
# df1 column by sorting on col_order before drop_duplicates
df = (pd.merge(df22, df11, on='val', how='left')
        .dropna(subset=['col_name'])
        .sort_values(['idx', 'col_order'])
        .drop_duplicates(['idx']))
# if necessary, add the original text from df2;
# rows with no match get the category 'Other'
df = (pd.concat([df2, df.set_index('idx')], axis=1)
        .fillna({'col_name': 'Other'})[['val', 'col_name', 'Content']])
print(df)
   val col_name          Content
0   or    Alpha      boy or girl
1    @  Special    school @ morn
2    1  Numeric  1 school @ morn
3  NaN    Other            Pechi
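The tie-breaking used above (sort by `col_order`, then `drop_duplicates` on `idx`) can be checked in isolation. The `matches` frame below is hand-written sample data, not output copied from the answer; it imitates a row that hits both the Numeric and the Special column.

```python
import pandas as pd

# Hypothetical match table: row idx 2 matched two different df1 columns.
matches = pd.DataFrame({
    'idx': [2, 2, 1],
    'val': ['@', '1', '@'],
    'col_order': [2, 1, 2],
    'col_name': ['Special', 'Numeric', 'Special'],
})

# After sorting, the match from the lowest col_order comes first within each
# idx, and drop_duplicates keeps that first row (keep='first' is the default).
first_match = (matches.sort_values(['idx', 'col_order'])
                      .drop_duplicates('idx')
                      .set_index('idx'))

print(first_match)
```

Without the sort, `drop_duplicates` would keep whichever match happens to appear first in the text, so `'@ 1 school'` would be labelled Special instead of Numeric.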
Edit: to ignore case, lower-case the tokens from df2 into a helper column and merge on that column:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Alpha': ['and', 'or', None, None, None],
                    'Numeric': ['1', '2', '3', '4', '5'],
                    'Special': ['@', '$', '&', None, None]})
df2 = pd.DataFrame({'Content': ['boy OR girl', 'school @ morn',
                                '1 school @ morn', 'Pechi']})

# if the Alpha values in df1 are not already lower case
# df1['Alpha'] = df1['Alpha'].str.lower()

df1.columns = [np.arange(len(df1.columns)), df1.columns]
df11 = (df1.unstack()
           .reset_index(level=2, drop=True)
           .rename_axis(('col_order', 'col_name'))
           .dropna()
           .reset_index(name='val_low'))

df22 = (df2['Content'].str.split(expand=True)
                      .stack()
                      .rename('val')
                      .reset_index(level=1, drop=True)
                      .rename_axis('idx')
                      .reset_index())

# lower-cased copy of the tokens in a new column, used only for matching
df22['val_low'] = df22['val'].str.lower()

df = (pd.merge(df22, df11, on='val_low', how='left')
        .dropna(subset=['col_name'])
        .sort_values(['idx', 'col_order'])
        .drop_duplicates(['idx']))

df = (pd.concat([df2, df.set_index('idx')], axis=1)
        .fillna({'col_name': 'Other'})[['val', 'col_name', 'Content']])
print(df)
   val col_name          Content
0   OR    Alpha      boy OR girl
1    @  Special    school @ morn
2    1  Numeric  1 school @ morn
3  NaN    Other            Pechi
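For small keyword lists, a plain dictionary lookup might be easier to follow than the merge-based approach. The sketch below is an alternative technique, not the answer's method. The names `lookup`, `order` and `first_category` are invented for this example, and it returns only the category, not the matched token.

```python
import pandas as pd

df1 = pd.DataFrame({'Alpha': ['and', 'or', None, None, None],
                    'Numeric': ['1', '2', '3', '4', '5'],
                    'Special': ['@', '$', '&', None, None]})
df2 = pd.DataFrame({'Content': ['boy OR girl', 'school @ morn',
                                '1 school @ morn', 'Pechi']})

# lower-cased keyword -> df1 column name. Columns are visited right to left,
# so if a keyword occurs in two columns the leftmost column overwrites last.
lookup = {}
for col in reversed(df1.columns.tolist()):
    for keyword in df1[col].dropna():
        lookup[str(keyword).lower()] = col

# Column position in df1; a lower position has priority.
order = {col: pos for pos, col in enumerate(df1.columns)}


def first_category(text):
    """Return the leftmost df1 column with a keyword in text, else 'Other'."""
    hits = [lookup[token] for token in text.lower().split() if token in lookup]
    return min(hits, key=order.get) if hits else 'Other'


df2['col_name'] = df2['Content'].apply(first_category)
print(df2)
```

Both sides are lower-cased before comparison, so `OR` in the text matches `or` in df1 without changing the stored values.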