Python: how to count word occurrences when using isin on split sentences (pandas)?
I am doing text analysis and trying to quantify the value of a sentence as the sum of the values of certain words it contains. I have a DF with words and values, for example:
import pandas as pd

df_w = pd.DataFrame({'word': ['high', 'sell', 'hello'],
                     'value': [32, 45, 12]})
Then, in another DF, I have some sentences, for example:
df_s = pd.DataFrame({'sentence': [ 'hello life if good',
'i sell this at a high price',
'i sell or you sell'] } )
Now I want to add a column to df_s containing, for each sentence, the sum of the values of its words that appear in df_w. To do this, I tried:
df_s['value'] = df_s['sentence'].apply(lambda x: sum(df_w['value'][df_w['word'].isin(x.split(' '))]))
The result is:
sentence value
0 hello life if good 12
1 i sell this at a high price 77
2 i sell or you sell 45
My problem with this answer is that in the last sentence, 'i sell or you sell', the word sell appears twice. I expected 90 (2 * 45), but sell was only counted once, so I got 45.
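The undercount is inherent to how isin is used here: the boolean mask has one entry per row of df_w, so a matching word can contribute its value at most once, however often it repeats in the sentence. A minimal sketch illustrating this behaviour:

```python
import pandas as pd

df_w = pd.DataFrame({'word': ['high', 'sell', 'hello'],
                     'value': [32, 45, 12]})

words = 'i sell or you sell'.split(' ')

# One boolean per row of df_w; the duplicate 'sell' in `words`
# cannot set the same row to True twice.
mask = df_w['word'].isin(words)
print(mask.tolist())                  # [False, True, False]
print(df_w.loc[mask, 'value'].sum())  # 45, not 90
```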
To solve this, I decided to build a dictionary and then use apply:
dict_w = pd.Series(df_w['value'].values,index=df_w['word']).to_dict()
df_s['value'] = df_s['sentence'].apply(lambda x: sum([dict_w[word] for word in x.split(' ') if word in dict_w.keys()]))
This time the result is what I expected (90 for the last sentence). But my problem is that my DFs are much larger, and on my test case the dict_w method takes about 20 times longer than the isin method.
Do you know a way, using isin, to multiply a word's value by the number of times it occurs in the sentence? Any other solution is also welcome.

You can use str.split with stack, filter the result (isin), replace those keywords with their values, and then assign back:
s=df_s.sentence.str.split(' ',expand=True).stack()
df_s['Value']=s[s.isin(df_w.word)].replace(dict(zip(df_w.word,df_w.value))).sum(level=0)
df_s
Out[984]:
sentence Value
0 hello life if good 12
1 i sell this at a high price 77
2 i sell or you sell 90
Create a function using the dictionary's get method with a default value:
dw = lambda x: dict(zip(df_w.word, df_w.value)).get(x, 0)
df_s.assign(value=[sum(map(dw, s.split())) for s in df_s.sentence])
sentence value
0 hello life if good 12
1 i sell this at a high price 77
2 i sell or you sell 90
Prompted by piRSquared's answer with his map function, I came up with the idea of using merge, for example:
df_s['value'] = df_s['sentence'].apply(lambda x: sum(pd.merge(pd.DataFrame({'word':x.split(' ')}),df_w)['value']))
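This handles repeated words correctly because an inner merge emits one row per matching pair: a word that occurs twice in the sentence produces two rows, each carrying its value. A small sketch of the intermediate result:

```python
import pandas as pd

df_w = pd.DataFrame({'word': ['high', 'sell', 'hello'],
                     'value': [32, 45, 12]})

words = pd.DataFrame({'word': 'i sell or you sell'.split(' ')})
merged = pd.merge(words, df_w)  # inner join on the shared 'word' column
print(merged)                   # two 'sell' rows, value 45 each
print(merged['value'].sum())    # 90
```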
Prompted by Wen's answer using stack, I took his idea but used merge instead, for example:
df_stack = pd.DataFrame({'word': df_s['sentence'].str.split(' ',expand=True).stack()})
df_s['value'] = df_stack.reset_index().merge(df_w).set_index(['level_0','level_1'])['value'].sum(level=0)
Both of these methods give me the correct answer. Finally, to test which solution is faster, I defined the following functions:
import pandas as pd

df_w = pd.DataFrame({'word': ['high', 'sell', 'hello'],
                     'value': [32, 45, 12]})

def sol_dict(df_s, df_w):  # answer with a dict
    dict_w = pd.Series(df_w['value'].values, index=df_w['word']).to_dict()
    df_s['value'] = df_s['sentence'].apply(
        lambda x: sum([dict_w[word] for word in x.split(' ') if word in dict_w]))
    return df_s

def sol_wen(df_s, df_w):  # answer of Wen
    s = df_s.sentence.str.split(' ', expand=True).stack()
    df_s['value'] = s[s.isin(df_w.word)].replace(
        dict(zip(df_w.word, df_w.value))).sum(level=0)
    return df_s

def sol_pi(df_s, df_w):  # answer of piRSquared
    dw = lambda x: dict(zip(df_w.word, df_w.value)).get(x, 0)
    # assign returns a copy, so write the column directly instead
    df_s['value'] = [sum(map(dw, s.split())) for s in df_s.sentence]
    return df_s

def sol_merge(df_s, df_w):  # answer with merge
    df_s['value'] = df_s['sentence'].apply(
        lambda x: sum(pd.merge(pd.DataFrame({'word': x.split(' ')}), df_w)['value']))
    return df_s

def sol_stack(df_s, df_w):  # answer with stack and merge
    df_stack = pd.DataFrame({'word': df_s['sentence'].str.split(' ', expand=True).stack()})
    df_s['value'] = df_stack.reset_index().merge(df_w).set_index(
        ['level_0', 'level_1'])['value'].sum(level=0)
    return df_s
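The timing harness itself is not shown; a minimal version, with made-up sizes, could look like this (each run gets a fresh copy of df_s so the added column does not carry over between calls):

```python
import timeit

import pandas as pd

df_w = pd.DataFrame({'word': ['high', 'sell', 'hello'],
                     'value': [32, 45, 12]})
df_s = pd.DataFrame({'sentence': ['hello life if good',
                                  'i sell this at a high price',
                                  'i sell or you sell'] * 1000})

def sol_dict(df_s, df_w):  # one of the candidates above
    dict_w = pd.Series(df_w['value'].values, index=df_w['word']).to_dict()
    df_s['value'] = df_s['sentence'].apply(
        lambda x: sum(dict_w[w] for w in x.split(' ') if w in dict_w))
    return df_s

# Average over several calls; pass a copy so runs do not interfere.
t = timeit.timeit(lambda: sol_dict(df_s.copy(), df_w), number=10)
print(f'sol_dict: {t / 10:.5f} s per call')
```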
My "large" test DFs consist of about 3,200 words in df_w and about 42,700 words in df_s (splitting all sentences at once). I ran timeit with several sizes of df_w (from 320 to 3,200 words) against the full-size df_s, and then with several sizes of df_s (from 3,500 to 42,700 words) against the full-size df_w. After curve-fitting the results, I got:
In conclusion, whatever the size of the two DFs, the method using stack and then merge is very efficient (around 100 ms; sorry that it is hard to see on the graph). I ran it on my full-size DFs, with about 54k words in df_w and about 2.4 million words in df_s, and got the result within a few seconds.
Thanks to both of you for your ideas.