Python: find multiple words from a sentence in a DataFrame and convert them to a sum of scores
I have the following DataFrame:
Sentence
0 Cat is a big lion
1 Dogs are descendants of wolf
2 Elephants are pachyderm
3 Pachyderm animals include rhino, Elephants and hippopotamus
I need to write Python code that looks at the words in the sentences above and computes the sum of each word's score based on the following, separate DataFrame:
Name Score
cat 1
dog 2
wolf 2
lion 3
elephants 5
rhino 4
hippopotamus 5
For example, for row 0 the score would be 1 (cat) + 3 (lion) = 4.
I would like to produce output like the following:
Sentence Value
0 Cat is a big lion 4
1 Dogs are descendants of wolf 4
2 Elephants are pachyderm 5
3 Pachyderm animals include rhino, Elephants and hippopotamus 14
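For anyone reproducing this, the two DataFrames from the question can be built like so (the names df1 and df2 are assumptions, matching the answers below):

```python
import pandas as pd

# Sentences to score.
df1 = pd.DataFrame({'Sentence': [
    'Cat is a big lion',
    'Dogs are descendants of wolf',
    'Elephants are pachyderm',
    'Pachyderm animals include rhino, Elephants and hippopotamus',
]})

# Word-to-score lookup table.
df2 = pd.DataFrame({
    'Name':  ['cat', 'dog', 'wolf', 'lion', 'elephants', 'rhino', 'hippopotamus'],
    'Score': [1, 2, 2, 3, 5, 4, 5],
})
```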
First, you can try an approach based on splitting and mapping, then compute the score with a group-by sum:
v = df1['Sentence'].str.split(r'[\s.!?,]+', expand=True).stack().str.lower()
df1['Value'] = (
    v.map(df2.set_index('Name')['Score'])
     .groupby(level=0).sum()   # Series.sum(level=0) was deprecated and later removed
     .fillna(0, downcast='infer'))
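As a self-contained sketch (the DataFrame names are assumptions): note that exact-token matching gives row 1 a score of 2 rather than the expected 4, because the token "dogs" does not equal "dog" — that mismatch is what the stemming answer below addresses.

```python
import pandas as pd

df1 = pd.DataFrame({'Sentence': [
    'Cat is a big lion',
    'Dogs are descendants of wolf',
    'Elephants are pachyderm',
    'Pachyderm animals include rhino, Elephants and hippopotamus',
]})
df2 = pd.DataFrame({
    'Name':  ['cat', 'dog', 'wolf', 'lion', 'elephants', 'rhino', 'hippopotamus'],
    'Score': [1, 2, 2, 3, 5, 4, 5],
})

# Split each sentence on whitespace/punctuation, lower-case the tokens,
# map each token to its score, then sum per original row index.
v = df1['Sentence'].str.split(r'[\s.!?,]+', expand=True).stack().str.lower()
df1['Value'] = (v.map(df2.set_index('Name')['Score'])
                 .groupby(level=0).sum()
                 .astype(int))
print(df1['Value'].tolist())  # [4, 2, 5, 14] -- "Dogs" misses "dog"
```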
For an nltk-based approach, you may first need to download some data:
import nltk
nltk.download('punkt')
Then set up the stemmer and tokenizer:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
Create a handy dictionary (here scores is the Name/Score DataFrame):
m = dict(zip(map(ps.stem, scores.Name), scores.Score))
And generate the scores:
def f(s):
    return sum(filter(None, map(m.get, map(ps.stem, word_tokenize(s)))))

df.assign(Score=[*map(f, df.Sentence)])
Sentence Score
0 Cat is a big lion 4
1 Dogs are descendants of wolf 4
2 Elephants are pachyderm 5
3 Pachyderm animals include rhino, Elephants and... 14
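If downloading NLTK data is not an option, the same stem-dictionary idea can be sketched dependency-free with a naive plural-stripping stand-in for PorterStemmer (an assumption — it only strips a trailing "s" and is not equivalent to real stemming):

```python
import re
import pandas as pd

df = pd.DataFrame({'Sentence': [
    'Cat is a big lion',
    'Dogs are descendants of wolf',
    'Elephants are pachyderm',
    'Pachyderm animals include rhino, Elephants and hippopotamus',
]})
scores = pd.DataFrame({
    'Name':  ['cat', 'dog', 'wolf', 'lion', 'elephants', 'rhino', 'hippopotamus'],
    'Score': [1, 2, 2, 3, 5, 4, 5],
})

def stem(word):
    # Naive stand-in for PorterStemmer: lower-case and strip a plural "s".
    word = word.lower()
    return word[:-1] if word.endswith('s') else word

# Dictionary keyed by the stemmed name, mirroring the NLTK answer.
m = {stem(name): score for name, score in zip(scores.Name, scores.Score)}

def f(s):
    # re.findall as a simple word tokenizer in place of word_tokenize.
    return sum(m.get(stem(w), 0) for w in re.findall(r'[A-Za-z]+', s))

out = df.assign(Score=[*map(f, df.Sentence)])
print(out.Score.tolist())  # [4, 4, 5, 14]
```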
Try findall with re's case-insensitive flag re.I:
df.Sentence.str.findall(df1.Name.str.cat(sep='|'), flags=re.I)\
  .map(lambda x: sum([df1.loc[df1.Name == y.lower(), 'Score'].values for y in x])[0])
Out[49]:
0 4
1 4
2 5
3 14
Name: Sentence, dtype: int64
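One caveat with the bare alternation above: without word boundaries, "cat" would also match inside an unrelated word such as "category". A hedged variant (a sketch, not the answer's exact code) anchors the pattern with \b and allows an optional plural "s", which is also what lets "Dogs" count as "dog":

```python
import re
import pandas as pd

df = pd.DataFrame({'Sentence': [
    'Cat is a big lion',
    'Dogs are descendants of wolf',
    'Elephants are pachyderm',
    'Pachyderm animals include rhino, Elephants and hippopotamus',
]})
df1 = pd.DataFrame({
    'Name':  ['cat', 'dog', 'wolf', 'lion', 'elephants', 'rhino', 'hippopotamus'],
    'Score': [1, 2, 2, 3, 5, 4, 5],
})

# Longest names first so the alternation prefers the longest match;
# the capturing group returns the bare name even when the extra "s" matched.
names = sorted(df1.Name, key=len, reverse=True)
pattern = r'\b(' + '|'.join(map(re.escape, names)) + r')s?\b'
lookup = df1.set_index('Name')['Score']

values = (df.Sentence.str.findall(pattern, flags=re.I)
            .map(lambda ws: sum(lookup[w.lower()] for w in ws)))
print(values.tolist())  # [4, 4, 5, 14]
```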
I was finally able to move on with this.