Python: find all valid word pairs (bigrams) from a sentence
Tags: python, nlp, nltk, itertools

Suppose I have a string such as:
'velvet evening purse bags'
How can I get all the word pairs from it? In other words, every combination of two words:
'velvet evening'
'velvet purse'
'velvet bags'
'evening purse'
'evening bags'
'purse bags'
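I know that Python's nltk package can produce bigrams, but I am looking for something beyond that. Or do I have to write my own custom function in Python?

You can use combinations from itertools: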
s = 'velvet evening purse bags'
from nltk import word_tokenize
words = word_tokenize(s)
from itertools import combinations
pairs = [' '.join(comb) for comb in combinations(words, 2)]
print(pairs)
Output:
['velvet evening', 'velvet purse', 'velvet bags', 'evening purse', 'evening bags', 'purse bags']
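If NLTK is not installed, a plain whitespace split may be enough for simple input like this. The following is a minimal sketch under that assumption; `word_pairs` is a hypothetical helper name used only for illustration, not something from NLTK or itertools.

```python
from itertools import combinations


def word_pairs(sentence):
    """Return every pair of words, each pair kept in left-to-right order."""
    # Assumption: words are separated by whitespace and need no further
    # tokenization (punctuation would stay attached to its word).
    words = sentence.split()
    return [' '.join(pair) for pair in combinations(words, 2)]


print(word_pairs('velvet evening purse bags'))
```

For n distinct words this gives n * (n - 1) / 2 pairs, so four words give six.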
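You can also go old school:
text = 'velvet evening purse bags'
n = []    # pairs already seen, in both orders
ans = []  # resulting pairs
for i in text.split():
    for j in text.split():
        if j != i:
            if (i, j) not in n:
                ans.append((i, j))
                # record both orders so the reversed pair is skipped later
                n.append((i, j))
                n.append((j, i))
print(ans)
Output: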
[('velvet', 'evening'),
('velvet', 'purse'),
('velvet', 'bags'),
('evening', 'purse'),
('evening', 'bags'),
('purse', 'bags')]
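This should be fun =) If the input is 'velvet evening purse bags' and the desired output is what @MrGeek generated with itertools.combinations, then that is in effect the definition of skipgrams.
So you can get the same result with: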
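from nltk import skipgrams, word_tokenize
s = 'velvet evening purse bags'
tokens = word_tokenize(s)
# k is the maximum number of skipped words; len(tokens)-1 is large enough to reach every later word
list(skipgrams(tokens, n=2, k=len(tokens)-1))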
[out]:
[('velvet', 'evening'),
('velvet', 'purse'),
('velvet', 'bags'),
('evening', 'purse'),
('evening', 'bags'),
('purse', 'bags')]
In this case, each word can only pair with a word to its right, which loosely matches how English is read.
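To make the role of `k` concrete, here is a rough pure-Python sketch of skip-bigrams as described above: each word is paired with each of the next `k + 1` words. `skip_bigrams` is a hypothetical name for illustration and is not NLTK's implementation; it uses `str.split` instead of `word_tokenize` to stay dependency-free.

```python
def skip_bigrams(tokens, k):
    """Pair each token with each of the next k + 1 tokens."""
    pairs = []
    for i, head in enumerate(tokens):
        # The slice simply stops at the end of the list, so no padding is needed.
        for other in tokens[i + 1:i + k + 2]:
            pairs.append((head, other))
    return pairs


tokens = 'velvet evening purse bags'.split()
print(skip_bigrams(tokens, k=0))                # adjacent pairs only
print(skip_bigrams(tokens, k=len(tokens) - 1))  # every left-to-right pair
```

With `k=0` only neighbouring words are paired, which is the ordinary bigram case; a large enough `k` reaches every later word in the sentence.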
If instead you want every ordered pair ("permutation") of words, including each word paired with itself:
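from itertools import product
from nltk import word_tokenize
s = 'velvet evening purse bags'
# set() drops repeated words; set iteration order is arbitrary, so the pairs may come out in a different order
tokens = set(word_tokenize(s))
list(product(tokens, tokens))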
[out]:
[('velvet', 'velvet'),
('velvet', 'evening'),
('velvet', 'purse'),
('velvet', 'bags'),
('evening', 'velvet'),
('evening', 'evening'),
('evening', 'purse'),
('evening', 'bags'),
('purse', 'velvet'),
('purse', 'evening'),
('purse', 'purse'),
('purse', 'bags'),
('bags', 'velvet'),
('bags', 'evening'),
('bags', 'purse'),
('bags', 'bags')]
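Because a `set` does not preserve order, the pairs may not come out in the order listed above. If order matters, one option is to deduplicate with `dict.fromkeys`, which keeps first-seen order. A minimal sketch, using `str.split` in place of `word_tokenize` as a simplifying assumption:

```python
from itertools import product

s = 'velvet evening purse bags'
# dict.fromkeys drops repeated words while keeping first-seen order
# (dicts preserve insertion order in Python 3.7+).
tokens = list(dict.fromkeys(s.split()))
pairs = list(product(tokens, repeat=2))
print(len(pairs))
print(pairs[:4])
```

`product(tokens, repeat=2)` is equivalent to `product(tokens, tokens)` and yields n * n pairs for n distinct words.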
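If you need bigrams rather than arbitrary pairs, note that bigrams are pairs of consecutive words. So you would iterate over two iterables, one of them advanced by one step. Have a look at tee and zip_longest.

Just a comment: you can also use word_tokenize from nltk!

Thanks, I really like this answer as a complement to the accepted answer.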
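The `tee` idea from the last answer can be sketched as follows. This is the familiar "pairwise" pattern from the itertools documentation; it uses plain `zip` rather than `zip_longest`, on the assumption that a trailing half-empty pair is not wanted. The function name `bigrams` is chosen here for illustration.

```python
from itertools import tee


def bigrams(words):
    """Return an iterator over pairs of consecutive words."""
    first, second = tee(words)
    # Advance the second iterator by one step so it runs one word ahead.
    next(second, None)
    return zip(first, second)


print(list(bigrams('velvet evening purse bags'.split())))
```

With `zip_longest` instead of `zip`, the result would also include a final pair whose second element is the fill value (`None` by default).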