Python: find all word pairs in a sentence

Suppose I have a string such as:

 'velvet evening purse bags'
How can I get all the word pairs from it? In other words, every combination of two words:

'velvet evening'
'velvet purse'
'velvet bags'
'evening purse'
'evening bags'
'purse bags'
I know Python's nltk package can produce bigrams, but I'm looking for something beyond that. Or do I have to write my own custom function in Python?

You can use this:

s = 'velvet evening purse bags'

from nltk import word_tokenize

words = word_tokenize(s)

from itertools import combinations

pairs = [' '.join(comb) for comb in combinations(words, 2)]

print(pairs)
Output:

['velvet evening', 'velvet purse', 'velvet bags', 'evening purse', 'evening bags', 'purse bags']
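
As a minimal sketch, the same idea also works without NLTK, assuming plain whitespace splitting is good enough for the input (it will not separate punctuation the way word_tokenize does). The helper name word_pairs is made up for this illustration:

```python
from itertools import combinations

def word_pairs(sentence):
    """Return every two-word combination, keeping the sentence's word order."""
    words = sentence.split()  # assumption: whitespace tokenization instead of NLTK
    return [' '.join(pair) for pair in combinations(words, 2)]

pairs = word_pairs('velvet evening purse bags')
print(pairs)
```

A sentence of n words yields n * (n - 1) / 2 pairs, so four words give six pairs.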

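You can also go old school:

text = 'velvet evening purse bags'
n = []    # pairs already seen, stored in both orders
ans = []  # resulting pairs
for i in text.split():
    for j in text.split():
        if j != i:
            if (i, j) not in n:
                ans.append((i, j))
                n.append((i, j))
                n.append((j, i))
Output: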

[('velvet', 'evening'),
 ('velvet', 'purse'),
 ('velvet', 'bags'),
 ('evening', 'purse'),
 ('evening', 'bags'),
 ('purse', 'bags')]

This should be fun =)

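If the input is 'velvet evening purse bags' and the desired output is what @MrGeek produced with itertools.combinations, then this is essentially the definition of skipgrams, with n=2 and a skip distance k large enough to span the whole sentence.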

So you can achieve the same result with:

from nltk import skipgrams, word_tokenize

s = 'velvet evening purse bags'
tokens = word_tokenize(s)
list(skipgrams(tokens, n=2, k=len(tokens)-1))
[out]:

[('velvet', 'evening'),
 ('velvet', 'purse'),
 ('velvet', 'bags'),
 ('evening', 'purse'),
 ('evening', 'bags'),
 ('purse', 'bags')]
In this case, each word can only be combined with another word to its right, which more or less matches how English reads.
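
To illustrate the "only words to its right" behaviour without NLTK, here is a rough sketch using index-based loops. It assumes whitespace tokenization, and right_hand_pairs is a hypothetical helper name, not part of any library:

```python
from itertools import combinations

def right_hand_pairs(tokens):
    """Pair each token with every token that appears after it."""
    pairs = []
    for i in range(len(tokens)):
        # j starts at i + 1, so a word never pairs with itself or with a word on its left
        for j in range(i + 1, len(tokens)):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = 'velvet evening purse bags'.split()
result = right_hand_pairs(tokens)
print(result)
print(result == list(combinations(tokens, 2)))  # True
```

The final comparison shows that this loop produces the same pairs, in the same order, as itertools.combinations.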

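If instead you want every ordered pairing of the words, including each word paired with itself, use itertools.product. Note that set() does not preserve word order, so the pairs may come out in a different order than shown: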

from itertools import product
from nltk import word_tokenize
s = 'velvet evening purse bags'
tokens = set(word_tokenize(s))
list(product(tokens, tokens))
[out]:

[('velvet', 'velvet'),
 ('velvet', 'evening'),
 ('velvet', 'purse'),
 ('velvet', 'bags'),
 ('evening', 'velvet'),
 ('evening', 'evening'),
 ('evening', 'purse'),
 ('evening', 'bags'),
 ('purse', 'velvet'),
 ('purse', 'evening'),
 ('purse', 'purse'),
 ('purse', 'bags'),
 ('bags', 'velvet'),
 ('bags', 'evening'),
 ('bags', 'purse'),
 ('bags', 'bags')]
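
For comparison, here is a small sketch of how the three itertools helpers differ on the same four words. It assumes whitespace splitting, and uses dict.fromkeys (rather than set) to drop duplicate words while keeping their order:

```python
from itertools import combinations, permutations, product

# Deduplicate while preserving order (a set would not guarantee the order)
tokens = list(dict.fromkeys('velvet evening purse bags'.split()))

all_ordered = list(product(tokens, repeat=2))  # ordered pairs, self-pairs included
no_self = list(permutations(tokens, 2))        # ordered pairs, no self-pairs
unordered = list(combinations(tokens, 2))      # each pair once, left-to-right only

print(len(all_ordered), len(no_self), len(unordered))  # 16 12 6
```

With n distinct words these give n * n, n * (n - 1), and n * (n - 1) / 2 pairs respectively.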

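If you need bigrams rather than just any pairs, note that bigrams must be pairs of consecutive words. So you would iterate over two iterables, one of them advanced by one step. Take a look at tee and zip_longest.
Just one comment: you could also use word_tokenize from nltk!
Thanks, I really like this answer as a complement to the accepted answer.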
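
The bigram comment above can be sketched roughly as follows. This version uses tee with plain zip rather than zip_longest, so there is no padded trailing pair. It assumes whitespace tokenization, and consecutive_pairs is a name invented for this example:

```python
from itertools import tee

def consecutive_pairs(words):
    """Return bigrams: each word paired with the word immediately after it."""
    first, second = tee(words)  # two independent iterators over the same words
    next(second, None)          # advance the second iterator by one step
    return list(zip(first, second))

bigrams = consecutive_pairs('velvet evening purse bags'.split())
print(bigrams)
# [('velvet', 'evening'), ('evening', 'purse'), ('purse', 'bags')]
```

Unlike the combinations-based answers, this yields only adjacent words, which is why it gives three pairs here instead of six.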