Matching adjacent list elements against a list of tuples in Python

I have an ordered list of the individual words from a document, like so:

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]

I also have a second list of tuples holding significant bigrams/collocations, like so:

bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]

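I would like to iterate over the list of individual words and replace adjacent words that form a bigram with the underscore-joined bigram, ending up with a list like this: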

words_fixed = ['apple_orange', 'boat', 'car', 'happy_day', 'cow', ...]

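I had considered flattening words and bigrams into strings (" ".join(words), etc.) and then using a regex to find and replace the adjacent words, but that seems horribly inefficient and unpythonic.

What is the best way to quickly match and combine adjacent list elements against a list of tuples?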

First, a bit of optimization:

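import collections
bigram_pairs = bigrams  # keep a reference to the original list of tuples
bigrams = collections.defaultdict(set)  # first word -> set of second words
for w1, w2 in bigram_pairs:
    bigrams[w1].add(w2)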
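As a quick check of that lookup structure, here is a minimal sketch using the sample data from the question. The names first_to_seconds and is_bigram are hypothetical; they are used here only so that the original bigrams list is not rebound.

```python
import collections

bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

# Map each first word to the set of second words that may follow it.
first_to_seconds = collections.defaultdict(set)
for w1, w2 in bigrams:
    first_to_seconds[w1].add(w2)

def is_bigram(w1, w2):
    # .get() avoids creating an empty entry in the defaultdict on a miss.
    return w2 in first_to_seconds.get(w1, ())

print(is_bigram('apple', 'orange'))  # True
print(is_bigram('orange', 'apple'))  # False
```

Each test is a dictionary lookup followed by a set lookup, so the cost per word pair does not grow with the number of bigrams.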

Now, on to the fun stuff:

import itertools
words_fixed = []
# Pair each word with its successor (Python 3: zip replaces itertools.izip)
for w1, w2 in zip(words, itertools.islice(words, 1, None)):
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))

If you also want to see the words that are not in your bigrams, in addition to the bigrams you have recorded, then this should do the trick:

import itertools
words_fixed = []
# Pair each word with its successor (Python 3: zip replaces itertools.izip)
for w1, w2 in zip(words, itertools.islice(words, 1, None)):
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))
    else:
        words_fixed.append(w1)
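The loop above appends something for every adjacent pair, so the second word of a matched bigram is appended again on the next pair and the final word never appears. A minimal sketch of one way around this, assuming an index-based scan that skips the word it has just consumed (the function name merge_bigrams is hypothetical, not part of the original answer):

```python
def merge_bigrams(words, bigrams):
    """Join adjacent words that form a known bigram with an underscore."""
    pairs = set(bigrams)  # set of tuples gives constant-time membership tests
    merged = []
    i = 0
    while i < len(words):
        # Only look ahead when there is a next word to pair with.
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            merged.append("%s_%s" % (words[i], words[i + 1]))
            i += 2  # skip the word that was consumed by the bigram
        else:
            merged.append(words[i])
            i += 1
    return merged

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print(merge_bigrams(words, bigrams))
# ['apple_orange', 'boat', 'car', 'happy_day', 'cow']
```

With overlapping candidates, such as ('a', 'b') and ('b', 'c') over the words a, b, c, this sketch merges greedily from the left and yields a_b followed by c.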

Not as flashy as @inspectorG4dget's:

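words_fixed = []
last = None  # word held back until we have seen its successor
for word in words:
    if (last, word) in bigrams:
        words_fixed.append("%s_%s" % (last, word))
        last = None  # both words were consumed by the bigram
    else:
        if last is not None:
            words_fixed.append(last)
        last = word
# Flush the final word if it was not part of a bigram
if last is not None:
    words_fixed.append(last)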
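The same hold-back idea can also be written as a generator, which is one way to handle a word stream that is not a list. This is only a sketch under two assumptions that go beyond the answer: the bigrams are copied into a set for faster membership tests, and the name merge_adjacent is hypothetical.

```python
def merge_adjacent(words, bigrams):
    """Yield words, joining adjacent pairs that appear in bigrams."""
    pairs = set(bigrams)  # assumption: set lookup instead of a list scan
    last = None           # word held back until its successor is seen
    for word in words:
        if last is not None and (last, word) in pairs:
            yield "%s_%s" % (last, word)
            last = None   # both words were consumed by the bigram
        else:
            if last is not None:
                yield last
            last = word
    if last is not None:
        yield last        # flush the final unpaired word

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print(list(merge_adjacent(iter(words), bigrams)))
# ['apple_orange', 'boat', 'car', 'happy_day', 'cow']
```

Because it only ever looks at the current word and the one held back, it accepts any iterable and leaves both input lists untouched.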

[Edit] Another approach, creating a dictionary:

from itertools import chain
bigrams_rev = (reversed(x) for x in bigrams)
bigrams_dict = dict(chain(bigrams, bigrams_rev))

Result:

words   : ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams : [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

['apple-orange', 'boat', 'car', 'happy-day', 'cow', 'big']
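To see what the dictionary built in this answer actually holds, here is a small sketch with the sample bigrams. Wrapping reversed(x) in tuple() is an addition made here for readability, and the overlapping example is hypothetical data, not from the thread.

```python
from itertools import chain

bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

# Each pair is stored in both orientations: first -> second and second -> first.
bigrams_rev = (tuple(reversed(x)) for x in bigrams)
bigrams_dict = dict(chain(bigrams, bigrams_rev))

print(bigrams_dict['apple'])   # orange
print(bigrams_dict['orange'])  # apple

# A dict keeps one value per key, so a later pair with the same
# first word replaces the earlier one (hypothetical data).
overlapping = dict([('big', 'house'), ('big', 'dog')])
print(overlapping)  # {'big': 'dog'}
```

Two consequences follow from this layout: a reversed pair such as orange followed by apple is also found in the dictionary, and a word can have only one partner.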

This ends up reversing the bigrams as well, resulting in

['apple_orange', 'orange_apple', 'boat', 'car', 'happy_day', 'day_happy', 'cow']

Well, then the solution you need is the one from @inspectorG4dget ;)

Oh, this is cool, but it ends up returning only the joined bigrams:

['apple_orange', 'happy_day']

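Awesome. That's also really fast. Thanks!

Actually, almost. Appending just w1 works, but it leaves out the last word because that one is stored in w2. The result is

['apple_orange', 'orange', 'boat', 'car', 'happy_day', 'day']

which repeats day (and orange) and misses cow.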

Awesome. Not as flashy as the itertools version, but it works without altering the original words and bigrams variables.