Python: get a list of the words in a string that are not in another list

python python-2.7

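I have a very long string called tekst (600 MB, read from a file) and a list of 11,000 words called nlwoorden. I want everything that is in tekst but not in nlwoorden.

belangrijk=[woord for woord in tekst.split() if woord not in nlwoorden]

would produce what I want. Obviously, this takes a very long time to compute. Is there a more efficient way?

Thanks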

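This snippet:

woord not in nlwoorden

takes O(N) time on every call, where N = len(nlwoorden).

So your list comprehension

belangrijk=[woord for woord in tekst.split() if woord not in nlwoorden]

takes O(N*M) time in total, where M = len(tekst.split()).

This is because nlwoorden is a list, not a set. Testing membership in an unordered list the naive way means traversing the entire list in the worst case.

That is why your statement takes so long on large inputs.

With a hash set, once the set has been built, each membership test takes constant time on average.

So, in prototype code form, something like this: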

import io

def words(fileobj):
    for line in fileobj:             # takes care of buffering large files, chunks at a time
        for word in line.split():
            yield word

# first, build the set of whitelisted words
wpath = 'whitelist.txt'
wset = set()
with io.open(wpath, mode='rb') as w:
    for word in words(w):
        wset.add(word)

def emit(word):
    # output 'word' - to a list, to another file, to a pipe, etc
    print word

fpath = 'input.txt'
with io.open(fpath, mode='rb') as f:
    for word in words(f):               # total run time - O(M) where M = len(words(f))
        if word not in wset:            # testing for membership in a hash set - O(1)
            emit(word)
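To make the list-versus-set cost difference concrete, here is a rough timing sketch. The data is synthetic (numeric strings standing in for real words) and the sizes are assumptions chosen only so the script finishes quickly; absolute timings will vary by machine.

```python
import timeit

# Synthetic stand-ins (assumptions, not the asker's data):
# an 11,000-entry word list and 500 words of "text".
nlwoorden = [str(i) for i in range(11000)]
nlwoorden_set = set(nlwoorden)
tekst_words = [str(i) for i in range(10750, 11250)]  # half are in the list


def filter_with_list():
    # each "not in" scans the list: O(N) per word
    return [w for w in tekst_words if w not in nlwoorden]


def filter_with_set():
    # each "not in" is a hash lookup: O(1) on average per word
    return [w for w in tekst_words if w not in nlwoorden_set]


if __name__ == "__main__":
    print("list: %.4f s" % timeit.timeit(filter_with_list, number=3))
    print("set:  %.4f s" % timeit.timeit(filter_with_set, number=3))
```

Both functions return the same words in the same order; only the lookup structure differs.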

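Read and process "woorden.txt" line by line:

Load all of nlwoorden into a set (this is more efficient than loading it into a list).

Read the large file piece by piece, split each piece, and write only what is not in nlwoorden to the result file.

Assuming your large 600 MB file has reasonably long lines (not one 600 MB line), I would do it like this: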

nlwoorden = set()
with open("nlwoorden.txt") as f:
    for line in f:
        nlwoorden.update(line.split())

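# a single "with" can manage both files; the second "with" keyword was a syntax error
with open("woorden.txt") as f, open("out.txt", "w") as fo:
    for line in f:
        newwords = set(line.split())            # unique words on this line
        newwords.difference_update(nlwoorden)   # drop the known words
        if newwords:
            # end each output line so words from different lines do not run together
            fo.write(" ".join(newwords) + "\n")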

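Conclusion: this solution does not consume much memory, because you never read all of the data in "woorden.txt" at once.

If your file is not split into lines, you will have to change the way you read parts of the file. But I assume your file will contain newlines.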
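If the file really has no line breaks, one way to read it in parts is with fixed-size chunks, carrying a possibly cut-off word over to the next chunk. This is only a sketch: the name words_from_chunks and the default chunk size are made up for illustration, and it assumes words are separated by whitespace.

```python
import io


def words_from_chunks(fileobj, chunk_size=1 << 20):
    """Yield whitespace-separated words, reading chunk_size characters at a time."""
    tail = ""  # a word that may have been cut off at the end of the last chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        chunk = tail + chunk
        parts = chunk.split()
        if parts and not chunk[-1].isspace():
            # the chunk ends mid-word: hold the last piece back for the next round
            tail = parts.pop()
        else:
            tail = ""
        for word in parts:
            yield word
    if tail:
        yield tail  # last word of the file


# usage with an in-memory file and a deliberately tiny chunk size
demo = io.StringIO(u"aap noot  mies\nwim")
demo_words = list(words_from_chunks(demo, chunk_size=3))
```

The generator can replace the "for line in f" loop: test each yielded word against the nlwoorden set and write the ones that are not in it.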

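With a set-based solution, the difference itself costs O(len(nlwoorden)). Building the two sets should take another O(len(nlwoorden)) + O(len(tekst)).

So the snippet you are looking for is basically the one listed in the comments:

belangrijk=list(set(tekst.split()) - set(nlwoorden))

(assuming you want it as a list again at the end)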
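One caveat worth checking before adopting the set difference: it returns each remaining word once and in no particular order, whereas the original list comprehension keeps duplicates and text order. A small sketch with made-up sample data shows the difference:

```python
# Sample data invented for illustration
tekst = "de kat en de hond en de kat"
nlwoorden = ["de", "en"]

# Set difference: unique words only, order not preserved
as_set = set(tekst.split()) - set(nlwoorden)

# Comprehension against a set: keeps duplicates and original order,
# while each lookup is still O(1) on average
known = set(nlwoorden)
as_list = [woord for woord in tekst.split() if woord not in known]
```

Which of the two is right depends on whether repeated words and their positions matter for what belangrijk is used for.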

I think the most straightforward way is to use sets. For example:

s = "This is a test"
s2 = ["This", "is", "another", "test"]
set(s.split()) - set(s2)

# returns {'a'}

However, given the size of the text, it may be worth using a generator to avoid holding everything in memory at once, e.g.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()

[word for word in itersplit(s) if word not in s2]

# returns ['a']
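For the whitespace-only case, a shorter lazy splitter can be sketched with re.finditer. This is an alternative to itersplit, not the answer's own method; the name iterwords is made up here, and s2 is turned into a set so that each lookup stays cheap.

```python
import re


def iterwords(s):
    """Lazily yield the runs of non-whitespace characters in s."""
    for match in re.finditer(r"\S+", s):
        yield match.group()


s = "This is a test"
s2 = {"This", "is", "another", "test"}  # a set instead of a list

result = [word for word in iterwords(s) if word not in s2]
```

Unlike s.split(), this never builds the full list of words, although the source string itself still has to be in memory.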

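Make a set from nlwoorden. Ta-da.

belangrijk = set(tekst.split()) - set(nlwoorden)

@false: I'm a new user, so there may be things I don't know, but why did you post the answer as a comment rather than as an answer?

@Sohcahtoa82: I'm still trying to find a duplicate, but I also wanted to expand on what tobias_k said: answering a question you are about to close is bad form.

Stack Overflow is hard! Anyway, sorry.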