Alternative ways to extract lines from text (Python regular expressions)

python regex text

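I'm looking for a way to extract lines from a fairly large database in Python. I only need to keep the lines that contain one of my keywords. I thought I could solve this with regular expressions and put together some code for it. Unfortunately it gives me errors (possibly also because the number of keywords, written on separate lines in the file listtosearch.txt, is really large: nearly 500).

I also tried a double loop (over the keyword list and over the lines of the database), but it takes far too long to run.

The error I get is: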

Traceback (most recent call last):
  File "/usr/lib/python2.7/re.py", line 190, in compile 
    return _compile(pattern, flags)   
  File "/usr/lib/python2.7/re.py", line 240, in _compile 
    p = sre_compile.compile(pattern, flags) 
  File "/usr/lib/python2.7/sre_compile.py", line 511, in compile 
    "sorry, but this version only supports 100 named groups" 
AssertionError: sorry, but this version only supports 100 named groups

Any suggestions? Thanks.

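First of all, I'm pretty sure you meant data = open('database.txt').readlines() rather than read(). Otherwise data would be a single string, not a list of lines, and your loop (for line in data) would iterate over individual characters instead of lines.

At this point you are really looking for a solution indexed by keyword, since a simple search will no longer give results in a timely manner.

There really is no other approach that significantly improves efficiency or reduces complexity. You will have to bite the bullet and pay the cost of scanning the whole database.

Besides, if the database fits entirely in memory, it is not that big :)

That said, there are other ways to go about it that may be more efficient:

Put your keywords into a set, then split the input data into words and look each word up in the set: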

data = open('database.txt').readlines()

with open('listtosearch.txt', 'r') as f:
    # A set gives constant-time membership tests
    keywords = set(line.strip() for line in f)

with open("fileout.txt", "w") as fileout:
    for line in data:
        # You might have to be smarter about splitting the line to
        # take things like punctuation into consideration.
        for word in line.split():
            if word in keywords:
                fileout.write(line)
                break

Here is an example of tokenizing that takes punctuation into account.
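As a rough illustration of punctuation-aware tokenizing, the sketch below extracts runs of word characters with re.findall instead of calling line.split(). The helper name lines_with_keywords and the \w+ token pattern are assumptions made for this example, not part of the original answer; the pattern would need adjusting if keywords can contain spaces or symbols.

```python
import re

# Assumed token definition: a run of letters, digits or underscores.
WORD_RE = re.compile(r"\w+")

def lines_with_keywords(lines, keywords):
    """Return the lines that contain at least one keyword as a whole token."""
    keywords = set(keywords)
    kept = []
    for line in lines:
        # findall leaves out quotes, parentheses and other punctuation
        tokens = WORD_RE.findall(line)
        if any(token in keywords for token in tokens):
            kept.append(line)
    return kept

# Hypothetical in-memory sample standing in for database.txt
sample = ['"Name jhon" (1995)\n', '"Name foo" (2000)\n', '"Name george" (2500)\n']
matches = lines_with_keywords(sample, ["george", "1995"])
```

With plain line.split() the token would be george" (with the trailing quote) and the lookup would miss it, which is the case the comment in the code above warns about.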

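Here is my code:

import re

data = open('database.txt', 'r')
fileout = open("fileout.txt", "w+")

with open('listtosearch.txt', 'r') as f:
    keywords = [line.strip() for line in f]

# one big pattern can take time to match, so you have a list of them
patterns = [re.compile(keyword) for keyword in keywords]

for line in data:
    for pattern in patterns:
        if not pattern.search(line):
            break
    else:
        # reached only when no pattern failed, i.e. every pattern matched
        fileout.write(line)

I tested it with the following files: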

database.txt

"Name jhon" (1995)
"Name foo" (2000)
"Name fake" (3000)
"Name george" (2000)
"Name george" (2500)

listtosearch.txt

"Name (george)"
\(2000\)

This is what I got in fileout.txt:

"Name george" (2000)

So this should work on your machine too.
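Note that the for/else loop in this answer writes a line only when every pattern matches it, which is why a single line survives the test. If the goal is to keep lines that match at least one keyword, as the question describes, a variant along the following lines may be closer. This is only a sketch: the function name filter_any is hypothetical, and the test data above is held in memory instead of being read from files.

```python
import re

def filter_any(lines, keywords):
    """Keep the lines matched by at least one keyword regex."""
    # One compiled pattern per keyword keeps each pattern small
    patterns = [re.compile(keyword) for keyword in keywords]
    return [line for line in lines
            if any(p.search(line) for p in patterns)]

database = [
    '"Name jhon" (1995)',
    '"Name foo" (2000)',
    '"Name fake" (3000)',
    '"Name george" (2000)',
    '"Name george" (2500)',
]
keywords = ['"Name (george)"', r'\(2000\)']
result = filter_any(database, keywords)
```

On this data the "any" variant keeps three lines (the foo line and both george lines), whereas the "all" variant keeps only the george line from 2000.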

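You may want to look at the Aho-Corasick string matching algorithm. A working Python implementation is available as the pyahocorasick module imported below.

A simple usage example of this module: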

from pyahocorasick import Trie

words = ['foo', 'bar']

t = Trie()
for w in words:
    t.add_word(w, w)
t.make_automaton()

print [a for a in t.iter('my foo is a bar')]

>> [(5, ['foo']), (14, ['bar'])]

Integrating it into your code should be straightforward.
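For readers who cannot install that module, here is a minimal pure-Python sketch of the Aho-Corasick idea: build a trie of the keywords, add failure links, then scan the text once. The function names build_automaton and find_all are made up for this illustration and are not the module's API. Like the output above, it reports the index of the last character of each match.

```python
from collections import deque

def build_automaton(words):
    """Build trie transitions, failure links and per-state outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> keywords ending at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: a child's failure link is derived from its parent's
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix state
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (end_index, keyword) pairs for every keyword occurrence."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i, word))
    return hits

automaton = build_automaton(['foo', 'bar'])
hits = find_all('my foo is a bar', automaton)
```

To filter a file with it, one would build the automaton once from all keywords and keep each line for which find_all returns a non-empty list. The text is scanned a single time no matter how many keywords there are, which is the appeal with 500 of them.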

This may not be the most efficient solution, but try using a set and its intersection operation:

from_db = tuple([line.rstrip("\n") for line in open('database.txt') if line.rstrip('\n')])
keywords = set([line.rstrip("\n") for line in open('listtosearch.txt') if line.rstrip('\n')])
with open("output_file.txt", "w") as fp:
    for line in from_db:
        line_set = set(line.split(" "))
        if line_set.intersection(keywords):
            fp.write(line + "\n")

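The intersection checks for any strings the two sets have in common. Since hash values are compared, I expect the lookup to be faster than looping over the whole keyword list again and again.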

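It gives me the following error: pattern = re.compile("|".join(keywords)) File "/usr/lib/python2.7/re.py", line 190, in compile return _compile(pattern, flags) File "/usr/lib/python2.7/re.py", line 240, in _compile p = sre_compile.compile(pattern, flags) File "/usr/lib/python2.7/sre_compile.py", line 511, in compile "sorry, but this version only supports 100 named groups" AssertionError: sorry, but this version only supports 100 named groups

Well, it's telling you that you can't have more than 100 subexpressions in a regex pattern. Not your fault. brice's answer will work.

Actually, even when I run brice's code it gives me exactly the same error :(

@user2447387, that's impossible. My code doesn't use the re module, and it doesn't contain the offending line.

I know, sorry, my bad! Let me run it properly.

It gives me these errors: pattern = re.compile('|'.join(keywords)) File "/usr/lib/python2.7/re.py", line 190, in compile return _compile(pattern, flags) File "/usr/lib/python2.7/re.py", line 240, in _compile p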
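The error quoted in these comments is raised when the combined pattern contains more than 100 capturing groups, which can easily happen when hundreds of keywords containing parentheses are joined with "|". Assuming the keywords are meant as literal text and not as regular expressions, one possible workaround is to escape each keyword first, so the parentheses become ordinary characters. The function name build_literal_pattern and the generated keyword list are hypothetical.

```python
import re

def build_literal_pattern(keywords):
    """Compile one alternation that matches any keyword literally."""
    # re.escape turns '(' and ')' into literal characters, so the
    # combined pattern contains no capturing groups at all.
    # Blank keywords are skipped; an empty alternative would match every line.
    escaped = [re.escape(k) for k in keywords if k]
    return re.compile("|".join(escaped))

# 500 made-up keywords that all contain parentheses, e.g. "(1995)"
keywords = ["(%d)" % year for year in range(1900, 2400)]
pattern = build_literal_pattern(keywords)

lines = ['"Name jhon" (1995)', '"Name foo" 2000', '"Name fake" (3000)']
kept = [line for line in lines if pattern.search(line)]
```

If the keywords really are regular expressions, escaping changes their meaning; in that case compiling them separately, or in chunks of fewer than 100 groups, avoids the limit.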