How to solve a memory error with spaCy's PhraseMatcher? (Python 3.x)


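High-level background

I am working on a project whose first step is to search a large text corpus for keywords and phrases. I want to identify the paragraphs/sentences in which these keywords occur. Later, I want to make these passages accessible through my local Postgres DB so that users can query the information. The data is stored on Azure Blob Storage, and I use a Minio server to connect it to my Django application.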

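Actual problem

First my shell was killed; then, after some trial refactoring and debugging, I got a memory error when running my script, which:

samples 30 random text documents from the blob storage (I want to sample 10,000, but it already breaks at this low number)

preprocesses the raw text for the NLP task

streams the text through spaCy's nlp.pipe to get the docs, and

streams the list of docs to the PhraseMatcher (whose on_match callback appends the rule ID, the start token of the sentence containing the match, the sentence itself, and a hash ID to a match list)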

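At first, the shell was simply killed. I looked into the log files and saw that it was a memory error, but to be honest I am quite new to this topic.

After rearranging the code, I got a MemoryError directly in the shell. It is raised inside language.pipe(), at the step that streams the texts to spaCy.

Code excerpts

Functions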

import en_core_web_sm
from spacy.matcher import PhraseMatcher

# Function that samples filing documents
def random_Filings(amount):
  ...
  return random_list

# Function that connects to storage and saves cleaned text
def get_clean_text(random_list):
  try:
    text_contents = S3Client().get_buffer(remote_path)
  ...
  return clean_list

# Callback that the PhraseMatcher invokes for each match
def on_match(matcher, doc, i, matches):
  match_id, start, end = matches[i]
  rule_id = nlp.vocab.strings[match_id]  # hash -> rule name
  token = doc[start]
  sent_of_token = token.sent  # sentence containing the first matched token
  match_list.append([str(rule_id), sent_of_token.start, sent_of_token,
                     doc.user_data])

def match_text_stream(clean_texts):
  some_pattern = [nlp(text) for text in ('foo', 'bar')]
  some_other_pattern = [nlp(text) for text in ('foo bar', 'barara')]

  matcher = PhraseMatcher(nlp.vocab)

  # spaCy v2-style call: add(key, on_match, *patterns)
  matcher.add('SOME', on_match, *some_pattern)
  matcher.add('OTHER', on_match, *some_other_pattern)

  doc_list = []

  for doc in nlp.pipe(clean_texts, batch_size=30):
    doc_list.append(doc)

  # Running the matcher triggers on_match for every match
  for doc in matcher.pipe(doc_list, batch_size=30):
    pass
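
One thing that may add to the memory pressure in the functions above is that every Doc is collected in doc_list, and that match_list stores Span objects, each of which keeps its whole Doc alive. A minimal sketch of a lazier variant is shown below. It is not the asker's code: the name stream_matches is hypothetical, and it assumes that calling matcher(doc) returns (match_id, start, end) tuples and that the pipeline sets sentence boundaries. It only keeps plain strings and integers per match, so each Doc can be freed once it has been processed. Since the traceback points to an allocation inside the model while a single text is processed, this alone may not be enough for very large documents.

```python
from typing import Iterable, Iterator, Tuple


def stream_matches(nlp, matcher, texts: Iterable[str],
                   batch_size: int = 30) -> Iterator[Tuple[str, int, str]]:
    """Lazily yield (rule_id, sentence_start, sentence_text) per match.

    `nlp` is assumed to be a loaded spaCy pipeline and `matcher` a
    PhraseMatcher built on `nlp.vocab`; both are passed in by the caller.
    """
    # nlp.pipe is consumed as a generator, so no list of Docs is built.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        for match_id, start, end in matcher(doc):
            sent = doc[start].sent
            # Keep only plain values; storing the Span itself would keep
            # the whole Doc referenced for as long as the result lives.
            yield nlp.vocab.strings[match_id], sent.start, sent.text
```

With this variant the results could be consumed one at a time, for example `for rule_id, sent_start, sent_text in stream_matches(nlp, matcher, clean_texts): ...`, and written to the database as they arrive.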

Problem steps:

match_list = []

nlp = en_core_web_sm.load()
sample_list = random_Filings(30)
clean_texts = get_clean_text(sample_list)
match_text_stream(clean_texts)

print(match_list)

Error message

MemoryError
<string> in match_text_stream(clean_text)

../spacy/language.py in pipe(self, texts, as_tuples, n_threads, batch_size, disable, cleanup, component_cfg)

709 original_strings_data = None
710 nr_seen = 0
711 for doc in docs:
712   yield doc
713   if cleanup:

MemoryError


../thinc/neural/_classes/convolution.py in begin_update(self, X__bi, drop)

31
32 def begin_update(self, X__bi, drop=0.0):
33   X__bo = self.ops.seq2col(X__bi, self.nW)
34   finish_update = self._get_finish_update()
35   return X__bo, finish_update

ops.pyx in thinc.neural.ops.NumpyOps.seq2col()
ops.pyx in thinc.neural.ops.NumpyOps.allocate()

MemoryError:

The solution is to cut up your documents into small pieces before training. Paragraph units work quite well, or possibly sections.

At the moment I am not training anything; I am using the PhraseMatcher to identify these paragraphs. Thanks for the hint. I did notice that some of the documents are very large, and my VM has too little memory to handle them...
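
The advice to cut documents into smaller pieces before they reach the pipeline could look roughly like the sketch below. The names split_into_paragraphs, iter_chunks and max_chars are hypothetical, and treating blank lines as paragraph boundaries is an assumption about the cleaned text, not something stated in the question. Paragraphs longer than max_chars are cut at a fixed character offset, which may split a word or sentence.

```python
import re
from typing import Iterable, Iterator


def split_into_paragraphs(text: str, max_chars: int = 10_000) -> Iterator[str]:
    """Yield non-empty paragraphs, none longer than max_chars characters."""
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Oversized paragraphs are hard-split so that no single chunk
        # handed to the pipeline exceeds max_chars.
        for offset in range(0, len(paragraph), max_chars):
            yield paragraph[offset:offset + max_chars]


def iter_chunks(texts: Iterable[str], max_chars: int = 10_000) -> Iterator[str]:
    """Lazily flatten many documents into a stream of small chunks."""
    for text in texts:
        yield from split_into_paragraphs(text, max_chars)
```

The resulting generator could be passed to the pipeline in place of the full texts, for example `nlp.pipe(iter_chunks(clean_texts), batch_size=30)`. Sentence and token offsets are then relative to each chunk rather than to the original document, so a document identifier would have to be carried along separately if it is needed later.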