Python MemoryError when using requests and BeautifulSoup on EC2

Tags: python, amazon-web-services, amazon-ec2, beautifulsoup, python-requests

I am using requests and BeautifulSoup to parse Wikidata and build Person objects. This works, but when I do it in a loop I hit the MemoryError below after creating roughly 3,000 Person objects:

MemoryError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "TreeBuilder.py", line 11, in <module>
    ancestor = Person(next['id'])
  File "/home/ec2-user/Person.py", line 14, in __init__
    html = soup (data , 'lxml')
  File "/usr/local/lib/python3.7/site-packages/bs4/__init__.py", line 325, in __init__
    self._feed()
  File "/usr/local/lib/python3.7/site-packages/bs4/__init__.py", line 399, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/lib/python3.7/site-packages/bs4/builder/_lxml.py", line 324, in feed
    self.parser.feed(markup)
  File "src/lxml/parser.pxi", line 1242, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 1285, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 855, in lxml.etree._BaseParser._getPushParserContext
  File "src/lxml/parser.pxi", line 871, in lxml.etree._BaseParser._createContext
  File "src/lxml/parser.pxi", line 528, in lxml.etree._ParserContext.__cinit__
SystemError: <class 'lxml.etree._ErrorLog'> returned a result with an error set
The error does not occur on my local machine, where I run the program in PyCharm; it only occurs on the AWS EC2 server.

Update

See the code below. I added gc.collect() after every 100 iterations, but that does not seem to help.

Person.py and TreeBuilder.py are both listed at the end of this post.


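Annoying as it is, the solution is to wrap every reference to a BS object in a str() call. When you do birth[0].string, it looks as though you are storing only a reference to a bare string, but when this happens in a loop, references to BS objects keep piling up in memory. Give it a try: I just ran your code with this change and the memory footprint stayed low.

Just make sure you catch every reference, for example: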

self.children.append({'name': str(a.string), 'id': str(a['title'])})


Edit: see this section for a possible reason why BS ends up keeping these references.
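
To illustrate the point of this answer: in Beautiful Soup, .string returns a NavigableString, a str subclass that keeps a .parent link back into the parse tree, so holding on to one can keep the whole tree reachable. The following is a minimal sketch rather than the original poster's code. The sample HTML, the names in it, and the helper extract_children are made up for illustration, and it uses the built-in html.parser instead of lxml so that bs4 is the only dependency.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, loosely shaped like the "P40" (children) block the question parses
SAMPLE_HTML = """
<div id="P40">
  <div class="wikibase-statementview"><a title="Q1">Alice</a></div>
  <div class="wikibase-statementview"><a title="Q2">Bob</a></div>
</div>
"""


def extract_children(markup):
    """Return children as dicts of plain str values (hypothetical helper)."""
    tree = BeautifulSoup(markup, "html.parser")
    children = []
    for link in tree.find_all("a"):
        # str() copies the text into an ordinary string with no link to the tree
        children.append({"name": str(link.string), "id": str(link["title"])})
    return children


tree = BeautifulSoup(SAMPLE_HTML, "html.parser")
raw_name = tree.find("a").string

print(type(raw_name).__name__)      # NavigableString
print(raw_name.parent.name)         # a  (the string still points into the tree)
print(type(str(raw_name)) is str)   # True
print(extract_children(SAMPLE_HTML))
```

Because the dictionaries returned by extract_children hold only plain strings, nothing that outlives the function refers to the parse tree, and the garbage collector is free to reclaim it after the function returns.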

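How much RAM do you have (on EC2)?
It's a t2.micro with 1 GB of memory and 1 vCPU.
Did you check whether you are actually running out of memory?
I am indeed running out of memory, but I don't know why. I'm not storing any of the objects, so they should be garbage collected once I'm done with them, shouldn't they?
Still, check whether the data stays in memory when you run it on your local PC (even just with taskmgr or an equivalent tool).
Do you know why they keep the references in memory?
I changed the class into a function that returns a dictionary instead of creating an object, and it seems to work fine, even without wrapping the BS objects in str().
Mind posting the modified, working code?
This may be an odd quirk of the BS library, since others on SO have run into the same problem in unexpected situations.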
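# Fragment meant to sit inside a function body (hence the return)
try:
    data = requests.get(url).text
    html = soup(data, 'lxml')
except MemoryError:
    # Give up on this page instead of crashing the whole run
    return None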
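
One way to follow up on the question raised in the comments (is memory really growing from one iteration to the next?) is the standard-library tracemalloc module. The sketch below is not from the original thread: measure_growth, leaky_step and clean_step are hypothetical names, and the two step functions merely stand in for one loop iteration. tracemalloc only sees allocations made through Python's own allocators, so memory allocated internally by a C library may not show up in these numbers.

```python
import tracemalloc


def measure_growth(step, iterations=5):
    """Call step() repeatedly; return traced memory in bytes after each call."""
    tracemalloc.start()
    usage = []
    try:
        for _ in range(iterations):
            step()
            current, _peak = tracemalloc.get_traced_memory()
            usage.append(current)
    finally:
        tracemalloc.stop()
    return usage


# Hypothetical stand-ins for one iteration of the scraping loop
retained = []


def leaky_step():
    # Keeps a reference alive, the way a stored BS string keeps its tree alive
    retained.append(bytearray(100_000))


def clean_step():
    # Allocates the same amount but drops it immediately
    bytearray(100_000)


leaky = measure_growth(leaky_step)
clean = measure_growth(clean_step)
print("leaky growth:", leaky[-1] - leaky[0], "bytes")
print("clean growth:", clean[-1] - clean[0], "bytes")
```

If the per-iteration numbers for the real loop keep climbing the way leaky_step does, something is still holding references; if they stay flat, the Python-level objects are being released.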
# Person.py
import requests
from bs4 import BeautifulSoup as soup


class Person:

    def __init__(self, id):
        url = 'https://www.wikidata.org/wiki/' + id
        data = requests.get(url).text
        html = soup(data, 'lxml')

        ### id ###
        self.id = id

        ### Name ###
        label = html.find("span", {"class": "wikibase-title-label"})
        if label is not None:
            self.name = label.string
        else:
            self.name = ""

        ### Born ###
        self.birth = ""

        birth = html.find("div", {"id": "P569"})
        if birth is not None:
            birth = birth.find_all("div", {"class": "wikibase-snakview-variation-valuesnak"})
            if len(birth) > 0:
                self.birth = birth[0].string

        ### Death ###
        self.death = ""

        death = html.find("div", {"id": "P570"})
        if death is not None:
            death = death.find_all("div", {"class": "wikibase-snakview-variation-valuesnak"})
            if len(death) > 0:
                self.death = death[0].string

        #### Sex ####
        self.sex = ""  # default so the attribute always exists

        sex = html.find("div", {"id": "P21"})
        if sex is not None:
            for item in sex.strings:
                if item == 'male' or item == 'female':
                    self.sex = item

        ### Mother ###
        self.mother = ""

        mother = html.find("div", {"id": "P25"})
        if mother is not None:
            mother = mother.find_all("div", {"class": "wikibase-snakview-variation-valuesnak"})
            if len(mother) > 0:
                self.mother = {"name": mother[0].string, "id": mother[0].find('a')['title']}

        ### Father ###
        self.father = ""

        father = html.find("div", {"id": "P22"})
        if father is not None:
            father = father.find_all("div", {"class": "wikibase-snakview-variation-valuesnak"})
            if len(father) > 0:
                self.father = {"name": father[0].string, "id": father[0].find('a')['title']}

        ### Children ###
        self.children = []
        x = html.find("div", {"id": "P40"})
        if x is not None:
            x = x.find_all("div", {"class": "wikibase-statementview"})

            for i in x:
                a = i.find('a')
                # Only keep links whose title is a Wikidata item id (starts with 'Q')
                if a is not None and a['title'][0] == 'Q':
                    self.children.append({'name': a.string, 'id': a['title']})

    def __str__(self):
        # mother/father are "" when unknown, so guard before indexing
        mother = self.mother['name'] if self.mother else ""
        father = self.father['name'] if self.father else ""
        return (self.name + "\n\tBirth: " + self.birth + "\n\tDeath: " + self.death +
                "\n\n\tMother: " + mother + "\n\tFather: " + father +
                "\n\n\tNumber of Children: " + str(len(self.children)))

# TreeBuilder.py
from Person import Person
import gc
import sys

file = open('ancestors.txt', 'w+')

# Breadth-first walk over relatives, starting from a single root entry
ancestors = [{'name': 'Charlemagne', 'id': 'Q3044'}]
all = [ancestors[0]['id']]  # ids already queued, so nobody is visited twice
i = 1

while ancestors != []:
    next = ancestors.pop(0)
    ancestor = Person(next['id'])

    for child in ancestor.children:
        if child['id'] not in all:
            all.append(child['id'])
            ancestors.append(child)

    if ancestor.mother != "" and ancestor.mother['id'] not in all:
        all.append(ancestor.mother['id'])
        ancestors.append(ancestor.mother)

    if ancestor.father != "" and ancestor.father['id'] not in all:
        all.append(ancestor.father['id'])
        ancestors.append(ancestor.father)

    file.write(ancestor.id + "*" + ancestor.name + "*" + "https://www.wikidata.org/wiki/" + ancestor.id + "*" + str(ancestor.birth) + "*" + str(ancestor.death) + "\n")

    # Every 100 iterations: print progress and force a garbage collection
    if i % 100 == 0:
        print(ancestor.name + " (" + ancestor.id + ")" + " - " + str(len(all)) + " - " + str(len(ancestors)) + " - " + str(sys.getsizeof(all)))
        gc.collect()

    i += 1

file.close()
print("\nDone!")