Why is this Python method leaking memory?

python memory-leaks

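This method iterates over a list of terms in the database, checks whether each term appears in the text passed as an argument and, if it does, replaces it with a link to the search page with the term as a parameter.

The number of terms is high (about 100,000), so the process is pretty slow, but that is OK since it runs as a cron job. However, it makes the script's memory consumption skyrocket, and I can't find out why: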

class SearchedTerm(models.Model):

[...]

@classmethod
def add_search_links_to_text(cls, string, count=3, queryset=None):
    """
        Take a list of all researched terms and search them in the 
        text. If they exist, turn them into links to the search
        page.

        This process is limited to `count` replacements maximum.

        WARNING: because the sites got different URLS schemas, we don't
        provides direct links, but we inject the {% url %} tag 
        so it must be rendered before display. You can use the `eval`
        tag from `libs` for this. Since they got different namespace as
        well, we enter a generic 'namespace' and delegate to the 
        template to change it with the proper one as well.

        If you have a batch process to do, you can pass a query set
        that will be used instead of getting all searched term at
        each calls.
    """

    found = 0

    terms = queryset or cls.on_site.all()

    # to avoid duplicate searched terms to be replaced twice 
    # keep a list of already linkified content
    # added words we are going to insert with the link so they won't match
    # in case of multi passes
    processed = set((u'video', u'streaming', u'title', 
                     u'search', u'namespace', u'href', u'title', 
                     u'url'))

    for term in terms:

        text = term.text.lower()

        # no small word and make
        # quick check to avoid all the rest of the matching
        if len(text) < 3 or text not in string:
            continue

        if found and cls._is_processed(text, processed):
            continue

        # match the search word with accent, for any case
        # ensure this is not part of a word by including 
        # two 'non-letter' character on both ends of the word
        pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % text, 
                            re.UNICODE|re.IGNORECASE)

        if re.search(pattern, string):
            found += 1

            # create the link string
            # replace the word in the description 
            # use back references (\1, \2, etc) to preserve the original
            # formatting
            # use raw unicode strings (ur"string" notation) to avoid
            # problems with accents and escaping

            query = '-'.join(term.text.split())
            url = ur'{%% url namespace:static-search "%s" %%}' % query
            replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url

            string = re.sub(pattern, replace_with, string)

            processed.add(text)

            if found >= count:
                break

    return string
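
The matching step in the method above can be sketched in isolation. This is a minimal sketch, assuming Python 3 (the original code is Python 2) and a hypothetical helper name `linkify_term`; it is not the author's implementation. It differs from the original in two ways: the term is passed through `re.escape`, so that a term containing regex metacharacters cannot break the pattern, and the replacement is built by a function instead of a back-reference template.

```python
import re


def linkify_term(text, term, url):
    """Wrap standalone occurrences of `term` in `text` with a link to `url`."""
    # re.escape keeps characters such as '+' or '(' in the term from being
    # interpreted as regex syntax (the original inserts the raw term).
    pattern = re.compile(r'([^\w]|^)(%s)([^\w]|$)' % re.escape(term),
                         re.UNICODE | re.IGNORECASE)

    def build_link(match):
        # Groups 1 and 3 are the non-word characters (or string boundaries)
        # around the term; group 2 is the term as it appears in the text.
        before, word, after = match.group(1), match.group(2), match.group(3)
        return '%s<a title="%s video streaming" href="%s">%s</a>%s' % (
            before, word, url, word, after)

    return pattern.sub(build_link, text)
```

For example, `linkify_term("I like C++ a lot", "c++", "/s/c")` links the word `C++` while keeping its original case, and a term that only occurs inside a longer word is left alone.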

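I really only have two objects holding references that could be suspects: `terms` and `processed`. But I can't see any reason for them not to be garbage collected.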

EDIT:

I guess I should say that this method is called from within a Django model method. I don't know if it's relevant, but here is the code:

class Video(models.Model):

[...]

def update_html_description(self, links=3, queryset=None):
    """
        Take a list of all researched terms and search them in the 
        description. If they exist, turn them into links to the search
        engine. Put the result into `html_description`.

        This uses `add_search_links_to_text` and has, therefore, the same 
        limitations.

        It DOESN'T call save().
    """
    queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
    text = self.description or self.title
    self.html_description = SearchedTerm.add_search_links_to_text(text, 
                                                                  links, 
                                                                  queryset)

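I can imagine the automatic Python regex cache eating up some memory. But it should do so only once, whereas the memory consumption goes up at every call of `update_html_description`.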

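The problem is not just that it consumes a lot of memory, but that it never releases it: every call takes about 3% of the RAM, eventually filling it up and crashing the script with "cannot allocate memory".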

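The whole queryset is loaded into memory once you call it, and that is what eats up your memory. If the result set is that large, fetch it in chunks: that may mean more hits on the database, but it will mean a lot less memory consumption.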
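
The chunking idea in this answer can be sketched as a small generator. This is only an illustration: `iter_in_chunks` and `chunk_size` are hypothetical names, not part of Django. It relies on the fact that slicing an unevaluated Django queryset is translated into `LIMIT`/`OFFSET` in SQL, so only one chunk of rows is held at a time. Any object that supports slicing works, which is why a plain list is enough to try it out.

```python
def iter_in_chunks(sliceable, chunk_size=1000):
    """Yield items from `sliceable`, fetching `chunk_size` items at a time."""
    start = 0
    while True:
        # With a Django queryset, each slice becomes its own LIMIT/OFFSET query.
        chunk = list(sliceable[start:start + chunk_size])
        if not chunk:
            break
        for item in chunk:
            yield item
        start += chunk_size
```

With a queryset, the results should have a stable ordering (for example `order_by('pk')`), otherwise consecutive slices may overlap or skip rows. Django also provides `QuerySet.iterator()`, which avoids filling the queryset's result cache.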

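Make sure that you aren't running in DEBUG. With `DEBUG = True`, Django records every SQL query it executes in `connection.queries`, so a long-running process keeps accumulating them.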

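"I guess I should say that this method is called from within a Django model method."

`@classmethod`

Why? Why is this "class level"? Why aren't these ordinary methods that can have ordinary scope rules and, in the normal course of events, get garbage collected?

In other words (in the form of an answer): get rid of `@classmethod`.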

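I was totally unable to find the cause of the problem, but for now I'm working around it by isolating the infamous snippet: I call a script (using `subprocess`) that contains this method call. The memory goes up but, of course, goes back to normal once the Python process exits.

Talk about dirty.

But that's all I've got for now.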
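
The workaround described above can be sketched roughly as follows. This is a minimal sketch, assuming Python 3.7 or later; `run_isolated` is a hypothetical helper, and the code string stands in for the real script containing the method call. Whatever memory the child process allocates is returned to the operating system when it exits.

```python
import subprocess
import sys


def run_isolated(code):
    """Run `code` in a fresh Python interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],  # same interpreter as the parent
        capture_output=True,           # collect stdout/stderr instead of printing
        text=True,                     # decode output to str
        check=True,                    # raise CalledProcessError on failure
    )
    return result.stdout
```

The cost is one interpreter start-up per call, which is acceptable for a cron job but would be too slow inside a request handler.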

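It's nearly impossible to leak memory in a garbage-collected language like Python. Strictly speaking, a memory leak is memory that no variable references any more. In C++, if you allocate memory in a class but don't declare a destructor, you can get a memory leak. What you have here is high memory consumption.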

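:-) OK. Then I've got a high memory consumption that gets higher and higher after each call. But this is a method, and I don't keep a reference to anything after it finishes, so why is something still consuming memory? Updated the question about this.

Just to make sure: did you verify that this call is the cause of the memory consumption?

Yes: remove it, and the memory stays stable. I'm currently running without it, since it's not vital for the site.

I'm fine with the whole queryset being in memory. It's not that much: 100,000 strings at most, each wrapped in a model object. After each call, it should be garbage collected. The problem is that each call accumulates consumed memory. The first call takes 3% of the memory, the next call 6%, and so on. Updated the question about this.

^^ @classmethod is for the class method, which is called in update_html_description(self, links=3, queryset=None), which is an instance method. There is no confusion here.

@e-satis: I'm sure there's no confusion. There's also no need for it.

I added the names of the classes they belong to in order to clarify the issue: SearchedTerm has a class method that linkifies any text, and Video instances use it to produce their html_description. The reason add_search_links_to_text is a class method is, of course, that it's a utility method and is not meant to act on a SearchedTerm instance.

Since you're a 124K rep user, I just tested the code after turning all the class methods into instance methods. It didn't change anything. Considering that you were quite aggressive, gave a wrong answer, and that I've seen you do this several times, that's a -1 for you.

In my last comment I was referring to the chat suggestion. Relax, man. You're going to have a heart attack if you stay this tense. :-)

This is a prod server, with DEBUG set to False. But good catch: it is indeed a known cause of memory leaks, and your answer did make me check.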
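
The question raised in these comments, whether a given call really retains memory from one invocation to the next, can be checked with the standard library. This is a minimal sketch, assuming Python 3 (`tracemalloc` is not available in the Python 2 used by the original code); `memory_growth`, `leaky` and `clean` are hypothetical names used only for illustration.

```python
import gc
import tracemalloc


def memory_growth(func, calls=5):
    """Return the bytes still allocated after each call, relative to the start."""
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    growth = []
    for _ in range(calls):
        func()
        gc.collect()  # drop anything that is only kept alive by reference cycles
        current, _ = tracemalloc.get_traced_memory()
        growth.append(current - baseline)
    tracemalloc.stop()
    return growth


_retained = []


def leaky():
    # Keeps a reference in a module-level list, so memory grows on every call.
    _retained.append(bytearray(100000))


def clean():
    # The buffer becomes unreachable as soon as the function returns.
    bytearray(100000)
```

Calling `memory_growth(leaky)` should show a figure that climbs by roughly 100,000 bytes per call, while `memory_growth(clean)` should stay close to zero. A steady climb like the first one means something is still holding references between calls.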
