Python: Scrapy yield logic is broken


python scrapy


The following simple code causes problems on a vBulletin forum site:

class ForumSpider(CrawlSpider):
    ...

    rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths="//div[@class='threadlink condensed']"),
            callback='parse_threads'),
            )

    def parse_threads(self, response):

        thread = HtmlXPathSelector(response)

        # get the list of posts
        posts = thread.select("//div[@id='posts']//table[contains(@id,'post')]/*")

        # plist = []
        for p in posts:
            table = ThreadItem()

            table['thread_id'] = (p.select("//input[@name='searchthreadid']/@value").extract())[0].strip()

            string_id = p.select("../@id").extract() # returns a list
            p_id = string_id[0].split("post")
            table['post_id'] = p_id[1]

            # plist.append(table)
            # return plist
            yield table

Setting aside some flaws in the XPath, when I run this with yield I get very strange results: multiple hits for the same thread id and post id. Something like:

114763,1314728
114763,1314728
114763,1314728
114763,1314740
114763,1314740
114763,1314740

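When I switch back to the same logic with return (the commented-out lines), everything works fine. I suspect I am making some basic mistake with generators, but I can't figure it out. Why are the same posts hit multiple times? Why does the code work with return but not with yield?

The full code snippet is in a gist.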

It looks like an indentation problem. The following should behave the same as building a list and returning it:

def parse_threads(self, response):

    thread = HtmlXPathSelector(response)

    # get the list of posts
    posts = thread.select("//div[@id='posts']//table[contains(@id,'post')]/*")

    for p in posts:
        table = ThreadItem()

        table['thread_id'] = (p.select("//input[@name='searchthreadid']/@value").extract())[0].strip()

        string_id = p.select("../@id").extract() # returns a list
        p_id = string_id[0].split("post")
        table['post_id'] = p_id[1]

        yield table
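To make the list-versus-generator claim concrete, here is a minimal sketch in plain Python, with no Scrapy involved. The function names are hypothetical, and the two post ids are simply the ones from the output shown in the question. It illustrates that a generator and a list-returning function produce the same items as long as the return sits after the loop, and that a return indented inside the loop stops after the first item.

```python
def collect_as_list(post_ids):
    """Build the whole list first, then return it once, after the loop."""
    items = []
    for post_id in post_ids:
        items.append({"post_id": post_id})
    return items


def collect_as_generator(post_ids):
    """Yield each item as soon as it is built."""
    for post_id in post_ids:
        yield {"post_id": post_id}


def collect_with_early_return(post_ids):
    """A return indented inside the loop exits after the first item."""
    items = []
    for post_id in post_ids:
        items.append({"post_id": post_id})
        return items
    return items


# Sample ids taken from the output in the question.
post_ids = ["1314728", "1314740"]

as_list = collect_as_list(post_ids)
as_generator = list(collect_as_generator(post_ids))
early = collect_with_early_return(post_ids)

print(as_list == as_generator)  # the two approaches agree
print(len(early))               # only the first item survives
```

If the two versions of a callback disagree, it may be worth checking where the return statement actually sits relative to the loop before suspecting the generator itself.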

UPD: I have fixed and improved the code of the parse_threads method; it should work now:

def parse_threads(self, response):
    thread = HtmlXPathSelector(response)
    thread_id = thread.select("//input[@name='searchthreadid']/@value").extract()[0].strip()
    post_id = thread.select("//div[@id='posts']//table[contains(@id,'post')]/@id").extract()[0].split("post")[1]

    # get the list of posts
    posts = thread.select("//div[@id='posts']//table[contains(@id,'post')]/tr[2]")
    for p in posts:
        # getting user_name
        user_name = p.select(".//a[@class='bigusername']/text()").extract()[0].strip()

        # skip adverts
        if 'Advertisement' in user_name:
            continue

        table = ThreadItem()
        table['user_name'] = user_name
        table['thread_id'] = thread_id
        table['post_id'] = p.select("../@id").extract()[0].split("post")[1]

        yield table
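One difference between the two versions is the XPath used for posts: the original ends in /*, which selects every child of each post table, while the version above ends in /tr[2], which selects one row per table. The sketch below shows how that alone can produce repeated ids when each selected node reads the id of its parent. It uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and the markup is a hypothetical, heavily simplified stand-in for a vBulletin page, so treat it as an illustration of one possible contributor and not as a confirmed diagnosis.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two post tables with three rows each.
PAGE = """
<div id="posts">
  <table id="post1314728"><tr/><tr/><tr/></table>
  <table id="post1314740"><tr/><tr/><tr/></table>
</div>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def post_ids(rows):
    """Read the post id from the parent table of each selected row."""
    return [parent_of[row].get("id").split("post")[1] for row in rows]


every_child = root.findall("./table/*")         # like the original .../*
second_row_only = root.findall("./table/tr[2]")  # like the updated .../tr[2]

print(post_ids(every_child))
print(post_ids(second_row_only))
```

With this markup the first selection yields each id three times and the second yields each id once, which matches the pattern of triples in the question's output.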

Hope that helps.


Ah, my apologies! The indentation is correct in the code I am running. That was just a copy-and-paste error.

OK, then it should work. I don't see any errors. Can you show me the full code of your spider?

Let me know if you need items.py as well. As you can see, it is very simple right now. I spent all of yesterday on this problem, and I am really stumped by the yield issue.

Now that I look at it, it could be the continue statement inside the generator function.

Tested without continue: same problem. 10 unique IDs with return, 111 duplicated items with yield.