Scraping forum threads with Python: how to work out reply relationships from the CSS margin property?

python xpath scrapy

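I have been struggling with this one.

The target site of my scraping job is an old-fashioned forum. On its index pages, each thread is in a <div> tag and each post is in a <p> tag. Follow-up posts are indented by 20px in the left margin to show the reply relationship: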

<div>
  <p style="margin:2px 0 17px 0px; width:705px"><a href="./6368972.html" class="post">original post</a>other stuff</p>
  <p style="margin:2px 0 2px 20px; width:683px"><a href="./6368973.html" class="post">reply post</a>other stuff</p>
  <p style="margin:2px 0 2px 40px; width:661px"><a href="./6368974.html" class="post">reply post</a>other stuff</p>
  ...
</div>

You can use the re:test XPath function to match the style attribute against a regular expression:

>[1]: sel.xpath('//p[re:test(@style,"margin[^;]+20px")]').extract()
<[1]: ['<p style="margin:2px 0 2px 20px; width:683px"><a href="./6368973.html" class="post">reply post</a>other stuff</p>']
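
Note that the pattern above matches any margin declaration that contains `20px`, so it would also hit `120px` or a top margin of 20px. If the exact indent matters, one option is to read the left margin out of the `margin` shorthand in plain Python. This is only a minimal sketch using the standard library: the helper name `left_margin_px` is hypothetical, and it assumes pixel values (or a bare `0`) as in the sample HTML.

```python
import re


def left_margin_px(style):
    """Return the left margin in px from an inline style string, or None."""
    match = re.search(r"margin\s*:\s*([^;]+)", style)
    if match is None:
        return None
    values = match.group(1).split()
    if not values:
        return None
    # CSS margin shorthand:
    #   1 value      -> all four sides
    #   2 or 3 values -> the second value is left/right
    #   4 values     -> top, right, bottom, left
    if len(values) == 1:
        left = values[0]
    elif len(values) < 4:
        left = values[1]
    else:
        left = values[3]
    # Assumption: only "<n>px" or a bare number is handled, nothing like em or %.
    number = re.fullmatch(r"(\d+)(?:px)?", left)
    return int(number.group(1)) if number else None


print(left_margin_px("margin:2px 0 2px 20px; width:683px"))
```

The value of each post's `@style` attribute can be passed to this helper, and the result compared with the expected indent.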

This is untested, but it might work:
parent = []  # stack of URLs of ancestor posts
for p in div.xpath('./p'):  # div is the selector for one thread's <div>
    post = {}
    # do whatever extraction from post here -- title, datetime etc.
    # post['title'] = p.xpath(...)
    # ...
    post['url'] = p.xpath('./a/@href').extract_first()

    post['reply_to'] = parent.pop() if parent else None
    # left margin in px, e.g. 20 for "margin:2px 0 2px 20px; ..."
    margin = int(p.xpath('./@style').re_first(r'.* (\d+)px;.*'))

    next_p = p.xpath('./following-sibling::p[1]')
    if next_p:
        next_margin = int(next_p.xpath('./@style').re_first(r'.* (\d+)px;.*'))
        if next_margin > margin:
            # next post is a reply to this post
            if post['reply_to']:
                parent.append(post['reply_to'])
            parent.append(post['url'])
        elif next_margin == margin:
            # next post is a reply to direct parent post
            parent.append(post['reply_to'])
        else:
            # next post is a reply to some distant parent post
            # integer division: range() needs an int on Python 3
            for _ in range((margin - next_margin) // 20 - 1):
                parent.pop()

    yield post

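Basically, it uses a stack to store links to the parent posts in the thread tree. This way you don't have to search back and forth in the tree to find the post the current one replies to; you visit each node only once (well, twice, since you always look at the next sibling).
It would probably be easier with XPath and regular expressions, but I think Scrapy selectors only use XPath 1.0, which doesn't support that. Correct me if I'm wrong.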
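
The stack idea can also be tried outside Scrapy on plain (url, margin) pairs. The following is only a sketch and a slight variation on the code above: instead of looking ahead at the next sibling, it keeps the margin together with each URL on the stack and pops until the top entry is less indented than the current post. The function name `link_replies` is hypothetical, and the input is assumed to be in page order.

```python
def link_replies(posts):
    """Given (url, left_margin_px) pairs in page order, return one dict per
    post with its 'url' and the 'reply_to' url (None for a top-level post)."""
    ancestors = []  # stack of (margin, url) for the current chain of parents
    linked = []
    for url, margin in posts:
        # Anything indented as much as, or more than, this post cannot be
        # its parent, so drop it from the stack.
        while ancestors and ancestors[-1][0] >= margin:
            ancestors.pop()
        reply_to = ancestors[-1][1] if ancestors else None
        linked.append({"url": url, "reply_to": reply_to})
        ancestors.append((margin, url))
    return linked


# The three posts from the sample HTML in the question.
thread = [
    ("./6368972.html", 0),
    ("./6368973.html", 20),
    ("./6368974.html", 40),
]
for post in link_replies(thread):
    print(post)
```

Because the comparison uses the margins themselves, this version does not depend on the indent step being exactly 20px.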
It's not quite clear what exactly you want to do. Could you share the desired output for the provided HTML source?

This looks really promising, I'll play with it when I'm at my computer. Meanwhile, to be honest, I don't understand the underscore _ in for _ in range((margin - next_margin) / 20 - 1):

...works like a charm, exactly what I wanted! For anyone who needs this code: convert (margin - next_margin) / 20 - 1 to an integer. You may also need to change yield post to something else, at least that was the case when I tried this in the Scrapy shell, before adding the code to my spider file. Thank you @TomášLinhart

Regarding the underscore: I just need to loop a few times there, but I'm not interested in the particular values produced by range. The underscore is a valid identifier for a variable, and in this use case it serves as a kind of dummy identifier.
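
To illustrate the two points from these comments, here is a small sketch with made-up values (the margins and the `stack` contents are not taken from a real page). On Python 3, `/` returns a float, which `range()` rejects, while `//` gives an integer. The `_` is just a name for a loop counter that is never read.

```python
margin, next_margin = 60, 20  # example values only

try:
    # "/" is true division on Python 3, so this passes a float to range()
    range((margin - next_margin) / 20 - 1)
except TypeError as exc:
    print("float rejected:", exc)

stack = ["post-a", "post-b", "post-c"]
# "_" signals that the counter itself is not used, only the repetition.
for _ in range((margin - next_margin) // 20 - 1):
    stack.pop()
print(stack)
```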