Using the Python robotparser module


I can't figure out how to use the parse function in the robotparser module. Here is what I tried:

In [28]: rp.set_url("http://anilattech.wordpress.com/robots.txt")

In [29]: rp.parse("""# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow:""")

In [48]: rp.can_fetch("*","http://anilattech.wordpress.com/signup/")
Out[48]: True

It seems that rp.entries is []. I don't understand what is going wrong. I tried a simpler example, but the problem is the same.

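There are two problems here. First, the rp.parse method takes a list of strings, so you should add .split("\n") to that line.

The second problem is that the rules for the * user agent are stored in rp.default_entry, not in rp.entries. If you inspect it, you will see that it contains an Entry object.

I'm not sure who is at fault, but the Python implementation of the parser only honors the first User-agent: * section, so in the example you gave only /next/ is disallowed. The other Disallow lines are ignored. I haven't read the specification, so I can't say whether this is a malformed robots.txt file or whether the Python code is wrong. I would assume the former, though.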
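
A minimal sketch of both points, assuming Python 3, where the module is urllib.robotparser (it was the top-level robotparser module in Python 2). The shortened robots.txt and the example.com URLs are made up for illustration, and default_entry is an undocumented attribute, so its behavior may differ between Python versions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, shortened robots.txt with two separate "User-agent: *" groups.
ROBOTS_TXT = """\
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /signup/
"""

# Passing the raw string: parse() iterates over it character by character,
# so no rule is ever recognized.
rp_str = RobotFileParser()
rp_str.parse(ROBOTS_TXT)
string_parsed_nothing = rp_str.default_entry is None and rp_str.entries == []

# Passing a list of lines: the rules for "*" end up in default_entry,
# while entries stays empty because no other user agent is named.
rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Only the first "User-agent: *" group is kept by the parser.
blocked_next = not rp.can_fetch("*", "http://example.com/next/")
blocked_signup = not rp.can_fetch("*", "http://example.com/signup/")

print(string_parsed_nothing, blocked_next, blocked_signup)
```

With the parser shipped in current CPython 3 releases, /next/ is reported as blocked and /signup/ is not, which matches the behavior described above.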
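I just found the answers:

1. The problem is that this robots.txt [from wordpress.com] contains multiple User-agent declarations, which the robotparser module does not support. A small hack, removing the surplus User-agent: * lines, solved the problem.
2. The argument to parse must be a list, as Andrew pointed out.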
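
The workaround of dropping the repeated User-agent: * lines could be sketched roughly as follows. This is not the asker's actual code: the helper name drop_repeated_wildcard_agents and the shortened robots.txt are hypothetical, Python 3's urllib.robotparser is assumed, and the approach is only safe when no group for another user agent appears after the first * group (otherwise the later rules would attach to that other agent).

```python
from urllib.robotparser import RobotFileParser


def drop_repeated_wildcard_agents(text):
    """Return robots.txt lines with every 'User-agent: *' after the first removed."""
    seen_wildcard = False
    kept = []
    for line in text.splitlines():
        field, _, value = line.partition(":")
        is_wildcard = field.strip().lower() == "user-agent" and value.strip() == "*"
        if is_wildcard and seen_wildcard:
            # Skip the repeat so its rules join the first "*" group.
            continue
        seen_wildcard = seen_wildcard or is_wildcard
        kept.append(line)
    return kept


# Hypothetical, shortened version of the file from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(drop_repeated_wildcard_agents(ROBOTS_TXT))  # parse() expects a list of lines

next_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/next/")
signup_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/signup/")
other_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/about/")

print(next_allowed, signup_allowed, other_allowed)
```

After the merge, both /next/ and /signup/ are reported as disallowed, while other paths remain allowed because of the trailing empty Disallow: line.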