Python Scrapy: unexpected symbols in LinkExtractor

python python-2.7 scrapy

I am studying the Scrapy library and trying to build a small crawler.

Here are the crawler's rules:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]')),
    # Rule(LinkExtractor(restrict_xpaths='//span[@class="update_title"]/a'), callback='parse_item'),
)

But I get this error message:

DEBUG: Crawled (200) <GET http://web/category.php?id=4&> (referer: None)
DEBUG: Crawled (404) <GET http://web/%0D%0Acategory.php?id=4&page=2&s=d> (referer: http://web/category.php?id=4&)
DEBUG: Ignoring response <404 http://web/%0D%0Acategory.php?id=4&page=2&s=d>: HTTP status code is not handled or not allowed

So I changed the rules to pass a process_value function:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]', process_value=process_value)),
    # Rule(LinkExtractor(restrict_xpaths='//span[@class="update_title"]/a'), callback='parse_item'),
)

The print statement in process_value outputs the following:

Crawled (200) <GET http://web/category.php?id=4&>(referer: None)
http://web/
category.php?id=4&page=2&s=d&
Crawled (404) <GET http://web/%0D%0Acategory.php?%0D=&id=4&page=2&s=d>(referer: http://web/category.php?id=4&)

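%0D and %0A are the percent-encoded (URL-encoded) forms of the CR (carriage return) and LF (line feed) characters.

The author of the site you are parsing put these characters into the HTML document. I suspect this was accidental, since they are not visible in an IDE or in a browser.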
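
To make the encoding concrete, here is a minimal sketch that decodes the prefix seen in the 404 URL. It uses only the standard library; the sample string is copied from the log above, and the import fallback is there only so the snippet runs on both Python 2.7 and Python 3.

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2.7

# Path fragment taken from the 404 request in the question's log.
encoded = "%0D%0Acategory.php"
decoded = unquote(encoded)

# The two escapes decode to a carriage return followed by a line feed.
print(repr(decoded))
```

The printed representation should start with `\r\n`, which confirms that the extracted link begins with a line break and not with a visible character.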
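An explanation of what these invisible characters mean, and more information about encoding, was linked from the original answer.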
I suggest stripping every link you need to fetch, like this:
href = href.strip()
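
As a rough illustration of that suggestion, the sketch below strips a raw href before resolving it against the page URL. The `raw_href` value is an assumption reconstructed from the log in the question (the real attribute value was not shown), and `urljoin` from the standard library stands in for whatever URL resolution the crawler performs.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

page_url = "http://web/category.php?id=4&"

# Assumed raw attribute value: the link is preceded by CR and LF.
raw_href = "\r\ncategory.php?id=4&page=2&s=d"

# strip() removes leading and trailing whitespace, including CR and LF.
href = raw_href.strip()

# Resolve the cleaned relative link against the page it was found on.
url = urljoin(page_url, href)
print(url)
```

Note that this only helps when the line break sits at the start or end of the string being stripped.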

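Can you show the code that extracts the href attribute? My guess is that you need to strip the relative URL first and then make the request. Stripping removes the carriage return (%0D) and line feed (%0A) characters.

Thanks, but for some reason .strip() doesn't work :(

Try doing this before the strip: url = urllib.unquote(url.decode('utf8'))

Sorry for the bad answer: my laptop ran out of power. Actually, there are open issues and PRs in Scrapy about this problem.

It seems .strip() can't handle this well-formed HTML code :D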