Scrapy: unexpected symbols in LinkExtractor

Tags: python, python-2.7, scrapy

I am learning the Scrapy library and trying to write a small crawler.

Here are the crawler's rules:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]')),
    # Rule(LinkExtractor(restrict_xpaths='//span[@class="update_title"]/a'), callback='parse_item'),
)
But I get this output, where the second request fails with a 404:

DEBUG: Crawled (200) <GET http://web/category.php?id=4&> (referer: None)
DEBUG: Crawled (404) <GET http://web/%0D%0Acategory.php?id=4&page=2&s=d> (referer: http://web/category.php?id=4&)
DEBUG: Ignoring response <404 http://web/%0D%0Acategory.php?id=4&page=2&s=d>: HTTP status code is not handled or not allowed
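To see what the extractor returns, I passed a process_value function that prints each extracted value, and changed the rules to: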

rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]', process_value=process_value)),
    # Rule(LinkExtractor(restrict_xpaths='//span[@class="update_title"]/a'), callback='parse_item'),
)
The print statement outputs the following:

Crawled (200) <GET http://web/category.php?id=4&>(referer: None)
http://web/
category.php?id=4&page=2&s=d&
Crawled (404) <GET http://web/%0D%0Acategory.php?%0D=&id=4&page=2&s=d>(referer: http://web/category.php?id=4&)

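%0D and %0A are the percent-encoded (URL-encoded) forms of the CR (carriage return, 0x0D) and LF (line feed, 0x0A) characters.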

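The author of the site you are parsing put these characters into the HTML document. I assume this was accidental, because they are not visible in an IDE or a browser.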
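As a quick check of the encoding claim, the sketch below decodes the path from the 404 request in the question's log. It assumes Python 3 and `urllib.parse.unquote`; the question itself targets Python 2.7, where the equivalent call is `urllib.unquote`.

```python
from urllib.parse import unquote

# Path copied from the 404 request in the question's log.
encoded_path = "/%0D%0Acategory.php"

decoded_path = unquote(encoded_path)

# repr() makes the otherwise invisible characters visible.
print(repr(decoded_path))

# %0D decodes to carriage return (0x0D), %0A to line feed (0x0A).
assert decoded_path == "/\r\ncategory.php"
assert ord("\r") == 0x0D and ord("\n") == 0x0A
```

Printing with `repr()` is the simplest way to spot such characters, since a plain `print` only shows them as a line break.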


I suggest stripping every link you extract, like this:

href = href.strip()

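Can you show the code that extracts the href attribute? My guess is that you need to strip the relative URL first and only then make the request. Stripping removes the carriage return (%0D) and line feed (%0A) characters.

Thanks, but for some reason .strip() does not work :(

Try url = urllib.unquote(url.decode('utf8')) before stripping.

Sorry for the poor answer: my laptop ran out of power.

Actually, there are open issues and PRs in Scrapy about this problem.

It seems .strip() cannot handle this well-formed HTML :D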
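The value printed in the question shows the line break after `http://web/`, that is, in the middle of the already-joined URL, which would explain why stripping only the ends has no effect. Below is a minimal sketch of a callback that removes CR, LF and tab characters anywhere in the value. The name `remove_line_breaks` is hypothetical, and the sample string assumes the line break is CR followed by LF, as the `%0D%0A` in the log suggests; `process_value` is the `LinkExtractor` parameter already used in the question.

```python
import re

# Matches carriage return, line feed and tab anywhere in the string.
_CONTROL_WHITESPACE = re.compile(r"[\r\n\t]")


def remove_line_breaks(value):
    """Return value without CR/LF/tab characters and without outer whitespace."""
    return _CONTROL_WHITESPACE.sub("", value).strip()


# Assumed form of the value printed in the question:
# base URL, then CR LF, then the relative path.
raw = "http://web/\r\ncategory.php?id=4&page=2&s=d&"

# strip() only trims the ends, so the inner line break survives.
print(repr(raw.strip()))

# Removing the characters everywhere yields a usable URL.
print(repr(remove_line_breaks(raw)))

# Hypothetical usage with the rule from the question (requires Scrapy):
# Rule(LinkExtractor(restrict_xpaths='...', process_value=remove_line_breaks))
```

The regular expression is deliberately narrow: it drops only CR, LF and tab, so ordinary spaces inside a value are left for the normal URL handling to deal with.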