
Python: crawling non-Latin domains with Scrapy


I need to crawl some websites in the ".рф" domain zone with Scrapy. The URLs are structured like the example below (the URL is not real; it is given only as an example). The sites I am trying to crawl are, of course, reachable through a browser.

I tried to start crawling via the start_urls attribute, for example:

start_urls = ['http://сайтдляпримера.рф']

as well as via the start_requests method:

def start_requests(self):
    return [scrapy.Request("http://сайтдляпримера.рф/", callback=self._test)]

Neither of them worked as expected, and I got the following console messages:

2016-01-01 19:02:01 [scrapy] INFO: Spider opened
2016-01-01 19:02:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-01 19:02:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 1 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 2 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Gave up retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 3 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] ERROR: Error downloading <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84>: DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] INFO: Closing spider (finished)
*If it matters, I need to use Scrapy on a Linux-based OS.


Is there any solution for this? If possible, is there a way to fix it from the spider file itself, since I have no access to the framework's repository (nothing that handles HTTP requests has been modified there)?

When dealing with internationalized domain names (IDN), you need to encode the non-ASCII URL with the idna codec, and then decode the resulting bytes back into a unicode string. Note also that the ASCII part of the URL that makes up the protocol name ("http://") should be prepended separately, so that it does not get mangled during the idna encoding:

'http://' + u'сайтдляпримера.рф'.encode('idna').decode('utf-8')

For more details, see .
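As a runnable sketch of the same idea (Python 3 here, whereas the original thread used Python 2; the helper name is made up, and the domain is the placeholder from the question), the host can be encoded on its own and the scheme prepended afterwards:

```python
# Sketch (Python 3): encode only the non-ASCII hostname with the "idna"
# codec, keeping the "http://" scheme out of the encoding step.
def to_ascii_url(host, scheme="http"):
    """Return an all-ASCII (punycode) URL for a unicode hostname."""
    return scheme + "://" + host.encode("idna").decode("ascii")

url = to_ascii_url(u"сайтдляпримера.рф")
print(url)  # an ASCII URL of the form http://xn--....xn--p1ai
```

The resulting pure-ASCII URL is what Scrapy's DNS resolver can actually look up.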

I ran into an error when I tried something like u'http://сайтдляпримера.рф'.encode('idna'): first I get exceptions.TypeError, and with the result xn--http://-8fga3bl9al3aq0crdjw9y.xn--p1ai I then get "Unsupported URL scheme 'xn--http': no handler available for that scheme".

@Helvdan You could try something else: 'http://сайтдляпримера.рф'.encode('idna').decode('utf-8'). This converts the bytes into a unicode string.

It is the same error, exceptions.TypeError: must be unicode, not str, because 'somestr'.method1().method2() is a chained call, so there is no difference from your original answer.

@Helvdan OK, the problem seems to be caused by encoding the Latin prefix of the URL. Try this: 'http://' + u'сайтдляпримера.рф'.encode('idna'). I tried it myself with urlopen and it worked fine. Hope it works with scrapy as well.

You are a lifesaver. That worked. Could you edit your answer to match your last comment?
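The difference the comments converge on can be sketched directly (Python 3 here; the original thread used Python 2, so treat this as an illustration of the failure mode rather than a byte-for-byte reproduction of the comment output):

```python
# Encoding the whole URL folds the scheme into the first punycode label,
# producing a bogus "xn--http..." host, which Scrapy then rejects as an
# unsupported URL scheme.
wrong = u"http://сайтдляпримера.рф".encode("idna").decode("ascii")

# Encoding only the hostname and prepending the scheme afterwards yields
# a valid all-ASCII URL.
right = "http://" + u"сайтдляпримера.рф".encode("idna").decode("ascii")

print(wrong)  # begins with "xn--http..." -- not a usable URL
print(right)  # a normal http:// URL with a punycode host
```

This is why the accepted fix concatenates the 'http://' prefix only after the idna encoding step.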