Scrapy Splash issue when returning multiple html snapshots

Tags: scrapy, scrapy-spider, scrapy-splash

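I am trying to use a Splash script to return multiple html pages (in one response, as shown in the docs) and to extract links from them. But I found that in response.text and response.body, whenever multiple pages are returned, the html content changes. That is not the case with response.data, which works fine. Why is this so?

I tried this with the same code (and website) as given in the docs (from a later section, the example with multiple html snapshots).

This is my start request -->

The Lua script looks like this -->

The results are: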

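response.data -->

{u'html': u'<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?i7azI8MkFRPfcPhHQ7HD">\n <link rel="shortcut icon" href="favicon.ico">\n <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">\n <title>Hacker News</title>
response.text and response.body -->

u'[{"html": "<html op=\\"news\\"><head><meta name=\\"referrer\\" content=\\"origin\\"><meta name=\\"viewport\\" content=\\"width=device-width, initial-scale=1.0\\"><link rel=\\"stylesheet\\" type=\\"text/css\\" href=\\"news.css?i7azI8MkFRPfcPhHQ7HD\\">\\n <link rel=\\"shortcut icon\\" href=\\"favicon.ico\\">\\n <link rel=\\"alternate\\" type=\\"application/rss+xml\\" title=\\"RSS\\" href=\\"rss\\">\\n <title>Hacker News</title></head>
Notice the extra \\ in the second case. These may be escape characters or something similar, but they interfere with LinkExtractor, which uses response.text, and the result is broken links. Again, this happens only when an array of html responses is returned.
What am I missing here?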

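The Lua script returns an array of results, which is converted into a JSON-compatible structure in Python. The value in response.text looks correct: it is the JSON encoding of that array, so the quotes and newlines inside each HTML string are escaped.

The HTML to pass to LinkExtractor is in json.loads(response.text)[i]['html'], for i from 0 to 2.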

In [1]: import json

In [2]: text = u'[{"html": "<html op=\\"news\\"><head><meta name=\\"referrer\\" content=\\"origin\\"><meta name=\\"viewport\\
   ...: " content=\\"width=device-width, initial-scale=1.0\\"><link rel=\\"stylesheet\\" type=\\"text/css\\" href=\\"news.css
   ...: ?i7azI8MkFRPfcPhHQ7HD\\">\\n <link rel=\\"shortcut icon\\" href=\\"favicon.ico\\">\\n <link rel=\\"alternate\\" type=
   ...: \\"application/rss+xml\\" title=\\"RSS\\" href=\\"rss\\">\\n <title>Hacker News</title></head>"}]'

In [3]: print(text)
[{"html": "<html op=\"news\"><head><meta name=\"referrer\" content=\"origin\"><meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"><link rel=\"stylesheet\" type=\"text/css\" href=\"news.css?i7azI8MkFRPfcPhHQ7HD\">\n <link rel=\"shortcut icon\" href=\"favicon.ico\">\n <link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"rss\">\n <title>Hacker News</title></head>"}]

In [4]: print(json.dumps(json.loads(text), indent=2))
[
  {
    "html": "<html op=\"news\"><head><meta name=\"referrer\" content=\"origin\"><meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"><link rel=\"stylesheet\" type=\"text/css\" href=\"news.css?i7azI8MkFRPfcPhHQ7HD\">\n <link rel=\"shortcut icon\" href=\"favicon.ico\">\n <link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"rss\">\n <title>Hacker News</title></head>"
  }
]

In [5]: print(json.loads(text)[0]['html'])
<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?i7azI8MkFRPfcPhHQ7HD">
 <link rel="shortcut icon" href="favicon.ico">
 <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
 <title>Hacker News</title></head>
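The session above decodes only the first snapshot. As a rough sketch of the same idea applied to every element of the array, the following uses only the standard library; `HrefCollector`, `hrefs_per_snapshot`, and the sample `raw` string are hypothetical names and data made up for illustration, and the `html.parser` based collector is a simple stand-in for Scrapy's LinkExtractor, not a replacement for it.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def hrefs_per_snapshot(text, base_url):
    """Decode the JSON array first, then parse each decoded HTML string."""
    result = []
    for page in json.loads(text):
        collector = HrefCollector()
        collector.feed(page["html"])  # real markup, no JSON escaping left
        result.append([urljoin(base_url, href) for href in collector.hrefs])
    return result


# Stand-in for response.text: a JSON array with one object per snapshot.
raw = '[{"html": "<a href=\\"item?id=1\\">a</a>"}, {"html": "<a href=\\"item?id=2\\">b</a>"}]'
base_url = "https://news.ycombinator.com/news?p=1"

print(hrefs_per_snapshot(raw, base_url))
# [['https://news.ycombinator.com/item?id=1'], ['https://news.ycombinator.com/item?id=2']]
```

The point of the sketch is the order of operations: the JSON layer is removed once with `json.loads`, and only the decoded `html` values are treated as HTML.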

You're a lifesaver! It works great. However, I first had to construct a TextResponse from the html before passing it to LinkExtractor (to get it to work).
Lua script from the question:

local treat = require("treat")  -- provides treat.as_array

-- Load one page and return its rendered HTML in a table.
function page_info(splash, url)
   splash:go(url)
   local res = {
     html=splash:html(),
     }
   return res
end

-- Visit three Hacker News pages and return one HTML snapshot per page.
function main(splash, args)
   local base = "https://news.ycombinator.com/news?p="
   -- Mark the table as an array so it is serialized as a JSON array.
   local result = treat.as_array({})
   for i=1,3 do
      local url =  base .. i
      result[i] = page_info(splash, url)
   end
   return result
end

response.data from the question:

{u'html': u'<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?i7azI8MkFRPfcPhHQ7HD">\n <link rel="shortcut icon" href="favicon.ico">\n <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">\n <title>Hacker News</title>

response.text and response.body from the question:

u'[{"html": "<html op=\\"news\\"><head><meta name=\\"referrer\\" content=\\"origin\\"><meta name=\\"viewport\\" content=\\"width=device-width, initial-scale=1.0\\"><link rel=\\"stylesheet\\" type=\\"text/css\\" href=\\"news.css?i7azI8MkFRPfcPhHQ7HD\\">\\n <link rel=\\"shortcut icon\\" href=\\"favicon.ico\\">\\n <link rel=\\"alternate\\" type=\\"application/rss+xml\\" title=\\"RSS\\" href=\\"rss\\">\\n <title>Hacker News</title></head>