
Scraping Google Analytics with Scrapy (Python)


I've been trying to get some data out of Google Analytics with Scrapy, and even though I'm a complete Python novice I've made some progress. I can now log into Google Analytics through Scrapy, but I need to make an AJAX request to get the data I want. I tried to replicate my browser's HTTP request headers with the code below, but it doesn't seem to work; my error log shows:

too many values to unpack

Can anyone help? I've been at this for two days; I feel like I'm close, but I'm also very confused.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
import logging
from super.items import SuperItem
from scrapy.shell import inspect_response
import json

class LoginSpider(BaseSpider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier']

    def parse(self, response):
        # Step 1: submit the email form
        return [FormRequest.from_response(response,
                    formdata={'Email': 'Email'},
                    callback=self.log_password)]

    def log_password(self, response):
        # Step 2: submit the password form
        return [FormRequest.from_response(response,
                    formdata={'Passwd': 'Password'},
                    callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        else:
            # We've successfully authenticated, let's have some fun!
            print("Login Successful!!")
            return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
                   method='POST',
                   headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                             'Galaxy-Ajax': 'true',
                             'Origin': 'https://analytics.google.com',
                             'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
                             'User-Agent': 'My-user-agent',
                             'X-GAFE4-XSRF-TOKEN': 'Mytoken'}],
                   callback=self.parse_tastypage, dont_filter=True)

    def parse_tastypage(self, response):
        data = json.loads(response.body)
        inspect_response(response, self)
        item = SuperItem()
        # TODO: populate the item from `data`
        yield item
And here is part of the log:

2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-28 19:11:39 [scrapy] INFO: Spider opened
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None)
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr)
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth>
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_login
    callback=self.parse_tastypage, dont_filter=True)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__
    self.headers = Headers(headers or {}, encoding=encoding)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__
    super(Headers, self).__init__(seq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__
    self.update(seq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in update
    super(CaselessDict, self).update(iseq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr>
    iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished)
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6419,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 75986,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033),
 'log_count/DEBUG': 6,

Your error is because headers needs to be a dict, not a list containing a dict:

headers={'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
         'Galaxy-Ajax': 'true',
         'Origin': 'https://analytics.google.com',
         'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
         'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
         },
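
For the record, the "too many values to unpack" happens because Scrapy iterates the headers argument as (key, value) pairs: with the dict wrapped in a list, it tries to unpack the whole dict into a single pair, and unpacking a dict yields its keys. A minimal reproduction in plain Python (no Scrapy needed):

# CaselessDict.update() runs: ((normkey(k), normvalue(v)) for k, v in seq)
# With seq = [<dict>], each iteration unpacks the whole dict into (k, v).
headers = [{'Content-Type': 'text/html',
            'Galaxy-Ajax': 'true',
            'Origin': 'https://analytics.google.com'}]
for k, v in headers:  # k, v = <3-key dict> -> three keys, two targets
    print(k, v)
# ValueError: too many values to unpack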
This will solve your current problem, but you'll then get a 411, because you also need to specify a Content-Length. If you add what you actually want to extract, I'll be able to show you how to proceed. You can see it in the following output:

2016-03-29 14:02:11 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-29 14:02:13 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-29 14:02:14 [scrapy] DEBUG: Crawled (411) <POST https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0> (referer: https://analytics.google.com/analytics/web/?hl=fr&pli=1)
2016-03-29 14:02:14 [scrapy] DEBUG: Ignoring response <411 https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0>: HTTP status code is not handled or not allowed
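
One way around the 411 (a sketch, not tested against Google Analytics): give the POST a body so Scrapy sends a Content-Length. A FormRequest with a formdata payload does that, and it also sets the urlencoded Content-Type itself. The formdata below is a made-up placeholder for whatever the browser actually posts:

# Sketch only: capture the real POST body from the browser's network tab
# and put it in formdata -- the key/value here is a hypothetical placeholder.
return FormRequest(
    url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
    formdata={'placeholder': 'value'},  # hypothetical payload
    headers={'Galaxy-Ajax': 'true',
             'Origin': 'https://analytics.google.com',
             'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
             'User-Agent': 'My-user-agent',
             'X-GAFE4-XSRF-TOKEN': 'Mytoken'},
    callback=self.parse_tastypage, dont_filter=True)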

I'm trying to get some data that can't be obtained through the API. — Thanks Padraic, I owe you a beer! I changed the HTTP request headers and it finally worked. — @gerardbaste, no problem, glad you got it sorted; happy analysing.