Getting data via an Ajax POST request, but receiving a 403


I inspected the request and its details on the site using Chrome's F12 developer tools and Postman.

Log in with email jianguo.bai@hirebigdata.cn and password wsc111111, then go to http://www.zhihu.com/people/hynuza/columns/followed.

I want to get all the columns that Hynuza follows, currently 105 of them. When the page first opens, only 20 are shown, and I have to scroll down to load more. Each time I scroll down, the details of the request look like this:

Remote Address:60.28.215.70:80
Request URL:http://www.zhihu.com/node/ProfileFollowedColumnsListV2
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate
Accept-Language:en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4
Connection:keep-alive
Content-Length:157
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:_xsrf=f1460d2580fbf34ccd508eb4489f1097; q_c1=867d4a58013241b7b5f15b09bbe7dc79|1419217763000|1413335199000; c_c=2a45b1cc8f3311e4bc0e52540a3121f7; q_c0="MTE2NmYwYWFlNmRmY2NmM2Q4OWFkNmUwNjU4MDQ1OTN8WXdNUkVxRDVCMVJaODNpOQ==|1419906156|cb0859ab55258de9ea95332f5ac02717fcf224ea"; __utma=51854390.1575195116.1419486667.1419902703.1419905647.11; __utmb=51854390.7.10.1419905647; __utmc=51854390; __utmz=51854390.1419905647.11.9.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/hynuza/columns/followed; __utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1
Host:www.zhihu.com
Origin:http://www.zhihu.com
Referer:http://www.zhihu.com/people/hynuza/columns/followed
User-Agent:Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36
X-Requested-With:XMLHttpRequest
Form Data
method:next
params:{"offset":20,"limit":20,"hash_id":"18c79c6cc76ce8db8518367b46353a54"}
_xsrf:f1460d2580fbf34ccd508eb4489f1097
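One detail worth checking in the form data above: `params` is itself a JSON string inside a URL-encoded body. Passing a nested Python dict straight to `urllib.urlencode`, as the spider code later in this question does, serializes the dict's `repr()` rather than JSON, which the server will not parse. A minimal sketch of building the body correctly (Python 3 names shown; in Python 2 the equivalents are `urllib.urlencode` and `urlparse.parse_qs`):

```python
import json
from urllib.parse import urlencode, parse_qs

# The inner "params" value must be a JSON string; urlencode() on a
# nested dict would emit the dict's repr() instead of JSON.
params = {"offset": 20, "limit": 20,
          "hash_id": "18c79c6cc76ce8db8518367b46353a54"}
body = urlencode({
    "method": "next",
    "params": json.dumps(params, separators=(",", ":")),  # compact, as Chrome sends it
    "_xsrf": "f1460d2580fbf34ccd508eb4489f1097",
})

# Round-trip check: this is what the server sees after decoding.
decoded = parse_qs(body)
print(json.loads(decoded["params"][0])["offset"])  # -> 20
```

This matches the `Content-Type: application/x-www-form-urlencoded` request that the browser sends.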
Then I simulated the request in Postman, with exactly the same headers and form data as above.

It returned exactly what I wanted; it even worked when I was logged out of the site.

Based on all this, I wrote my spider as follows:

# -*- coding: utf-8 -*-
import scrapy
import urllib
from scrapy.http import Request


class PostSpider(scrapy.Spider):
    name = "post"
    allowed_domains = ["zhihu.com"]
    start_urls = (
        'http://www.zhihu.com',
    )

    def __init__(self):
        super(PostSpider, self).__init__()

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'email': 'jianguo.bai@hirebigdata.cn', 'password': 'wsc111111'},
            callback=self.login,
        )

    def login(self, response):
        yield Request("http://www.zhihu.com/people/hynuza/columns/followed",
                      callback=self.parse_followed_columns)

    def parse_followed_columns(self, response):
        # here deal with the first 20 divs
        params = {"offset": "20", "limit": "20", "hash_id": "18c79c6cc76ce8db8518367b46353a54"}
        method = 'next'
        _xsrf = 'f1460d2580fbf34ccd508eb4489f1097'
        data = {
            'params': params,
            'method': method,
            '_xsrf': _xsrf,
        }
        r = Request(
            "http://www.zhihu.com/node/ProfileFollowedColumnsListV2",
            method='POST',
            body=urllib.urlencode(data),
            headers={
                'Accept': '*/*',
                'Accept-Encoding': 'gzip,deflate',
                'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'Cache-Control': 'no-cache',
                'Cookie': '_xsrf=f1460d2580fbf34ccd508eb4489f1097; '
                          'c_c=2a45b1cc8f3311e4bc0e52540a3121f7; '
                          '__utmt=1; '
                          '__utma=51854390.1575195116.1419486667.1419855627.1419902703.10; '
                          '__utmb=51854390.2.10.1419902703; '
                          '__utmc=51854390; '
                          '__utmz=51854390.1419855627.9.8.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/;'
                          '__utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1;',
                'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36',
                'host': 'www.zhihu.com',
                'Origin': 'http://www.zhihu.com',
                'Connection': 'keep-alive',
                'X-Requested-With': 'XMLHttpRequest',
            },
            callback=self.parse_more)
        r.headers['Cookie'] += response.request.headers['Cookie']
        print r.headers
        yield r
        print "after"

    def parse_more(self, response):
        # here is where I want to get the returned divs
        print response.url
        followers = response.xpath("//div[@class='zm-profile-card "
                                   "zm-profile-section-item zg-clear no-hovercard']")
        print len(followers)
Then I got a 403, like this:

2014-12-30 10:34:18+0800 [post] DEBUG: Crawled (403) <POST http://www.zhihu.com/node/ProfileFollowedColumnsListV2> (referer: http://www.zhihu.com/people/hynuza/columns/followed)
2014-12-30 10:34:18+0800 [post] DEBUG: Ignoring response <403 http://www.zhihu.com/node/ProfileFollowedColumnsListV2>: HTTP status code is not handled or not allowed
So it never reaches parse_more.


I've been working on this for two days and still haven't got it working. Any help or advice would be appreciated.

The login sequence is correct. However, the parse_followed_columns method completely breaks the session.

You cannot use hardcoded values for data['_xsrf'] and params['hash_id'].

You should find a way to read this information directly from the HTML content of the previous page and inject the values dynamically.
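For example, `_xsrf` is typically embedded as a hidden form input and `hash_id` in a data attribute on the profile page; the exact markup below is an assumption, so verify the patterns against the real page source. A regex-based sketch of the extraction:

```python
import re

# Hypothetical snippet of the previous page's HTML; the real attribute
# names and structure must be checked in the actual page source.
html = '''
<input type="hidden" name="_xsrf" value="f1460d2580fbf34ccd508eb4489f1097"/>
<div class="zh-general-list" data-init='{"params": {"hash_id": "18c79c6cc76ce8db8518367b46353a54"}}'></div>
'''

def extract_tokens(page):
    """Pull _xsrf and hash_id out of the page instead of hardcoding them."""
    xsrf = re.search(r'name="_xsrf" value="([^"]+)"', page).group(1)
    hash_id = re.search(r'"hash_id":\s*"([^"]+)"', page).group(1)
    return xsrf, hash_id

xsrf, hash_id = extract_tokens(html)
print(xsrf)     # -> f1460d2580fbf34ccd508eb4489f1097
print(hash_id)  # -> 18c79c6cc76ce8db8518367b46353a54
```

In the spider this would be called on `response.body` inside parse_followed_columns, so each crawl uses the tokens the server actually issued for that session.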


Also, I suggest you remove the headers parameter from this request; it only causes problems.

I think you should not post your credentials here.

@NaingLinAung It's okay, this account is only for testing; using the test account may save you some time. I tried what you said and now read the xsrf and hash_id from the previous page; I had hardcoded them there only for simplicity.