Python Scrapy does not scrape data from LinkedIn after logging in with credentials
I am trying to scrape the member list of a LinkedIn group I have joined, but when I run the code I get no response/values, just a pile of errors. I have checked my parsing code and it looks fine. Here is my code:
import scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.http import FormRequest

class LoginSpider(BaseSpider):
    name = 'jiju'
    start_urls = ['https://www.linkedin.com/groups/58888/members']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'myusername', 'password': 'mypassword'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        else:
            return Request(url="http://www.example.com/tastypage/",
                           callback=self.parse_tastypage)

    def parse_tastypage(self, response):
        for j in response.xpath('//*[@id="ember2299"]'):
            yield {
                'detail': j.xpath('//*[@id="ember2299"]/span').extract(),
            }
This is the output I get:
C:\Users\Yash\tutorial\tutorial\spiders\jij.py:1: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
from scrapy.contrib.spiders.init import InitSpider
C:\Users\Yash\tutorial\tutorial\spiders\jij.py:1: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders.init` is deprecated, use `scrapy.spiders.init` instead
from scrapy.contrib.spiders.init import InitSpider
C:\Users\Yash\tutorial\tutorial\spiders\jij.py:6: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
from scrapy.spider import BaseSpider
C:\Users\Yash\tutorial\tutorial\spiders\jiju.py:7: ScrapyDeprecationWarning: tutorial.spiders.jiju.LoginSpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
class LoginSpider(BaseSpider):
2018-08-03 00:51:07 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tutorial)
2018-08-03 00:51:07 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders']}
2018-08-03 00:51:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-08-03 00:51:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-03 00:51:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-03 00:51:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-03 00:51:07 [scrapy.core.engine] INFO: Spider opened
2018-08-03 00:51:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-03 00:51:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-03 00:51:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.linkedin.com/uas/login?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Fgroups%2F58888%2Fmembers> from <GET https://www.linkedin.com/groups/58888/members>
2018-08-03 00:51:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.linkedin.com/start/join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Fgroups%2F58888%2Fmembers&trk=login_reg_redirect> from <GET https://www.linkedin.com/uas/login?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Fgroups%2F58888%2Fmembers>
2018-08-03 00:51:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.linkedin.com/start/join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Fgroups%2F58888%2Fmembers&trk=login_reg_redirect> (referer: None)
2018-08-03 00:51:08 [scrapy.core.engine] DEBUG: Crawled (422) <POST https://www.linkedin.com/start/reg/api/createAccount> (referer: https://www.linkedin.com/start/join?session_redirect=https%3A%2F%2Fwww.linkedin.com%2Fgroups%2F58888%2Fmembers&trk=login_reg_redirect)
2018-08-03 00:51:08 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <422 https://www.linkedin.com/start/reg/api/createAccount>: HTTP status code is not handled or not allowed
2018-08-03 00:51:08 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-03 00:51:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2810,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 20952,
'downloader/response_count': 4,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 2,
'downloader/response_status_count/422': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 8, 2, 19, 21, 8, 574170),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/422': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 8,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2018, 8, 2, 19, 21, 7, 742810)}
2018-08-03 00:51:08 [scrapy.core.engine] INFO: Spider closed (finished)
The problem:

Scrapy starts by requesting your start_urls entry, in your case https://www.linkedin.com/groups/58888/members. Since this request is made while you are not logged in, LinkedIn redirects you to https://www.linkedin.com/start/join, which is the page for creating a new account. Your parse function then tries to find a form on this page and fill in input fields named username and password with your credentials. Because the sign-up form does contain a password field, Scrapy posts the form with your data to https://www.linkedin.com/start/reg/api/createAccount, which fails; that is why LinkedIn returns the 422 error.
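One way to see this for yourself is to list the forms Scrapy finds on the page it actually received, the way FormRequest.from_response sees them. A stdlib-only diagnostic sketch (the HTML snippet below is a hypothetical stand-in for the sign-up page from the log; inside Scrapy you would inspect response.xpath('//form') instead):

```python
# Diagnostic sketch: list every form and its input names on a page,
# roughly the way FormRequest.from_response discovers them.
from html.parser import HTMLParser

class FormLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.forms = []  # list of (form action, [input names])

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.forms.append((attrs.get('action'), []))
        elif tag == 'input' and self.forms:
            self.forms[-1][1].append(attrs.get('name'))

# Stand-in resembling what the log shows Scrapy actually received:
# the sign-up page, whose form posts to /start/reg/api/createAccount.
page = """
<form action="/start/reg/api/createAccount">
  <input name="firstName"><input name="password">
</form>
"""
lister = FormLister()
lister.feed(page)
print(lister.forms)
# A 'password' input here is why from_response happily filled this form.
```

Running the same check inside `scrapy shell 'https://...'` on the real response shows which form your spider is about to submit, and to which action URL.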
The solution:

Make sure you are logged in before making any other request to LinkedIn. To do that, your start_urls should contain the login page. Since the login form on LinkedIn does not use fields named username and password, you have to change them accordingly; open the login page in your browser and inspect the form to find the actual field names (session_key and session_password):
class LoginSpider(BaseSpider):
    name = 'jiju'
    start_urls = ['https://www.linkedin.com/uas/login']

    def parse(self, response):
        return FormRequest.from_response(response,
            formdata={'session_key': 'your_login', 'session_password': 'your_pass'},
            callback=self.after_login)

    def after_login(self, response):
        return Request(url="https://www.linkedin.com/groups/58888/members",
                       callback=self.parse_members)
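Note that after_login above jumps straight to the members page; it is safer to first confirm the login actually worked. A minimal sketch of one possible check (the URL markers are an assumption about LinkedIn's behaviour, not a documented contract: a failed login typically redirects back to a login or checkpoint URL):

```python
def login_succeeded(response_url: str) -> bool:
    """Heuristic login check: assume a failed LinkedIn login redirects
    back to a login or checkpoint/challenge URL, while a successful one
    does not. These markers are an assumption, not a documented API."""
    bad_markers = ('/uas/login', '/checkpoint/', '/login')
    return not any(marker in response_url for marker in bad_markers)

# How this would plug into the spider (sketch):
# def after_login(self, response):
#     if not login_succeeded(response.url):
#         self.log("Login failed: %s" % response.url)
#         return
#     return Request(url="https://www.linkedin.com/groups/58888/members",
#                    callback=self.parse_members)

print(login_succeeded("https://www.linkedin.com/feed/"))     # True
print(login_succeeded("https://www.linkedin.com/uas/login")) # False
```

Checking response.url is more reliable here than searching response.body for an error string, since the failure shows up as a redirect rather than as fixed text on the page.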