
Python Scrapy is returning content from a different webpage

Tags: Python, Python 3.x, Web Scraping, Scrapy

I am trying to scrape bout data from Tapology.com, but the content I get back through Scrapy is from a completely different webpage. For example, I want to extract the fighter names from the following link:

So I open the Scrapy shell with:

scrapy shell 'https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii'
Then I try to extract the fighter names with the following code:

response.css('.fighterNames ::text').getall()
I get this as a result:

['\n', '\n', '\n', 'Billy Ayash', '\n', '\n', '\n', 'Dennis Reed', '\n', '\n', '\n', '\n', '"The Punisher"', '\n', '\n', '\n']

As you can see on the webpage, if you inspect the HTML, the names returned should be "Robbie Lawler" and "Rory MacDonald". Even stranger, Scrapy returns different content every time I test this page in the shell environment. It does not always return the content of the Billy Ayash vs. Dennis Reed bout page.

Is there something wrong with Scrapy? Is there something wrong with Tapology.com? Any help would be greatly appreciated! I have used Scrapy on ufcstats.com without any problems before or after these tests.
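For reference, a minimal way to check what Scrapy actually sent and downloaded, from inside the same shell session (both response and view are provided by the shell itself):

# The User-Agent header Scrapy sent with the request;
# by default it is something like b'Scrapy/2.4.1 (+https://scrapy.org)'
response.request.headers.get('User-Agent')

# Open the HTML Scrapy actually downloaded in a local browser,
# to compare it with the page seen when visiting the site directly
view(response)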

Here is the complete shell session:

(base) davidwismer@Davids-MacBook-Pro ~ % scrapy shell 'https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii'
2021-03-03 17:18:03 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-03-03 17:18:03 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.15.7-x86_64-i386-64bit
2021-03-03 17:18:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-03 17:18:03 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2021-03-03 17:18:03 [scrapy.extensions.telnet] INFO: Telnet Password: b44d20b5d1bbeb73
2021-03-03 17:18:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-03 17:18:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-03 17:18:04 [scrapy.core.engine] INFO: Spider opened
2021-03-03 17:18:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii> (referer: None)
2021-03-03 17:18:05 [asyncio] DEBUG: Using selector: KqueueSelector
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fc4d97c5730>
[s]   item       {}
[s]   request    <GET https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii>
[s]   response   <200 https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii>
[s]   settings   <scrapy.settings.Settings object at 0x7fc4d97c5e50>
[s]   spider     <DefaultSpider 'default' at 0x7fc4d9e26100>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2021-03-03 17:18:05 [asyncio] DEBUG: Using selector: KqueueSelector
In [1]: response.css('.fighterNames ::text').getall()
Out[1]: 
['\n',
 '\n',
 '\n',
 'Billy Ayash',
 '\n',
 '\n',
 '\n',
 'Dennis Reed',
 '\n',
 '\n',
 '\n',
 '\n',
 '"The Punisher"',
 '\n',
 '\n',
 '\n']

I tested it with requests and got the same result.

However, when I set the User-Agent header to a different value (the value in the example below is taken from my web browser), I get a valid result. Here is the code:

from requests import get
from bs4 import BeautifulSoup


def get_names(use_user_agent: bool):
    if use_user_agent:
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}
    else:
        headers = {}
    r = get('https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii', headers=headers)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, features='html.parser')
    names = soup.select('.fighterNames span')
    print('Names:')
    for n in names:
        print(n.text.strip())
    print('---')


if __name__ == '__main__':
    print('Without user agent:')
    for i in range(3):
        get_names(False)
    print('\nWith user agent:')
    for i in range(3):
        get_names(True)
Output:

Without user agent:
Names:
Jared Downing
Danny Tims
"Demon Eyes"

---
Names:
Allen Hope
Mike Kent
"Bunzy"

---
Names:
Paweł Sikora
Patryk Domke
"Ponczek"
"Patrykos"
---

With user agent:
Names:
Robbie Lawler
Rory MacDonald
"Ruthless"
"Red King"
---
Names:
Robbie Lawler
Rory MacDonald
"Ruthless"
"Red King"
---
Names:
Robbie Lawler
Rory MacDonald
"Ruthless"
"Red King"
---

Thanks, this solved my problem. In the Scrapy shell I was not providing any user agent information, but in my actual spider code I do provide it. I ran it in the spider and, boom, the correct content shows up. I did find that this particular site may have some anti-scraping measures.
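For reference, a minimal sketch of how the same User-Agent could be supplied on the Scrapy side (the spider name below is hypothetical; the UA string is the one from the answer above):

# In the shell, a browser-like user agent can be passed as a setting override:
#   scrapy shell -s USER_AGENT='Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0' '<bout URL>'

# In a spider, it can go into custom_settings (or onto each Request's headers):
import scrapy

class BoutSpider(scrapy.Spider):
    name = 'bouts'  # hypothetical spider name
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0',
    }
    start_urls = [
        'https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii',
    ]

    def parse(self, response):
        # Same selector as in the question, with whitespace-only text nodes dropped
        names = [t.strip() for t in response.css('.fighterNames ::text').getall() if t.strip()]
        yield {'names': names}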