Python 3.x: getting children when web scraping with BeautifulSoup
Tags: python-3.x, web-scraping, beautifulsoup


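I am scraping a website with BeautifulSoup:

from requests import get
from bs4 import BeautifulSoup

CHN = "https://ncov.dxy.cn/ncovh5/view/pneumonia?scene=2&clicktime=1579582238&enterid=1579582238&from=singlemessage&isappinstalled=0"
response3 = get(CHN, headers=headers)  # headers: request headers dict defined elsewhere
response3.encoding = 'utf-8'

I scrape all the content from the page:

html_soup3 = BeautifulSoup(response3.text, 'html.parser')

Then I look for the script with a given ID: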

scripts = html_soup3.find_all('script', id='getAreaStat')
print(scripts)


Out[64]: [<script id="getAreaStat">try { window.getAreaStat = [{"provinceName":"湖北省","provinceShortName":"湖北","currentConfirmedCount":2895,"confirmedCount":67801,"suspectedCount":0,"curedCount":61732,"deadCount":3174,"comment":"","locationId":420000,"statisticsData":"https://file1.dxycdn.com/2020/0223/618/3398299751673487511-135.json","cities":[{"cityName":"武汉","currentConfirmedCount":2880,"confirmedCount":50006,"suspectedCount":0,"curedCount":44591,"deadCount":2535,"locationId":420100},{"cityName":"孝感","currentConfirmedCount":4,"confirmedCount":3518,"suspectedCount":0,"curedCount":3386,"deadCount":128,"locationId":420900},

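I would like to know how to get a dictionary containing provinceName and its children.

You can run a regex over the response text to pull out the appropriate string, then use the ast library to convert it to a dict: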

import ast, re

#r = response text appropriately encoded
p = re.compile(r'window\.getAreaStat = \[(.*?)\]}catch')
data = p.findall(r)[0]
print(ast.literal_eval(data))
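About the regex: it matches the literal text window.getAreaStat = [, lazily captures everything that follows, and stops at the first ]}catch, which is where the array ends in the script.

Explanation: the captured text is a comma-separated run of province objects, so with several provinces ast.literal_eval returns a tuple of dicts, each holding provinceName and its cities list.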
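To make the regex step concrete, here is a minimal sketch that runs the same pattern on a shortened stand-in for the script text, so it needs no network access. The `sample` string is an assumption: the first province reuses a few fields from the output shown in the question, and the second province (`示例省`) is a made-up placeholder, not real data.

```python
import ast
import re

# Hypothetical, shortened stand-in for the text of <script id="getAreaStat">.
# The second province is a placeholder added only to show the multi-item case.
sample = (
    'try { window.getAreaStat = ['
    '{"provinceName":"湖北省","confirmedCount":67801,'
    '"cities":[{"cityName":"武汉","confirmedCount":50006}]},'
    '{"provinceName":"示例省","confirmedCount":0,"cities":[]}'
    ']}catch(e){}'
)

# Same pattern as in the answer: capture what sits between the outer brackets.
pattern = re.compile(r'window\.getAreaStat = \[(.*?)\]}catch')
captured = pattern.findall(sample)[0]

# The capture is "{...},{...}", which Python reads as a tuple of dict literals.
provinces = ast.literal_eval(captured)

for province in provinces:
    print(province["provinceName"], len(province["cities"]))
```

One caveat with this approach: if the page ever listed a single province, the capture would be one `{...}` literal and `ast.literal_eval` would return a plain dict instead of a tuple, so code that iterates over the result would need to handle that case.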

A more complete example (the encoding part is taken from @宏杰李):


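Comments:

Can you post the URL or response.text?

Dear bruvio, many thanks for this interesting example and task with BS.

Dear bruvio, dear QHarr: for in-depth learning it would be best to have one combined solution that includes the code. That would support a deeper understanding of the solution and would be good for all visitors to this thread. Many thanks.

Dear QHarr, thank you for your work here and for your support of SO. This is great. Keep it up, it rocks.

@bruvio Please define "doesn't work". The regex pattern correctly selects all the provinces.

@zero Added a complete example. Let me know if you need more information :-)

@zero I think I have provided all the information needed to reproduce the problem with the code. I have been trying to extract the information I described (the provinceNames). It would also help me to understand how to parse the string I get, or whether there is another way to extract the information, for example with json (which I tried).

Keep up the good work, thanks.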
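# Complete example: fetch the page, extract the getAreaStat payload, parse it
import ast
import re

import requests

res = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia?scene=2&clicktime=1579582238&enterid=1579582238&from=singlemessage&isappinstalled=0')
res.encoding = "GBK"  # encoding as given in the original answer
r = res.text
# Capture everything between the outer brackets of window.getAreaStat = [...]
p = re.compile(r'window\.getAreaStat = \[(.*?)\]}catch')
data = p.findall(r)[0]
# Parse the comma-separated province objects into Python dicts
print(ast.literal_eval(data))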
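Since the comments mention trying json, the following is a sketch of that route, not the answerer's method. It assumes the payload is valid JSON, which the output in the question suggests but does not prove. The helper name `parse_area_stat`, the `sample` string, and the placeholder province `示例省` are all hypothetical. The only change to the regex is that the capture group now includes the outer brackets, so `json.loads` receives a complete array.

```python
import json
import re

# Hypothetical, shortened stand-in for the script text; the city figures for
# 湖北省 are copied from the output in the question, the second province is
# a made-up placeholder.
sample = (
    'try { window.getAreaStat = ['
    '{"provinceName":"湖北省","deadCount":3174,"cities":['
    '{"cityName":"武汉","deadCount":2535},'
    '{"cityName":"孝感","deadCount":128}]},'
    '{"provinceName":"示例省","deadCount":0,"cities":[]}'
    ']}catch(e){}'
)


def parse_area_stat(script_text):
    """Return the getAreaStat array as a list of dicts (hypothetical helper)."""
    # The group includes the outer [ and ] so the capture is a full JSON array.
    match = re.search(r'window\.getAreaStat = (\[.*?\])}catch', script_text, re.S)
    if match is None:
        raise ValueError("getAreaStat payload not found")
    return json.loads(match.group(1))


provinces = parse_area_stat(sample)

# Dictionary keyed by provinceName, with the list of city dicts as the value.
by_province = {p["provinceName"]: p["cities"] for p in provinces}

for name, cities in by_province.items():
    print(name, [city["cityName"] for city in cities])
```

Compared with `ast.literal_eval`, `json.loads` always returns a list here, even for a single province, and it accepts the JSON literals `true`, `false`, and `null`, which `ast.literal_eval` rejects.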