Python: extracting array elements from HTML


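I am using urlopen and beautifulsoup4 to fetch the content of a web page. The page I am fetching generates some dynamic JavaScript blocks, and I want to extract the contents of an entire array.

The array is formatted as follows: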

<script type="text/javascript">
var jobmap = {};
jobmap[0]= {jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'};
jobmap[1]= {jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'};
</script>

The array contains an unknown number of elements.
How can I extract the contents of the entire array and save them to a JSON object?

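BeautifulSoup can only help with part of the problem: locating the script element that contains the desired objects. After that, you need a JavaScript parser or a regular expression, along these lines: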

import json
import re
from bs4 import BeautifulSoup


data = """
<script type="text/javascript">
var jobmap = {};
jobmap[0]= {jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'};
jobmap[1]= {jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'};
</script>"""

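soup = BeautifulSoup(data, "html.parser")
# Locate the <script> element whose text defines jobmap (skip scripts with no text).
script = soup.find("script", string=lambda s: s is not None and "var jobmap" in s)

# Capture the object literal assigned to each jobmap[N] entry.
pattern = re.compile(r"jobmap\[\d+\]\s*=\s*({.*?})")
for item in pattern.findall(script.string):
    print(item)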
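Note that these values cannot be loaded directly with json.loads(), because the keys are unquoted and the strings use single quotes. You will need another way to load the JavaScript object strings into Python dictionaries.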
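For data shaped exactly like the sample in the question, the json.loads() limitation noted above can be sidestepped with a second regular expression that pulls out each key/value pair. This is only a minimal sketch: the helper name jobmap_to_list is hypothetical, the sample is shortened, and it assumes every property is written as key:'value' with a plain single-quoted string. Numbers, nested objects, or escaped quotes would call for a real JavaScript parser.

```python
import json
import re

# Assumption: each entry looks like  jobmap[N]= {key:'value',...};
# and every value is a single-quoted string with no embedded quotes.
ENTRY = re.compile(r"jobmap\[(\d+)\]\s*=\s*\{(.*?)\};", re.DOTALL)
PAIR = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def jobmap_to_list(script_text):
    """Return the jobmap entries as a list of dicts, ordered by index."""
    entries = {}
    for index, body in ENTRY.findall(script_text):
        # PAIR.findall yields (key, value) tuples, which dict() accepts directly.
        entries[int(index)] = dict(PAIR.findall(body))
    return [entries[i] for i in sorted(entries)]


# Shortened sample based on the data in the question.
sample = """
var jobmap = {};
jobmap[0]= {jk:'929a2508c8bf2c9c',loc:'Oshawa, ON',zip:'',title:'Systems Analyst'};
jobmap[1]= {jk:'2d06bbaac441e7d2',loc:'Ontario',zip:'',title:'Decision Support Analyst'};
"""

jobs = jobmap_to_list(sample)
print(json.dumps(jobs, indent=2))
```

In the answer's script, the text of the located script element would be passed in place of sample. Because the result is a list of ordinary dictionaries, json.dumps() can serialize it without further conversion.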


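That is not an array. If you have dynamic content, BeautifulSoup and urlopen are the wrong approach to the problem.
@cricket_007 I think it depends. Sometimes the JavaScript content is present in the HTML (usually in script tags), so it makes sense to choose the "simple" urlopen/requests approach, which avoids the overhead and slowness of browser-based or JavaScript-engine-based approaches. It is generally more fragile, though. It may not be strictly "wrong", but more like "use with care and understanding".
@alecxe Fair points :)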
For reference, the script in the answer prints:

{jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'}
{jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'}