JSON, scraping a web page - Python


python json dictionary web-scraping


I'm using the requests and BeautifulSoup libraries in Python to scrape some web pages.

With some simple code I got the element I want:

<script>
data = {'user':{'id':1,'name':'joe','age':18,'email':'joe@hotmail.com'}}
</script>



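I want to get the email value from that variable, but when I ask for the tag's text, the whole element comes back in a list, and I can't convert it to JSON because it raises an error at some column. Any ideas?

I'd appreciate any help.

Something simple that might help you: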

import json
from bs4 import BeautifulSoup

html = """
<script>
data = {'user':{'id':1,'name':'joe','age':18,'email':'joe@hotmail.com'}}
</script>
"""

soup = BeautifulSoup(html, 'html.parser')
# The slice [7:] drops the leading `data = ` (7 characters);
# single quotes are then replaced with double quotes, which json.loads() requires
json_data = json.loads(soup.find('script').text.strip()[7:].replace("'", '"'))
print(json_data)
print(type(json_data))


Output

{'user': {'id': 1, 'name': 'joe', 'age': 18, 'email': 'joe@hotmail.com'}}
<class 'dict'>
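
Once the text is parsed, the email is an ordinary nested dictionary lookup. Because the snippet in the question uses single quotes, it is closer to a Python literal than to strict JSON, so `ast.literal_eval` is one possible alternative to swapping quotes (which would break on a value that contains an apostrophe). A minimal sketch, assuming `script_text` already holds the text pulled out of the script tag:

```python
import ast

# Assumed input: the text of the <script> tag shown in the question.
script_text = "data = {'user':{'id':1,'name':'joe','age':18,'email':'joe@hotmail.com'}}"

# Keep only what follows the first '=' instead of counting characters.
literal = script_text.split("=", 1)[1].strip()

# literal_eval parses Python literals (dicts, strings, numbers) without executing code.
data = ast.literal_eval(literal)

email = data["user"]["email"]
print(email)
```

This only works while the script content is also a valid Python literal; JavaScript values such as `true`, `false`, or `null` would make `literal_eval` fail, and in that case the content needs to be parsed as JSON instead.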


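You're getting close to what I want. I did the same thing and it gave me an error at a column. The code:

json_data = json.loads(soup.find_all('script')[3].text.strip()[21:])

The traceback passes through line 354, in loads (return _default_decoder.decode(s)), and ends with:

File "C:\Users\TOSHIBA\AppData\Local\Programs\Python36-32\lib\json\decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 3548 (char 3547)

Can you show the script tag you want to scrape?

I'm not sure I can post it here because it's too long.

This might help you; I think you have several dict objects in there.

It's probably because there is a ; at the end. Use something like [7:-1] instead of [7:]. Also, a more robust approach than these magic numbers is to take everything between the first { and the last } in the script tag's content and parse that as JSON.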
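
The brace-based suggestion can be sketched as follows. The function names and the sample strings are hypothetical, made up for illustration since the real script content was never posted. The second helper uses `json.JSONDecoder.raw_decode`, which parses one JSON value and reports where it stopped; that may be a better fit when the "Extra data" error comes from further statements following the first object.

```python
import json


def parse_between_braces(script_text):
    """Parse everything from the first '{' to the last '}' as JSON."""
    start = script_text.find("{")
    end = script_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no {...} object found in script text")
    return json.loads(script_text[start:end + 1])


def parse_first_object(script_text):
    """Parse only the first JSON object and ignore whatever follows it."""
    start = script_text.find("{")
    if start == -1:
        raise ValueError("no '{' found in script text")
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Hypothetical script contents (double-quoted, i.e. valid JSON inside).
single = 'var data = {"user": {"id": 1, "email": "joe@hotmail.com"}};'
multi = single + ' var other = {"x": 1};'

print(parse_between_braces(single)["user"]["email"])
print(parse_first_object(multi)["user"]["email"])
```

With `multi`, `parse_between_braces` would still raise an "Extra data" error, because the span between the first `{` and the last `}` covers two objects; `parse_first_object` stops after the first one.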