Python: extracting text from a JSON <script> tag when multiple JSON tags are present
Tags: python, web-scraping, beautifulsoup

I am trying to extract the rating, specifically the "ratingValue" and "alternateName" fields, from the following HTML code:
<script type=application/ld+json>{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "datePublished": "2019-01-03 ",
  "url": "https://www.truthorfiction.com/are-americans-annually-healthcare-undocumented/",
  "author": {
    "@type": "Organization",
    "url": "https://www.truthorfiction.com/",
    "image": "https://dn.truthorfiction.com/wp-content/uploads/2018/10/25032229/truth-or-fiction-logo-tagline.png",
    "sameAs": "https://twitter.com/whatstruecom"
  },
  "claimReviewed": "More Americans die every year from a lack of affordable healthcare than by terrorism or at the hands of undocumented immigrants.",
  "reviewRating": {
    "@type": "Rating",
    "ratingValue": -1,
    "worstRating": -1,
    "bestRating": -1,
    "alternateName": "True"
  },
  "itemReviewed": {
    "@type": "CreativeWork",
    "author": {
      "@type": "Person",
      "name": "Person",
      "jobTitle": "",
      "image": "",
      "sameAs": [
        ""
      ]
    },
    "datePublished": "",
    "name": ""
  }
}</script>
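However, tmp ends up holding the dictionary from an "application/ld+json" item that appears before the one containing the rating I want to extract. I would like to know how to loop through the scripts to reach the one where the rating is stored.

You need to access the elements by key: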
rating_value = tmp['reviewRating']['ratingValue'] # -1
alternate_name = tmp['reviewRating']['alternateName'] # 'True'
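Alternatively, note that the page has two "application/ld+json" script tags. You can take the one you need from the find_all() result by index (the second one, [1]), or loop over the tags and check whether the text contains the string: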
tmp = None
for ldjson in soup.find_all('script', type='application/ld+json'):
    if 'ratingValue' in ldjson.text:
        tmp = json.loads(ldjson.text)
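My problem is that the first JSON found by soup.find is assigned to tmp, rather than the JSON that contains rating_value and alternate_name, and I am not sure how to load the relevant JSON into tmp. Accessing the key on the first one raises:
KeyError: 'reviewRating'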
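The error comes from the first block simply not having that key. A minimal sketch of the situation, using only the standard library; the two JSON strings are hypothetical stand-ins for the page's two script tags (the "WebSite" payload in particular is an assumption, not taken from the real page):

```python
import json

# Hypothetical payloads standing in for the two ld+json script tags.
first_block = '{"@context": "http://schema.org", "@type": "WebSite"}'
second_block = (
    '{"@type": "ClaimReview", '
    '"reviewRating": {"ratingValue": -1, "alternateName": "True"}}'
)

# soup.find() returns only the first tag, so tmp is built from first_block.
tmp = json.loads(first_block)

try:
    tmp["reviewRating"]
except KeyError as err:
    # The first block has no "reviewRating" key.
    print("KeyError:", err)

# Checking for the key before indexing avoids the exception.
rating_value = None
alternate_name = None
for text in (first_block, second_block):
    data = json.loads(text)
    if "reviewRating" in data:
        rating_value = data["reviewRating"]["ratingValue"]
        alternate_name = data["reviewRating"]["alternateName"]
```

Testing for the key on the parsed dictionary is a little stricter than a substring check on the raw text, since it only matches a top-level "reviewRating" entry.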
You can use the find_all() method and then iterate over the collection of tags: all_scripts = soup.find_all('script').
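Iterating over the tags could look roughly like the sketch below. The HTML is a trimmed, hypothetical page with two ld+json blocks, and find_review_rating is an illustrative helper name, not part of BeautifulSoup. It reads each tag through .string rather than .text, which is a choice made here for the sketch.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down page: only the second ld+json block has a rating.
html = """
<script type="application/ld+json">{"@context": "http://schema.org", "@type": "WebSite"}</script>
<script type="application/ld+json">{"@type": "ClaimReview",
 "reviewRating": {"@type": "Rating", "ratingValue": -1, "alternateName": "True"}}</script>
"""


def find_review_rating(soup):
    """Return the first 'reviewRating' dict found in an ld+json script, else None."""
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            # .string is the tag's single text child; it is None for an empty tag.
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip empty or malformed blocks
        if isinstance(data, dict) and "reviewRating" in data:
            return data["reviewRating"]
    return None


soup = BeautifulSoup(html, "html.parser")
review_rating = find_review_rating(soup)
rating_value = review_rating["ratingValue"]
alternate_name = review_rating["alternateName"]
```

Returning None when nothing matches lets the caller tell "no rating on this page" apart from a parsing failure, instead of hitting a KeyError later.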
review_rating = tmp['reviewRating']
rating_value = review_rating['ratingValue'] # -1
alternate_name = review_rating['alternateName'] # 'True'
# The page has two ld+json scripts; index [1] selects the second one.
tmp = json.loads(soup.find_all('script', type='application/ld+json')[1].text)