How do I extract the content of a specific JavaScript tag in Python? (Python, BeautifulSoup)


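I am trying to extract the full schedule and results of the Dota 2 TI9 International from the schedule page. The information I am looking for sits in a script tag, under "schedule_data".

This is what I have so far: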

import requests, re, json
from bs4 import BeautifulSoup as bs
url = 'http://www.dota2.com/international/schedule/0/0/?l=english'
page = requests.get(url)
soup = bs(page.text,'html.parser')
all_javascript = soup.find_all(name='script',type='text/javascript')
all_javascript[:] = [x for x in all_javascript if(re.search("schedule_data",x.text))]  
data = all_javascript[0]
new_data = json.loads(data.text)

I find all the script tags and then search for the "schedule_data" pattern to identify the tag I need. However, the last line now fails with this error:

new_data = json.loads(data.text)
Traceback (most recent call last):

  File "<ipython-input-68-447d26a16d5b>", line 1, in <module>
    new_data = json.loads(data.text)

  File "C:\Users\templ\Anaconda3\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)

  File "C:\Users\templ\Anaconda3\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())

  File "C:\Users\templ\Anaconda3\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

JSONDecodeError: Expecting value

data.text returns a str, which as far as I know is the correct data type for json.loads.

Please help.
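A likely explanation for the traceback, sketched on a made-up script body: json.loads needs the entire string to be valid JSON, while the text of a script tag is JavaScript source that merely contains an object literal. The sample text and the match_id key below are assumptions for illustration, not the real page content.

```python
import json

# Hypothetical stand-in for the text of the matching <script> tag.
script_text = """
$( function() {
    $( '#ScheduleArea' ).tournamentSchedule( {"schedule_data": [{"match_id": 1}]} );
} );
"""

def try_json(text):
    """Return the parsed value, or the error message if decoding fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return f"JSONDecodeError: {exc.msg}"

# The whole script is JavaScript, so decoding fails at the first non-space character.
whole = try_json(script_text)

# Only the argument between "tournamentSchedule(" and ");" is JSON in this sample.
argument = script_text.split("tournamentSchedule(")[1].split(");")[0].strip()
part = try_json(argument)

print(whole)
print(part)
```

So the type (str) is fine; the content is what json.loads rejects. On the real page the object literal may not be strict JSON either (single-quoted strings, for example), in which case a different parser is needed for the isolated literal.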

Please try this:

import requests, re, json
import ast

from bs4 import BeautifulSoup as bs
url = 'http://www.dota2.com/international/schedule/0/0/?l=english'
page = requests.get(url)
soup = bs(page.text,'html.parser')
all_javascript = soup.find_all(name='script',type='text/javascript')

for x in all_javascript:
    # Only the script block that mentions schedule_data is of interest
    if re.search("schedule_data", x.text):
        # Keep the argument passed to tournamentSchedule(...) and remove newlines and tabs
        data = str(x).split("$( '#ScheduleArea' ).tournamentSchedule(")[1].split(');')[0].strip().replace('\n', '').replace('\t', '').replace('\r', '')
        # Parse the remaining text as a Python literal, giving a dict
        data_dict = ast.literal_eval(data)
        print(data_dict['schedule_data'])
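The one-line split chain can be unpacked step by step. This is a minimal sketch that runs offline on a shortened, made-up script tag; the sample string and the match_id and winner keys are assumptions, and the real page is much larger.

```python
import ast

# Hypothetical, shortened stand-in for str(x) of the matching <script> tag.
sample = (
    "<script type=\"text/javascript\">\n"
    "$( function() {\n"
    "\t$( '#ScheduleArea' ).tournamentSchedule( {\n"
    "\t\t'schedule_data': [ { 'match_id': 101, 'winner': 'Team A' } ]\n"
    "\t} );\n"
    "} );\n"
    "</script>"
)

marker = "$( '#ScheduleArea' ).tournamentSchedule("

# Step 1: keep everything after the opening parenthesis of the call.
after_call = sample.split(marker)[1]

# Step 2: keep everything before the first ');', i.e. the call's argument.
argument = after_call.split(");")[0]

# Step 3: drop surrounding whitespace and embedded newlines, tabs, carriage returns.
literal = argument.strip().replace("\n", "").replace("\t", "").replace("\r", "")

# Step 4: evaluate the text as a Python literal; no code is executed.
data_dict = ast.literal_eval(literal)
print(data_dict["schedule_data"])
```

Two caveats with this approach: splitting on ');' would cut the data short if any string value contained that sequence, and ast.literal_eval rejects JavaScript-only values such as true, false and null, so it only works while the literal also happens to be valid Python.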

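Can you tell me what kind of output you need? And why all_javascript[:] = ?

Please share the relevant HTML and the values in your program.

I need to extract the match ID, the start and end times, and the winner of each match. I am trying to get the match ID, winner and loser from this source, and then use {match_ID} to get the corresponding match start and end times. I want to put it all into a nice Excel sheet that looks like this: [Match ID, Start time (EDT), Date, Date, Match length, Team A, Team B, Winner]. Thanks :)

This works. I started reading about abstract syntax trees, but I am not sure I fully understand what the code does. Is there an ELI source for this?

No problem. Whenever the text matches in if re.search("schedule_data", x.text):, I split on $( '#ScheduleArea' ).tournamentSchedule( because what follows it is the dict you need, so I take the piece at index 1. Then I split again at the closing parenthesis of that $( '#ScheduleArea' ).tournamentSchedule( call, which is why the second split is on ');'. After that it is just the replace calls.

@vaidyanathanviswanathansanana If you find this answer acceptable, you should accept it, since the responder spent time and effort putting it together.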
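For the spreadsheet mentioned in the comments, one option is to write the parsed records to CSV, which Excel opens directly. This is only a sketch using the standard csv module: the records, the key names (match_id, team_a, team_b, winner) and the helper matches_to_csv are hypothetical, since the thread does not show the actual structure of schedule_data.

```python
import csv
import io

# Hypothetical records; the real keys inside schedule_data may differ.
matches = [
    {"match_id": 101, "team_a": "Team A", "team_b": "Team B", "winner": "Team A"},
    {"match_id": 102, "team_a": "Team C", "team_b": "Team D", "winner": "Team D"},
]

columns = ["match_id", "team_a", "team_b", "winner"]

def matches_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" skips keys that are not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = matches_to_csv(matches, columns)
print(csv_text)
```

To produce a file instead of a string, the same writer can be pointed at open("schedule.csv", "w", newline=""), where the filename is again just a placeholder.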
