Python: parsing a deeply nested JSON object embedded in a web page's source code


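I am trying to parse an item called matchCentreData, which can be found in the source code of the following page:

Because this page makes no XHR requests and the data item is buried in the page source, I am not sure how to parse it with anything other than a regular expression.

The data structure is deeply nested, so I have tried to break it into several sub-components and parse each one separately. Below is my code, which tries to parse the first sub-component, playerIdNameDictionary: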

import json
import simplejson
import requests
import jsonobject
import time
import re

url = 'http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United'
params = {}

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}


responser = requests.get(url, params=params, headers=headers)

regex = re.compile("matchCentreData = \{.*?\};", re.S)
match = re.search(regex, responser.text)
match2 = match.group()

match3 = match2[u'playerIdNameDictionary']
print match3
However, this produces the following error:

Traceback (most recent call last):
  File "C:\Python27\counter.py", line 23, in <module>
    match3 = match2[u'playerIdNameDictionary']
TypeError: string indices must be integers
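I think this is because the item I get back is a string rather than a JSON object. What I would like to know is:

1) Is my diagnosis above correct? 2) How can I parse the JSON/JavaScript object matchCentreData without using regular expressions?

I hope my question makes sense.

Thanks


match2 is just a string, not a JSON object. You can use match2 = json.loads(match2) to convert the string into a Python object. Wrap the json.loads call in a try/except block to catch errors in the source JSON.

More about json.loads()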
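A minimal sketch of that advice, assuming a made-up page_text that stands in for the real page source. The text matched by the question's regex still carries the "matchCentreData = " prefix and the trailing ";", so this sketch captures only the braces before decoding. The helper name extract_match_centre_data is hypothetical.

```python
import json
import re

# Hypothetical stand-in for responser.text; the real page is far larger.
page_text = 'var matchCentreData = {"playerIdNameDictionary": {"34693": "Marko Arnautovic"}};'


def extract_match_centre_data(text):
    """Return the decoded object, or None if it is missing or malformed."""
    # Capture only the {...} part, leaving out the variable name and the ';'.
    match = re.search(r"matchCentreData = (\{.*?\});", text, re.S)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


data = extract_match_centre_data(page_text)
print(data["playerIdNameDictionary"]["34693"])
```

Returning None for both "not found" and "not valid JSON" keeps the caller simple. Raising a specific exception would be just as reasonable if the two cases need to be told apart.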


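As I said in my comment below, your regexp is a bit too loose. It starts matching when it finds var matchCentreData = {…, but it keeps matching until the last JSON blob in response.text is finished. That is not something json.loads can handle. I have changed the code to: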

>>> regex = re.compile(r"var matchCentreData = (\{.+\});\r\n        var matchCentreEventTypeJson", re.S)
>>> match = re.search(regex, response.text)
>>> # match.group(1) now contains the match centre data JSON blob
>>> match_centre_data = json.loads(match.group(1))
>>> match_centre_data['playerIdNameDictionary']['34693']
'Marko Arnautovic'

Note that this kind of code is very brittle and may break whenever whoscored.com updates its website.
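To illustrate why the pattern is anchored on the next variable name, here is a small sketch on invented page text. The contents of page_text are an assumption and far smaller than the real source. A greedy pattern without the anchor runs on to the last "};" and hands json.loads two objects at once, while the anchored pattern stops after the first.

```python
import json
import re

# Invented two-variable page fragment; the real source contains much more.
page_text = (
    'var matchCentreData = {"playerIdNameDictionary": {"34693": "Marko Arnautovic"}};\r\n'
    '        var matchCentreEventTypeJson = {"shotSixYardBox": 0};\r\n'
)

# Greedy and unanchored: the capture extends to the last "}" before a ";".
loose = re.search(r"var matchCentreData = (\{.+\});", page_text, re.S)

# Greedy but anchored on the text that follows the object, as in the answer.
anchored = re.search(
    r"var matchCentreData = (\{.+\});\r\n        var matchCentreEventTypeJson",
    page_text,
    re.S,
)

try:
    json.loads(loose.group(1))
    loose_ok = True
except ValueError:
    # The capture holds two objects plus JavaScript in between.
    loose_ok = False

match_centre_data = json.loads(anchored.group(1))
print(loose_ok)
print(match_centre_data["playerIdNameDictionary"]["34693"])
```

The anchor depends on the exact line ending and indentation between the two variables, which is part of what makes the approach brittle.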

You can use BeautifulSoup to extract the script:

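import json
import re
from bs4 import BeautifulSoup

# r is the response returned by requests.get(url, headers=headers)
soup = BeautifulSoup(r.content)
data_cen = re.compile('var matchCentreData = ({.*?})')
data = soup.find("script",text=data_cen).text
# NOTE: dumps() followed by loads() round-trips the matched text, so data_dict is a str
d = json.dumps(data_cen.search(data).group(1))
data_dict  = (json.loads(d))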
{"playerIdNameDictionary":{"34693":"Marko Arnautovic","23122":"Asmir Begovic","39935":"Steven N'Zonzi","4145":"Robert Huth","3860":"Jonathan Walters","23446":"Marc Wilson","8505":"Glenn Whelan","29762":"Oussama Assaidi","24148":"Erik Pieters","26013":"Mame Biram Diouf","75177":"Marc Muniesa","38772":"Geoff Cameron","107395":"Jack Butland","29798":"Ryan Shawcross","3807":"Peter Crouch","8327":"Charlie Adam","18181":"Phil Bardsley","254558":"Oliver Shenton","130334":"Adnan Januzaj","4092":"Rafael","18701":"Falcao","10620":"Anders Lindegaard","4564":"Robin van Persie","25363":"Juan Mata","71174":"Ander Herrera","79554":"David de Gea","2115":"Michael Carrick","3859":"Wayne Rooney","8166":"Ashley Young","81726":"Phil Jones","118244":"Luke Shaw","137795":"Tyler Blackett","145271":"James Wilson","71345":"Chris Smalling","5835":"Darren Fletcher","22079":"Jonny Evans"}
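One thing worth checking in the snippet above: json.dumps applied to a string, followed by json.loads, gives the same string back rather than a dict. A non-greedy {.*?} also stops at the first closing brace, which would explain why the output shown ends with a single "}". A possible alternative, sketched here on invented script text, is to locate the opening brace and let json.JSONDecoder.raw_decode find where the object ends. The names script_text and decode_js_object are hypothetical.

```python
import json
import re

# Invented stand-in for the text of the <script> tag.
script_text = (
    'var matchCentreData = {"playerIdNameDictionary": {"34693": "Marko Arnautovic"}};\n'
    'var matchCentreEventTypeJson = {"shotSixYardBox": 0, "pos": 217};\n'
)

# dumps() then loads() on a str is a round trip: the result is still a str.
snippet = re.search(r"var matchCentreData = ({.*?})", script_text).group(1)
round_trip = json.loads(json.dumps(snippet))


def decode_js_object(text, var_name):
    """Decode the JSON object assigned to var_name, or return None if absent."""
    start = re.search(r"var %s = (?=\{)" % re.escape(var_name), text)
    if start is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so nested braces are handled correctly.
    obj, _end = json.JSONDecoder().raw_decode(text, start.end())
    return obj


data_dict = decode_js_object(script_text, "matchCentreData")
event_dict = decode_js_object(script_text, "matchCentreEventTypeJson")
print(type(round_trip).__name__, snippet)
print(data_dict["playerIdNameDictionary"]["34693"], event_dict["pos"])
```

This only works while the embedded value is strict JSON. A JavaScript literal with unquoted keys or single-quoted strings would still raise a decode error.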
You can also use find_next, together with a similar regex, to locate the script and extract the data you need:

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content)
data_cen = re.compile('var matchCentreData = ({.*?})')
event_type = re.compile('var matchCentreEventTypeJson = ({.*?})')

data = soup.find("a", href="/ContactUs").find_next("script").text
d = json.dumps(data_cen.search(data).group(1))
e = json.dumps(event_type.search(data).group(1))

data_dict = json.loads(d)
event_dict = json.loads(e)

{"playerIdNameDictionary":{"34693":"Marko Arnautovic","23122":"Asmir Begovic","39935":"Steven N'Zonzi","4145":"Robert Huth","3860":"Jonathan Walters","23446":"Marc Wilson","8505":"Glenn Whelan","29762":"Oussama Assaidi","24148":"Erik Pieters","26013":"Mame Biram Diouf","75177":"Marc Muniesa","38772":"Geoff Cameron","107395":"Jack Butland","29798":"Ryan Shawcross","3807":"Peter Crouch","8327":"Charlie Adam","18181":"Phil Bardsley","254558":"Oliver Shenton","130334":"Adnan Januzaj","4092":"Rafael","18701":"Falcao","10620":"Anders Lindegaard","4564":"Robin van Persie","25363":"Juan Mata","71174":"Ander Herrera","79554":"David de Gea","2115":"Michael Carrick","3859":"Wayne Rooney","8166":"Ashley Young","81726":"Phil Jones","118244":"Luke Shaw","137795":"Tyler Blackett","145271":"James Wilson","71345":"Chris Smalling","5835":"Darren Fletcher","22079":"Jonny Evans"}
{"shotSixYardBox":0,"shotPenaltyArea":1,"shotOboxTotal":2,"shotOpenPlay":3,"shotCounter":4,"shotSetPiece":5,"shotOffTarget":6,"shotOnPost":7,"shotOnTarget":8,"shotsTotal":9,"shotBlocked":10,"shotRightFoot":11,"shotLeftFoot":12,"shotHead":13,"shotObp":14,"goalSixYardBox":15,"goalPenaltyArea":16,"goalObox":17,"goalOpenPlay":18,"goalCounter":19,"goalSetPiece":20,"penaltyScored":21,"goalOwn":22,"goalNormal":23,"goalRightFoot":24,"goalLeftFoot":25,"goalHead":26,"goalObp":27,"shortPassInaccurate":28,"shortPassAccurate":29,"passCorner":30,"passCornerAccurate":31,"passCornerInaccurate":32,"passFreekick":33,"passBack":34,"passForward":35,"passLeft":36,"passRight":37,"keyPassLong":38,"keyPassShort":39,"keyPassCross":40,"keyPassCorner":41,"keyPassThroughball":42,"keyPassFreekick":43,"keyPassThrowin":44,"keyPassOther":45,"assistCross":46,"assistCorner":47,"assistThroughball":48,"assistFreekick":49,"assistThrowin":50,"assistOther":51,"dribbleLost":52,"dribbleWon":53,"challengeLost":54,"interceptionWon":55,"clearanceHead":56,"outfielderBlock":57,"passCrossBlockedDefensive":58,"outfielderBlockedPass":59,"offsideGiven":60,"offsideProvoked":61,"foulGiven":62,"foulCommitted":63,"yellowCard":64,"voidYellowCard":65,"secondYellow":66,"redCard":67,"turnover":68,"dispossessed":69,"saveLowLeft":70,"saveHighLeft":71,"saveLowCentre":72,"saveHighCentre":73,"saveLowRight":74,"saveHighRight":75,"saveHands":76,"saveFeet":77,"saveObp":78,"saveSixYardBox":79,"savePenaltyArea":80,"saveObox":81,"keeperDivingSave":82,"standingSave":83,"closeMissHigh":84,"closeMissHighLeft":85,"closeMissHighRight":86,"closeMissLeft":87,"closeMissRight":88,"shotOffTargetInsideBox":89,"touches":90,"assist":91,"ballRecovery":92,"clearanceEffective":93,"clearanceTotal":94,"clearanceOffTheLine":95,"dribbleLastman":96,"errorLeadsToGoal":97,"errorLeadsToShot":98,"intentionalAssist":99,"interceptionAll":100,"interceptionIntheBox":101,"keeperClaimHighLost":102,"keeperClaimHighWon":103,"keeperClaimLost":104,"keeperClaimWon":105
,"keeperOneToOneWon":106,"parriedDanger":107,"parriedSafe":108,"collected":109,"keeperPenaltySaved":110,"keeperSaveInTheBox":111,"keeperSaveTotal":112,"keeperSmother":113,"keeperSweeperLost":114,"keeperMissed":115,"passAccurate":116,"passBackZoneInaccurate":117,"passForwardZoneAccurate":118,"passInaccurate":119,"passAccuracy":120,"cornerAwarded":121,"passKey":122,"passChipped":123,"passCrossAccurate":124,"passCrossInaccurate":125,"passLongBallAccurate":126,"passLongBallInaccurate":127,"passThroughBallAccurate":128,"passThroughBallInaccurate":129,"passThroughBallInacurate":130,"passFreekickAccurate":131,"passFreekickInaccurate":132,"penaltyConceded":133,"penaltyMissed":134,"penaltyWon":135,"passRightFoot":136,"passLeftFoot":137,"passHead":138,"sixYardBlock":139,"tackleLastMan":140,"tackleLost":141,"tackleWon":142,"cleanSheetGK":143,"cleanSheetDL":144,"cleanSheetDC":145,"cleanSheetDR":146,"cleanSheetDML":147,"cleanSheetDMC":148,"cleanSheetDMR":149,"cleanSheetML":150,"cleanSheetMC":151,"cleanSheetMR":152,"cleanSheetAML":153,"cleanSheetAMC":154,"cleanSheetAMR":155,"cleanSheetFWL":156,"cleanSheetFW":157,"cleanSheetFWR":158,"cleanSheetSub":159,"goalConcededByTeamGK":160,"goalConcededByTeamDL":161,"goalConcededByTeamDC":162,"goalConcededByTeamDR":163,"goalConcededByTeamDML":164,"goalConcededByTeamDMC":165,"goalConcededByTeamDMR":166,"goalConcededByTeamML":167,"goalConcededByTeamMC":168,"goalConcededByTeamMR":169,"goalConcededByTeamAML":170,"goalConcededByTeamAMC":171,"goalConcededByTeamAMR":172,"goalConcededByTeamFWL":173,"goalConcededByTeamFW":174,"goalConcededByTeamFWR":175,"goalConcededByTeamSub":176,"goalConcededOutsideBoxGoalkeeper":177,"goalScoredByTeamGK":178,"goalScoredByTeamDL":179,"goalScoredByTeamDC":180,"goalScoredByTeamDR":181,"goalScoredByTeamDML":182,"goalScoredByTeamDMC":183,"goalScoredByTeamDMR":184,"goalScoredByTeamML":185,"goalScoredByTeamMC":186,"goalScoredByTeamMR":187,"goalScoredByTeamAML":188,"goalScoredByTeamAMC":189,"goalScoredByTeamAMR":190,"goalS
coredByTeamFWL":191,"goalScoredByTeamFW":192,"goalScoredByTeamFWR":193,"goalScoredByTeamSub":194,"aerialSuccess":195,"duelAerialWon":196,"duelAerialLost":197,"offensiveDuel":198,"defensiveDuel":199,"bigChanceMissed":200,"bigChanceScored":201,"bigChanceCreated":202,"overrun":203,"successfulFinalThirdPasses":204,"punches":205,"penaltyShootoutScored":206,"penaltyShootoutMissedOffTarget":207,"penaltyShootoutSaved":208,"penaltyShootoutSavedGK":209,"penaltyShootoutConcededGK":210,"throwIn":211,"subOn":212,"subOff":213,"defensiveThird":214,"midThird":215,"finalThird":216,"pos":217}
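To make the find_next step concrete, here is a sketch on an invented miniature page. The HTML is an assumption; only the ordering of the tags mirrors what the answer describes. The "html.parser" argument and the use of .string are choices made for this sketch, not part of the original answer.

```python
import json
from bs4 import BeautifulSoup

# Invented miniature page; only the ordering of the tags matters here.
html = """
<html><body>
<script>var unrelated = 1;</script>
<a href="/ContactUs">Contact us</a>
<script>var matchCentreEventTypeJson = {"shotSixYardBox": 0, "pos": 217};</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() locates the anchor tag; find_next() walks forward in document
# order and returns the first <script> that appears after it.
anchor = soup.find("a", href="/ContactUs")
script = anchor.find_next("script")

# Take everything from the first "{" and decode a single JSON value,
# ignoring the trailing ";".
text = script.string
event_dict, _end = json.JSONDecoder().raw_decode(text, text.index("{"))
print(event_dict)
```

The script that appears before the link is skipped because find_next only looks at elements that come later in the document.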
Full code:

import json
import requests
import re

url = 'http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}


r = requests.get(url,  headers=headers)


from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content)
data_cen = re.compile('var matchCentreData = ({.*?})')
event_type = re.compile('var matchCentreEventTypeJson = ({.*?})')

data = soup.find("a", href="/ContactUs").find_next("script").text
d = json.dumps(data_cen.search(data).group(1))
e = json.dumps(event_type.search(data).group(1))

data_dict = json.loads(d)
event_dict = json.loads(e)
print(event_dict)
print(data_dict)

{"shotSixYardBox":0,"shotPenaltyArea":1,"shotOboxTotal":2,"shotOpenPlay":3,"shotCounter":4,"shotSetPiece":5,"shotOffTarget":6,"shotOnPost":7,"shotOnTarget":8,"shotsTotal":9,"shotBlocked":10,"shotRightFoot":11,"shotLeftFoot":12,"shotHead":13,"shotObp":14,"goalSixYardBox":15,"goalPenaltyArea":16,"goalObox":17,"goalOpenPlay":18,"goalCounter":19,"goalSetPiece":20,"penaltyScored":21,"goalOwn":22,"goalNormal":23,"goalRightFoot":24,"goalLeftFoot":25,"goalHead":26,"goalObp":27,"shortPassInaccurate":28,"shortPassAccurate":29,"passCorner":30,"passCornerAccurate":31,"passCornerInaccurate":32,"passFreekick":33,"passBack":34,"passForward":35,"passLeft":36,"passRight":37,"keyPassLong":38,"keyPassShort":39,"keyPassCross":40,"keyPassCorner":41,"keyPassThroughball":42,"keyPassFreekick":43,"keyPassThrowin":44,"keyPassOther":45,"assistCross":46,"assistCorner":47,"assistThroughball":48,"assistFreekick":49,"assistThrowin":50,"assistOther":51,"dribbleLost":52,"dribbleWon":53,"challengeLost":54,"interceptionWon":55,"clearanceHead":56,"outfielderBlock":57,"passCrossBlockedDefensive":58,"outfielderBlockedPass":59,"offsideGiven":60,"offsideProvoked":61,"foulGiven":62,"foulCommitted":63,"yellowCard":64,"voidYellowCard":65,"secondYellow":66,"redCard":67,"turnover":68,"dispossessed":69,"saveLowLeft":70,"saveHighLeft":71,"saveLowCentre":72,"saveHighCentre":73,"saveLowRight":74,"saveHighRight":75,"saveHands":76,"saveFeet":77,"saveObp":78,"saveSixYardBox":79,"savePenaltyArea":80,"saveObox":81,"keeperDivingSave":82,"standingSave":83,"closeMissHigh":84,"closeMissHighLeft":85,"closeMissHighRight":86,"closeMissLeft":87,"closeMissRight":88,"shotOffTargetInsideBox":89,"touches":90,"assist":91,"ballRecovery":92,"clearanceEffective":93,"clearanceTotal":94,"clearanceOffTheLine":95,"dribbleLastman":96,"errorLeadsToGoal":97,"errorLeadsToShot":98,"intentionalAssist":99,"interceptionAll":100,"interceptionIntheBox":101,"keeperClaimHighLost":102,"keeperClaimHighWon":103,"keeperClaimLost":104,"keeperClaimWon":105
,"keeperOneToOneWon":106,"parriedDanger":107,"parriedSafe":108,"collected":109,"keeperPenaltySaved":110,"keeperSaveInTheBox":111,"keeperSaveTotal":112,"keeperSmother":113,"keeperSweeperLost":114,"keeperMissed":115,"passAccurate":116,"passBackZoneInaccurate":117,"passForwardZoneAccurate":118,"passInaccurate":119,"passAccuracy":120,"cornerAwarded":121,"passKey":122,"passChipped":123,"passCrossAccurate":124,"passCrossInaccurate":125,"passLongBallAccurate":126,"passLongBallInaccurate":127,"passThroughBallAccurate":128,"passThroughBallInaccurate":129,"passThroughBallInacurate":130,"passFreekickAccurate":131,"passFreekickInaccurate":132,"penaltyConceded":133,"penaltyMissed":134,"penaltyWon":135,"passRightFoot":136,"passLeftFoot":137,"passHead":138,"sixYardBlock":139,"tackleLastMan":140,"tackleLost":141,"tackleWon":142,"cleanSheetGK":143,"cleanSheetDL":144,"cleanSheetDC":145,"cleanSheetDR":146,"cleanSheetDML":147,"cleanSheetDMC":148,"cleanSheetDMR":149,"cleanSheetML":150,"cleanSheetMC":151,"cleanSheetMR":152,"cleanSheetAML":153,"cleanSheetAMC":154,"cleanSheetAMR":155,"cleanSheetFWL":156,"cleanSheetFW":157,"cleanSheetFWR":158,"cleanSheetSub":159,"goalConcededByTeamGK":160,"goalConcededByTeamDL":161,"goalConcededByTeamDC":162,"goalConcededByTeamDR":163,"goalConcededByTeamDML":164,"goalConcededByTeamDMC":165,"goalConcededByTeamDMR":166,"goalConcededByTeamML":167,"goalConcededByTeamMC":168,"goalConcededByTeamMR":169,"goalConcededByTeamAML":170,"goalConcededByTeamAMC":171,"goalConcededByTeamAMR":172,"goalConcededByTeamFWL":173,"goalConcededByTeamFW":174,"goalConcededByTeamFWR":175,"goalConcededByTeamSub":176,"goalConcededOutsideBoxGoalkeeper":177,"goalScoredByTeamGK":178,"goalScoredByTeamDL":179,"goalScoredByTeamDC":180,"goalScoredByTeamDR":181,"goalScoredByTeamDML":182,"goalScoredByTeamDMC":183,"goalScoredByTeamDMR":184,"goalScoredByTeamML":185,"goalScoredByTeamMC":186,"goalScoredByTeamMR":187,"goalScoredByTeamAML":188,"goalScoredByTeamAMC":189,"goalScoredByTeamAMR":190,"goalS
coredByTeamFWL":191,"goalScoredByTeamFW":192,"goalScoredByTeamFWR":193,"goalScoredByTeamSub":194,"aerialSuccess":195,"duelAerialWon":196,"duelAerialLost":197,"offensiveDuel":198,"defensiveDuel":199,"bigChanceMissed":200,"bigChanceScored":201,"bigChanceCreated":202,"overrun":203,"successfulFinalThirdPasses":204,"punches":205,"penaltyShootoutScored":206,"penaltyShootoutMissedOffTarget":207,"penaltyShootoutSaved":208,"penaltyShootoutSavedGK":209,"penaltyShootoutConcededGK":210,"throwIn":211,"subOn":212,"subOff":213,"defensiveThird":214,"midThird":215,"finalThird":216,"pos":217}
{"playerIdNameDictionary":{"34693":"Marko Arnautovic","23122":"Asmir Begovic","39935":"Steven N'Zonzi","4145":"Robert Huth","3860":"Jonathan Walters","23446":"Marc Wilson","8505":"Glenn Whelan","29762":"Oussama Assaidi","24148":"Erik Pieters","26013":"Mame Biram Diouf","75177":"Marc Muniesa","38772":"Geoff Cameron","107395":"Jack Butland","29798":"Ryan Shawcross","3807":"Peter Crouch","8327":"Charlie Adam","18181":"Phil Bardsley","254558":"Oliver Shenton","130334":"Adnan Januzaj","4092":"Rafael","18701":"Falcao","10620":"Anders Lindegaard","4564":"Robin van Persie","25363":"Juan Mata","71174":"Ander Herrera","79554":"David de Gea","2115":"Michael Carrick","3859":"Wayne Rooney","8166":"Ashley Young","81726":"Phil Jones","118244":"Luke Shaw","137795":"Tyler Blackett","145271":"James Wilson","71345":"Chris Smalling","5835":"Darren Fletcher","22079":"Jonny Evans"}

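Hi, thanks for your reply. I tried your suggestion and got an error saying "ValueError: No JSON object could be decoded".

In that case, check what match.group() returns. I guess there is a mistake in your regex. Check that you are capturing only the {…} part of var matchCentreData = {…}.

What exactly are you trying to get?

@PadraicCunningham Hello again. Most of this site seems to be driven by XHR requests, but annoyingly some pages are not, and a JSON-style object is embedded in the page source instead. In that case I don't see an option other than regex. I would like to know how to reference this object in the source as a JSON/JavaScript item, and then how to reference the sub-components of "matchCentreData". For example, the first sub-component is called "playerIdNameDictionary". Let me know if that doesn't make sense...

In the example above I get an error saying "r" is not defined. What is that meant to be?

In your second example, it isn't clear to me what the line starting with "data =" is doing.

data finds the next script tag after the a tag that links to the contact page. If you look at the source, you will see that it comes right before the script you want.

OK, nearly there. The second regex, "event_type", is not finding a match, though. Since the data object is on a new line after "matchCentreEventTypeJson =", does a "\n" need to be added? Thanks.

I added the full code I am using; it returns the two dicts you see.

Traceback (most recent call last): File "C:\Python27\counter.py", line 23, in <module> e = json.dumps(event_type.search(data).group(1)) AttributeError: 'NoneType' object has no attribute 'group'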
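The AttributeError in the last comment arises because a regex search returns None when nothing matches, and None has no group method. The sketch below uses invented script text in which the object starts on the line after the "=", as the commenter suspected. It shows a pattern that tolerates any whitespace around the "=" and a None check before calling group. Whether the real page is laid out this way is an assumption.

```python
import re

# Invented script text in which the object starts on the line after the "=".
script_text = 'var matchCentreEventTypeJson =\n{"shotSixYardBox": 0};'

# The answer's pattern expects a single space between "=" and "{".
strict = re.compile(r"var matchCentreEventTypeJson = (\{.*?\})")
# \s* also accepts a newline or no whitespace at all; re.S lets "." cross lines.
tolerant = re.compile(r"var matchCentreEventTypeJson\s*=\s*(\{.*?\})", re.S)

strict_match = strict.search(script_text)
tolerant_match = tolerant.search(script_text)

# Check for None before calling .group(); search() returns None on no match.
if strict_match is None:
    print("strict pattern found nothing")
if tolerant_match is not None:
    print(tolerant_match.group(1))
```

The non-greedy capture is enough here only because the sample object is flat. A nested object would need a different way of finding its end.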