Python: How do I extract data from JSON/JavaScript embedded in a web page?

I'm new to Python and only started today. My environment is Python 3.5 on Windows 10, with a few libraries installed.

I want to extract the football player data from the website below and save it as a CSV file.

Problem: I can't get the data out of soup.find_all('script')[17] and into the CSV format I expect. How can I extract the data I want?

My code is shown below:

from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser') #not sure if i need to use lxml
soup.find_all('script')[17] # my target data is in the script at index 17

My expected output looks like this:

position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik

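As I understand it, BeautifulSoup is best suited to parsing HTML, but you are trying to parse JavaScript nested inside the HTML.

So you have two options:

  • Write a function that takes soup.find_all('script')[17], loops over it, searches the string for the data by hand, and extracts it. You could even use ast.literal_eval(string_that_is_really_a_dictionary) to make this simpler. It may not be the best approach, but if you are new to Python you might want to do it this way just for practice.
  • Or this may be the better approach.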
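The first option can be sketched roughly as follows. This is only an illustration on a made-up string: `sample_script` and `extract_players_json` are hypothetical names, and the sample merely imitates a `squad.register_players($.parseJSON('...'))` line rather than the real page content. One caveat worth knowing: `ast.literal_eval` accepts Python literals only, so it raises on JSON values such as `null`, whereas `json.loads` handles them.

```python
import ast
import json
import re

# Hypothetical stand-in for the text of the <script> tag; the real page differs.
sample_script = """
    var squad = new Squad();
    squad.register_players($.parseJSON('[{"position": "ST", "player": 1, "data": {"slot_position": "ST", "slug": "paulo-henrique"}}, {"position": "LM", "player": null, "data": null}]'));
"""

def extract_players_json(script_text):
    """Return the JSON string passed to $.parseJSON('...'), or None if absent."""
    match = re.search(r"\$\.parseJSON\('(.*?)'\)", script_text, re.DOTALL)
    return match.group(1) if match else None

raw = extract_players_json(sample_script)
players = json.loads(raw)  # JSON null becomes Python None

try:
    ast.literal_eval(raw)  # fails on this sample: `null` is not a Python literal
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False

# Keep only filled slots, as (position, slot_position, slug) tuples.
rows = [(p["position"], p["data"]["slot_position"], p["data"]["slug"])
        for p in players if p["player"] is not None]
print(rows)
```

If the embedded JSON contains escaped quotes or backslashes, the captured string may need unescaping before parsing; the sketch does not cover that case.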

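As @Josiah Swain said, this is not going to be pretty. JavaScript is the more natural tool for this kind of thing, because it natively understands what you have.

That said, Python is awesome, so here is your solution: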

    #Same imports as before
    from bs4 import BeautifulSoup
    import re
    from urllib.request import Request, urlopen
    
    #And one more
    import json
    
    # The code you had 
    req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
                   headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage,'html.parser')
    
    # Store the script 
    script = soup.find_all('script')[17]
    
    # Extract the oneline that stores all that JSON
    uncleanJson = [line for line in script.text.split('\n') 
             if line.lstrip().startswith('squad.register_players($.parseJSON') ][0]
    
    # The easiest way to strip away all that yucky JS to get to the JSON
    cleanJSON = uncleanJson.lstrip() \
                           .replace('squad.register_players($.parseJSON(\'', '') \
                           .replace('\'));','')
    
    # Extract out that useful info
    data = [ [p['position'],p['data']['slot_position'],p['data']['slug']] 
             for p in json.loads(cleanJSON)
             if p['player'] is not None]
    
    
    print('position,slot_position,slug')
    for line in data:
        print(','.join(line))
    
This is the result I got when I copied and pasted it into Python:

    position,slot_position,slug
    ST,ST,paulo-henrique
    LM,LM,mugdat-celik
    CAM,CAM,soner-aydogdu
    RM,RM,petar-grbic
    GK,GK,fatih-ozturk
    CDM,CDM,eray-ataseven
    LB,LB,kadir-keles
    CB,CB,caner-osmanpasa
    CB,CB,mustafa-yumlu
    RM,RM,ioan-adrian-hora
    GK,GK,bora-kork
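
Since the goal is a CSV file rather than printed text, the standard `csv` module could take over the output step. The sketch below is an assumption-laden illustration: `write_players_csv` is a hypothetical helper, the two rows are copied from the output above instead of being fetched from the site, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Sample rows, shaped like the `data` list built in the answer's code.
data = [
    ["ST", "ST", "paulo-henrique"],
    ["LM", "LM", "mugdat-celik"],
]

def write_players_csv(rows, fileobj):
    """Write a header plus one line per player to an open text file object."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["position", "slot_position", "slug"])
    writer.writerows(rows)

# Demonstrated with an in-memory buffer; for a real file, one could use
# open("players.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_players_csv(data, buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

Using `csv.writer` instead of `','.join(...)` also quotes any value that happens to contain a comma.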
    

Edit: That is not the easiest code for a beginner to read. Here is a more readable version:

    # ... all the previous code (imports, request, soup)
    script = soup.find_all('script')[17]

    allScriptLines = script.text.split('\n')

    uncleanJson = None
    for line in allScriptLines:
        # Remove leading whitespace (makes the line easier to match)
        cleaner_line = line.lstrip()
        if cleaner_line.startswith('squad.register_players($.parseJSON'):
            uncleanJson = cleaner_line

    # Strip the JavaScript wrapper so that only the JSON text remains
    cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));','')

    print('position,slot_position,slug')
    for player in json.loads(cleanJSON):
        if player['player'] is not None:
            # Join with commas so the output matches the CSV format above
            print(','.join([player['position'], player['data']['slot_position'], player['data']['slug']]))
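
One fragile point in the code above is the hard-coded index 17, which breaks as soon as the page adds or removes a script tag. A possible alternative, sketched here under assumptions, is to scan every script body for the marker line. `find_players_json` and the sample `scripts` list are hypothetical; with BeautifulSoup the list could be built as `[s.string or "" for s in soup.find_all("script")]`.

```python
import json

MARKER = "squad.register_players($.parseJSON('"

def find_players_json(script_texts, marker=MARKER, suffix="'));"):
    """Scan script bodies for the marker line and return the parsed JSON, or None."""
    for text in script_texts:
        for line in text.splitlines():
            line = line.strip()
            if line.startswith(marker) and line.endswith(suffix):
                # Slice off the JavaScript wrapper on both sides.
                return json.loads(line[len(marker):-len(suffix)])
    return None

# Hypothetical script bodies standing in for the real page's scripts.
scripts = [
    "var unrelated = 1;",
    "  squad.register_players($.parseJSON('[{\"position\": \"GK\", \"player\": 7, "
    "\"data\": {\"slot_position\": \"GK\", \"slug\": \"fatih-ozturk\"}}]'));\n",
]

players = find_players_json(scripts)
print(players[0]["data"]["slug"])
```

Returning None when nothing matches lets the caller report a clear error instead of failing on a missing index.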
    

Where is your question? Can you give me some sample code for this problem?

This is a very good solution; thank you very much for taking the time to explain how to solve it. After reading your code, though, it is not easy for a beginner who has only just started learning Python.