Python: How do I extract data from JSON/JavaScript embedded in a web page?

python python-3.x


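I am new to Python and only started today. My environment is Python 3.5 on Windows 10, with a few libraries installed.

I want to extract football player data from the website below into a CSV file.

Problem: I cannot get from the contents of soup.find_all('script')[17] to my expected CSV format. How do I extract the data I want?

My code is shown below: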

from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser') #not sure if i need to use lxml
soup.find_all('script')[17] #My target data is in 17th

My expected output looks something like this:

position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik

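My understanding is that BeautifulSoup is better suited to parsing HTML, but you are trying to parse JavaScript nested inside the HTML.

So you have two options.

Write a function that takes soup.find_all('script')[17], loops over it, searches the string for the data by hand, and extracts it. You could even use ast.literal_eval(string_that_is_really_a_dictionary) to make this simpler. It may not be the best approach, but if you are new to Python you might want to do it this way just for practice.

Or there is what may be a better way.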
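As a rough illustration of the first option, the sketch below searches a script's text by hand and passes the embedded literal to ast.literal_eval. The script_text sample and the extract_literal helper are hypothetical stand-ins, not the real page content. Note that ast.literal_eval only understands Python literals, so it raises an error if the data contains JSON's null, true, or false; json.loads is the safer choice in that case.

```python
import ast

# Hypothetical stand-in for the text of the <script> tag on the page.
script_text = """
squad.register_players($.parseJSON('[{"position": "ST", "data": {"slot_position": "ST", "slug": "paulo-henrique"}}]'));
"""

def extract_literal(text, start_marker="$.parseJSON('", end_marker="'));"):
    """Return the Python object written between the two markers."""
    # str.index raises ValueError if a marker is missing.
    start = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, start)
    # literal_eval evaluates literals only; it never runs arbitrary code.
    return ast.literal_eval(text[start:end])

players = extract_literal(script_text)
print(players[0]["data"]["slug"])
```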


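As @Josiah Swain said, this is not going to be pretty. JS is the more commonly recommended tool for this kind of thing, because it can natively understand what you have.

That said, Python is awesome, so here is your solution: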

#Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

#And one more
import json

# The code you had 
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
               headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser')

# Store the script 
script = soup.find_all('script')[17]

# Extract the oneline that stores all that JSON
uncleanJson = [line for line in script.text.split('\n') 
         if line.lstrip().startswith('squad.register_players($.parseJSON') ][0]

# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
                       .replace('squad.register_players($.parseJSON(\'', '') \
                       .replace('\'));','')

# Extract out that useful info
data = [ [p['position'],p['data']['slot_position'],p['data']['slug']] 
         for p in json.loads(cleanJSON)
         if p['player'] is not None]


print('position,slot_position,slug')
for line in data:
    print(','.join(line))

Copying and pasting this into Python gives me:

position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
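The code above depends on the script being at index 17 and on the exact wrapper text passed to replace, both of which can break if the page changes. One possible alternative, sketched here under assumptions, is to scan every script's text with a regular expression. The script_texts list is a hypothetical stand-in for [s.text for s in soup.find_all('script')], and find_players is an illustrative helper name.

```python
import json
import re

# Hypothetical stand-ins for the text of each <script> tag on the page.
script_texts = [
    "var unrelated = 1;",
    """
    squad.register_players($.parseJSON('[{"position": "ST", "player": 101, "data": {"slot_position": "ST", "slug": "paulo-henrique"}}, {"position": "CB", "player": null, "data": {"slot_position": "CB", "slug": null}}]'));
    """,
]

# Capture whatever sits between the quotes of $.parseJSON('...').
PLAYERS_CALL = re.compile(
    r"squad\.register_players\(\$\.parseJSON\('(.*?)'\)\);", re.DOTALL
)

def find_players(texts):
    """Return the decoded JSON from the first script containing the call."""
    for text in texts:
        match = PLAYERS_CALL.search(text)
        if match:
            return json.loads(match.group(1))
    raise ValueError("no squad.register_players(...) call found")

players = find_players(script_texts)
# Skip empty slots, as the original answer does.
rows = [[p["position"], p["data"]["slot_position"], p["data"]["slug"]]
        for p in players if p["player"] is not None]
print(rows)
```

If the JavaScript string contains backslash escapes, the captured text may need unescaping before json.loads accepts it; the sample here avoids that case.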

Edit: This is not the easiest code for a beginner to read. Here is a version that is easier to follow:

# ... All that previous code 
script = soup.find_all('script')[17]

allScriptLines = script.text.split('\n')

uncleanJson = None
for line in allScriptLines:
     # Remove left whitespace (makes it easier to parse)
     cleaner_line = line.lstrip()
     if cleaner_line.startswith('squad.register_players($.parseJSON'):
          uncleanJson = cleaner_line

cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));','')

print('position,slot_position,slug')
for player in json.loads(cleanJSON):
     if player['player'] is not None:
         print(player['position'], player['data']['slot_position'], player['data']['slug'], sep=',')
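Since the question asks for a CSV file rather than printed text, here is a minimal sketch using the standard csv module. The players list is a hypothetical sample shaped like the result of json.loads(cleanJSON), the write_players_csv helper is an illustrative name, and the players.csv filename in the comment is an assumption.

```python
import csv
import io

# Hypothetical sample shaped like the decoded player data.
players = [
    {"position": "ST", "player": 101,
     "data": {"slot_position": "ST", "slug": "paulo-henrique"}},
    {"position": "LM", "player": 102,
     "data": {"slot_position": "LM", "slug": "mugdat-celik"}},
    {"position": "CB", "player": None,
     "data": {"slot_position": "CB", "slug": None}},
]

def write_players_csv(players, out_file):
    """Write one CSV row per filled slot to an open text file."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["position", "slot_position", "slug"])
    for player in players:
        if player["player"] is not None:
            writer.writerow([player["position"],
                             player["data"]["slot_position"],
                             player["data"]["slug"]])

# Write to an in-memory buffer so the sketch runs without touching disk.
buffer = io.StringIO()
write_players_csv(players, buffer)
print(buffer.getvalue())

# To write a real file instead:
# with open("players.csv", "w", newline="") as f:
#     write_players_csv(players, f)
```

Using csv.writer instead of ','.join also takes care of quoting if a value ever contains a comma.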


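Where is your question, and what exactly is the problem?

Can you give me some sample code for this problem?

This is a very good solution, thank you very much for taking the time to explain how to solve this. That said, having read your code, it is not easy to follow for a beginner who has only just started learning Python.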