Python beginner trying to scrape data and break it down


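I was able to scrape some data from a website, but I'm having trouble breaking it down so I can display it in a table.

The code I'm using is: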

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")
tablesright = soup.find_all('td', 'right')
tablesleft = soup.find_all('td', 'left')
print(tablesright + tablesleft)
This gives me a result like this:

====================== RESTART: E:/2017/Python2/box2.py   ======================
[<td class="right " data-stat="game_start_time">8:01 pm</td>, <td class="right " data-stat="visitor_pts">99</td>, <td class="right " data-stat="home_pts">102</td>, <td class="right " data-stat="game_start_time">10:30 pm</td>, <td class="right " data-stat="visitor_pts">122</td>, <td class="right " data-stat="home_pts">121</td>, <td class="right " data-stat="game_start_time">7:30 pm</td>, <td class="right " data-stat="visitor_pts">108</td>, <td class="right " data-stat="home_pts">100</td>, <td class="right " data-stat="game_start_time">8:30 pm</td>, <td class="right " data-stat="visitor_pts">117</td>, <td class="right " data-stat="home_pts">111</td>, <td class="right " data-stat="game_start_time">7:00 pm</td>, <td class="right " data-stat="visitor_pts">90</td>, <td class="right " data-stat="home_pts">102</td>, <

What I want is a table like this:

Game start time    Home team.     Score.   Away team.    Score
7pm.               Boston.        104.     Golden state.  103

I'm pulling my hair out trying to figure this out.


Thanks in advance.

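Instead of using an HTML parser, you could try reading the data straight into a pandas DataFrame and then decide how to manipulate that DataFrame to display the result you want.

For example: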

import pandas as pd


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
dfs = pd.read_html(url, match="Start")
print(dfs[0])
There are examples in the pandas documentation, and many related questions on Stack Overflow.
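
Once the table is in a DataFrame, getting the layout from the question is mostly a matter of picking and renaming columns. Here is a minimal sketch of that step. It uses a small stand-in DataFrame instead of `dfs[0]`, and the column labels (`Start (ET)`, `Visitor/Neutral`, `PTS`, `Home/Neutral`, `PTS.1`) are my assumptions about what `read_html` returns for this page, so check `dfs[0].columns` before relying on them.

```python
import pandas as pd

# Stand-in for dfs[0]. The column labels are assumptions about the scraped
# schedule table and are not confirmed by the answer above.
games = pd.DataFrame({
    "Start (ET)": ["8:01 pm", "10:30 pm"],
    "Visitor/Neutral": ["Boston Celtics", "Houston Rockets"],
    "PTS": [99, 122],
    "Home/Neutral": ["Cleveland Cavaliers", "Golden State Warriors"],
    "PTS.1": [102, 121],
})

# Map each source column to the heading wanted in the final table.
# The dict order is also the output column order.
wanted = {
    "Start (ET)": "Game start time",
    "Home/Neutral": "Home team",
    "PTS.1": "Home score",
    "Visitor/Neutral": "Away team",
    "PTS": "Away score",
}
table = games[list(wanted)].rename(columns=wanted)

# to_string(index=False) prints the rows without the numeric index column.
print(table.to_string(index=False))
```

With the real data you would replace `games` with `dfs[0]` and adjust the keys of `wanted` to whatever column labels pandas actually reports.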

This will do it. Adapt it to your needs, then use pandas.

import requests
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

rows = soup.select('#schedule > tbody > tr')

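for row in rows:
    rights = row.find_all("td", "right")
    lefts = row.find_all("td", "left")

    # Skip rows without the expected cells (e.g. repeated header rows).
    if len(rights) < 3 or len(lefts) < 2:
        continue

    # start time, visitor team, visitor points, home team, home points
    print(rights[0].text, lefts[0].text, rights[1].text, lefts[1].text, rights[2].text)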

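I don't know whether you want a pandas solution. Here is one without it, using only the more advanced attrs keyword of find_all and standard Python format to get a formatted table.

Note that the numbers in the format string were chosen by hand and do not adapt to the actual data.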

import requests
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")
game_start_times = soup.find_all('td', attrs={"data-stat": "game_start_time", "class": "right"})
visitor_team_names = soup.find_all('td', attrs={"data-stat": "visitor_team_name", "class": "left"})
visitor_ptss = soup.find_all('td', attrs={"data-stat": "visitor_pts", "class": "right"})
home_team_names = soup.find_all('td', attrs={"data-stat": "home_team_name", "class": "left"})
home_pts = soup.find_all('td', attrs={"data-stat": "home_pts", "class": "right"})

for i in range(len(game_start_times)):
    print('{:10s} {:28s} {:5s} {:28s} {:5s}'.format(game_start_times[i].text.strip(),
                                  visitor_team_names[i].text.strip(),
                                  visitor_ptss[i].text.strip(),
                                  home_team_names[i].text.strip(),
                                  home_pts[i].text.strip()))
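
If the hand-picked widths in the format string above turn out too narrow or too wide, one option is to compute each column's width from the data before printing. The following is only a sketch of that idea: `format_table` is a hypothetical helper name, and the two sample rows are copied from the output shown elsewhere in this thread rather than scraped live.

```python
def format_table(headers, rows):
    """Return one string per row, each column padded to its widest cell."""
    table = [list(headers)] + [[str(cell) for cell in row] for row in rows]
    # Width of a column = length of the longest cell in that column.
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    return [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    ]


headers = ["Start", "Visitor", "Pts", "Home", "Pts"]
rows = [
    ("8:01 pm", "Boston Celtics", "99", "Cleveland Cavaliers", "102"),
    ("10:30 pm", "Houston Rockets", "122", "Golden State Warriors", "121"),
]

for line in format_table(headers, rows):
    print(line)
```

In the scraping code, `rows` would be built from the `.text.strip()` values of the five `find_all` result lists instead of being typed in by hand.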


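For such a simple structure, I would just drop those libraries and use re (regular expressions).

The first findall grabs all the tr tags.

Then a second findall grabs all the td/th tags inside each tr tag.

Then a sub strips out any tags inside the fields (mostly a tags).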
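
To make the three steps concrete, here is a small sketch that runs them one at a time on a single hand-written row. The `html` string is a hypothetical sample shaped like the schedule table's markup (the `href="#"` links are placeholders), so no network access is needed to follow along.

```python
import re

# Hypothetical one-row sample shaped like the schedule table's markup.
html = (
    '<tr ><th scope="row" class="left " data-stat="date_game">'
    '<a href="#">Tue, Oct 17, 2017</a></th>'
    '<td class="right " data-stat="game_start_time">8:01 pm</td>'
    '<td class="left " data-stat="visitor_team_name">'
    '<a href="#">Boston Celtics</a></td>'
    '<td class="right " data-stat="visitor_pts">99</td></tr>'
)

# Step 1: capture the inside of every <tr ...> element.
rows = re.findall(r'<tr.+?>(.+?)</tr', html)

# Step 2: inside the row, capture (data-stat name, raw cell contents)
# for every <td> or <th>.
cells = re.findall(r'<t[dh].+?data-stat="(.*?)".*?>(.*?)</t[dh]', rows[0])

# Step 3: strip any nested tags (here the <a> links) from each cell.
game = {name: re.sub(r'<.+?>', '', raw) for name, raw in cells}

print(game)
```

Regular expressions like these depend on the exact markup, so treat this as fitting this one page rather than as a general HTML parser.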

Or, in the style of your example:

cols = [
        ['game_start_time',15,"Game start time"],
        ['home_team_name',25,"Home team."],
        ['home_pts',7,"Score."],
        ['visitor_team_name',25,"Away team."],
        ['visitor_pts',7,"Score."]
       ]

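# `data` is the list of per-game dicts built by the regex script in this answer.
# Header row: right-align each label to its column width.
for col in cols:
  print("%*s" % (col[1], col[2]), end=" ")
print()

# One line per game, using the same widths as the header.
for game in data:
  for col in cols:
    print("%*s" % (col[1], game[col[0]]), end=" ")
  print()
This produces the following: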

Game start time                Home team.  Score.                Away team.  Score.
        8:01 pm       Cleveland Cavaliers     102            Boston Celtics      99
       10:30 pm     Golden State Warriors     121           Houston Rockets     122
        7:30 pm            Boston Celtics     100           Milwaukee Bucks     108
        8:30 pm          Dallas Mavericks     111             Atlanta Hawks     117
        7:00 pm           Detroit Pistons     102         Charlotte Hornets      90
        7:00 pm            Indiana Pacers     140             Brooklyn Nets     131
        8:00 pm         Memphis Grizzlies     103      New Orleans Pelicans      91
...
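
Upvoted because they took the time to format the answer exactly as the OP asked :D

Hi guys. Thanks for your help. I understand it, and it works for me too. One more quick question: on the site, the first field is the date, so why doesn't the scrape return it? The first field it returns is the time... Do I need different code to get it?

The date is a th, not a td. You have to scrape it like this:
date = row.find_all("th", "left")
and then:
print(date[0].text)
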
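
To show how the date cell fits in with the rest of the row, here is a short sketch that reads the th and the td cells together. The `html` string is a hypothetical one-row sample shaped like the schedule table, and looking cells up by their `data-stat` attribute is my own choice here, not something the comment above prescribes.

```python
from bs4 import BeautifulSoup

# Hypothetical single-row sample shaped like the schedule table.
html = """
<table id="schedule"><tbody>
<tr><th class="left " data-stat="date_game">Tue, Oct 17, 2017</th>
<td class="right " data-stat="game_start_time">8:01 pm</td>
<td class="left " data-stat="visitor_team_name">Boston Celtics</td>
<td class="right " data-stat="visitor_pts">99</td>
<td class="left " data-stat="home_team_name">Cleveland Cavaliers</td>
<td class="right " data-stat="home_pts">102</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

games = []
for row in soup.select("#schedule > tbody > tr"):
    # The date lives in a <th>, so it has to be looked up separately.
    date_cell = row.find("th", "left")
    if date_cell is None:
        continue
    # Index the remaining cells by their data-stat attribute.
    cells = {td["data-stat"]: td.text.strip() for td in row.find_all("td")}
    games.append((
        date_cell.text.strip(),
        cells["game_start_time"],
        cells["home_team_name"],
        cells["home_pts"],
        cells["visitor_team_name"],
        cells["visitor_pts"],
    ))

for game in games:
    print(*game, sep=" | ")
```

Keying on `data-stat` avoids counting positions among the "left" and "right" cells, which can shift if the site adds a column.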
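#!/usr/bin/env python3
# Regex-only scraper: builds one dict per table row, keyed by data-stat.

import requests
import re

url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)
content = r.text  # str, so the str patterns below apply (r.content is bytes)

data = [
    {
        # strip any nested tags (mostly <a>) from each cell's contents
        k: re.sub(r'<.+?>', '', v)
        for (k, v) in re.findall(r'<t[dh].+?data-stat="(.*?)".*?>(.*?)</t[dh]', tr)
    }
    for tr in re.findall(r'<tr.+?>(.+?)</tr', content)
]

# Print each game's date followed by every field captured for it.
for game in data:
    print("%s" % game['date_game'])
    for info in game:
        print("  %s = %s" % (info, game[info]))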
$ ./scores_url.py 
Tue, Oct 17, 2017
  game_remarks = 
  box_score_text = Box Score
  home_team_name = Cleveland Cavaliers
  visitor_team_name = Boston Celtics
  game_start_time = 8:01 pm
  date_game = Tue, Oct 17, 2017
  overtimes = 
  visitor_pts = 99
  home_pts = 102
Tue, Oct 17, 2017
  game_remarks = 
  box_score_text = Box Score
  home_team_name = Golden State Warriors
  visitor_team_name = Houston Rockets
  game_start_time = 10:30 pm
  date_game = Tue, Oct 17, 2017
  overtimes = 
  visitor_pts = 122
  home_pts = 121
Wed, Oct 18, 2017
  game_remarks = 
  box_score_text = Box Score
  home_team_name = Boston Celtics
  visitor_team_name = Milwaukee Bucks
  game_start_time = 7:30 pm
  date_game = Wed, Oct 18, 2017
  overtimes = 
  visitor_pts = 108
  home_pts = 100
...