Python: scraping web data with BeautifulSoup
I am trying to get the chance of rain and the temperature/wind speed for each baseball game from rotowire.com. Once I have scraped the data, I will convert it into three columns: rain, temperature, and wind. Thanks to another user I got close to the data, but I cannot quite get all the way there. I have tried two approaches.
First approach:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.rotowire.com/baseball/daily-lineups.php'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

weather = []
for i in soup.select(".lineup__bottom"):
    forecast = i.select_one('.lineup__weather-text').text
    weather.append(forecast)
This returns:
['\n100% Rain\r\n 66°\xa0\xa0Wind 8 mph In ', '\n0% Rain\r\n 64°\xa0\xa0Wind 4 mph L-R ', '\n0% Rain\r\n 69°\xa0\xa0Wind 7 mph In ', '\nDome\r\n In Domed Stadium\r\n ', '\n0% Rain\r\n 75°\xa0\xa0Wind 10 mph Out ', '\n0% Rain\r\n 68°\xa0\xa0Wind 9 mph R-L ', '\n0% Rain\r\n 82°\xa0\xa0Wind 9 mph ', '\n0% Rain\r\n 81°\xa0\xa0Wind 5 mph R-L ', '\nDome\r\n In Domed Stadium\r\n ', '\n1% Rain\r\n 75°\xa0\xa0Wind 4 mph R-L ', '\n1% Rain\r\n 71°\xa0\xa0Wind 6 mph Out ', '\nDome\r\n In Domed Stadium\r\n ']
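The strings in the list above already hold everything needed for the three columns, so one option is to parse them directly. The following is a minimal sketch, assuming every outdoor forecast follows the layout "<n>% Rain", "<n>°", "Wind <n> mph [direction]" seen in that output; `RAW`, `PATTERN`, and `parse_forecast` are illustrative names, not part of the original code.

```python
import re

# Sample strings copied from the scraped list shown above.
RAW = [
    '\n100% Rain\r\n 66°\xa0\xa0Wind 8 mph In ',
    '\nDome\r\n In Domed Stadium\r\n ',
    '\n0% Rain\r\n 82°\xa0\xa0Wind 9 mph ',
]

# Assumed layout: "<n>% Rain", then "<n>°", then "Wind <n> mph [direction]".
# \s also matches the non-breaking spaces (\xa0) between temperature and wind.
PATTERN = re.compile(
    r"(?P<rain>\d+)%\s+Rain\s+(?P<temp>\d+)°\s+Wind\s+(?P<wind>\d+)\s+mph\s*(?P<direction>\S*)"
)


def parse_forecast(text):
    """Return rain %, temperature, wind speed and direction, or None values."""
    match = PATTERN.search(text)
    if match is None:
        # Dome games (or any unexpected format) carry no forecast.
        return {"rain": None, "temp": None, "wind": None, "direction": None}
    return {
        "rain": int(match["rain"]),
        "temp": int(match["temp"]),
        "wind": int(match["wind"]),
        "direction": match["direction"] or None,
    }


rows = [parse_forecast(item) for item in RAW]
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`. Returning `None` for dome games keeps the numeric columns numeric instead of mixing in text.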
The second approach I tried:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.rotowire.com/baseball/daily-lineups.php'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

#weather = []
for i in soup.select(".lineup__bottom"):
    forecast = i.select_one('.lineup__weather-text').text
    weather.append(forecast)
    #print(forecast)
    rain = i.select_one('.lineup__weather-text b:contains("Rain") ~ span').text
This raises an AttributeError: 'NoneType' object has no attribute 'text'.
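The error comes from calling `.text` on the result of `select_one`, which returns `None` when nothing matches the selector. The sketch below reproduces that on a small inline document and shows a guard. The markup in `SAMPLE_HTML` is a hypothetical simplification modelled on the output in the question, not the site's real HTML, and the helper names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the output shown in the question.
SAMPLE_HTML = """
<div class="lineup__bottom">
  <div class="lineup__weather-text"><b>100% Rain</b> 66°&nbsp;&nbsp;Wind 8 mph In</div>
</div>
<div class="lineup__bottom">
  <div class="lineup__weather-text"><b>Dome</b> In Domed Stadium</div>
</div>
<div class="lineup__bottom"></div>
"""


def span_after_bold(html):
    """Run a 'b ~ span' selector per card; it matches nothing in this markup."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        bottom.select_one(".lineup__weather-text b ~ span")
        for bottom in soup.select(".lineup__bottom")
    ]


def rain_labels(html):
    """Return the bold weather label of each card, or None when it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    for bottom in soup.select(".lineup__bottom"):
        bold = bottom.select_one(".lineup__weather-text > b")
        # Guard against None before touching .text / get_text().
        labels.append(bold.get_text(strip=True) if bold is not None else None)
    return labels
```

In this sample there is no `<span>` after the `<b>` tag, so every `select_one` call in `span_after_bold` yields `None`, and appending `.text` to any of them would raise the same AttributeError.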
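You can find the cards with the game information and, at the bottom of each card, the weather data (if present):
To get all the data, see this example: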
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

weather = []
for tag in soup.select(".lineup__bottom"):
    # The team abbreviations sit in the preceding "lineup__teams" element.
    header = tag.find_previous(class_="lineup__teams").get_text(
        strip=True, separator=" vs "
    )
    # The <b> tag holds "100% Rain" or "Dome"; the text node right after it
    # holds the temperature and wind (or "In Domed Stadium").
    rain = tag.select_one(".lineup__weather-text > b")
    forecast_info = rain.next_sibling.split()
    temp = forecast_info[0]
    wind = forecast_info[2]
    weather.append(
        {"Header": header, "Rain": rain.text.split()[0], "Temp": temp, "Wind": wind}
    )

df = pd.DataFrame(weather)
print(df)
Output with the full rain, temperature, and wind text per game:
    header      rain       temperature       wind
0   PHI vs CIN  100% Rain  66                8 mph In
1   CWS vs CLE  0% Rain    64                4 mph L-R
2   SD vs CHC   0% Rain    69                7 mph In
3   NYM vs ARI  Dome       In Domed Stadium  In Domed Stadium
4   MIN vs BAL  0% Rain    75                9 mph Out
5   TB vs NYY   0% Rain    68                9 mph R-L
6   MIA vs TOR  0% Rain    81                6 mph L-R
7   WAS vs ATL  0% Rain    81                4 mph R-L
8   BOS vs HOU  Dome       In Domed Stadium  In Domed Stadium
9   TEX vs COL  0% Rain    76                6 mph
10  STL vs LAD  0% Rain    73                4 mph Out
11  OAK vs SEA  Dome       In Domed Stadium  In Domed Stadium
Output with the columns Header, Rain, Temp, and Wind:
    Header      Rain  Temp  Wind
0   PHI vs CIN  100%  66°   8
1   CWS vs CLE  0%    64°   4
2   SD vs CHC   0%    69°   7
3   NYM vs ARI  Dome  In    Stadium
4   MIN vs BAL  0%    75°   9
5   TB vs NYY   0%    68°   9
6   MIA vs TOR  0%    81°   6
7   WAS vs ATL  0%    81°   4
8   BOS vs HOU  Dome  In    Stadium
9   TEX vs COL  0%    76°   6
10  STL vs LAD  0%    73°   4
11  OAK vs SEA  Dome  In    Stadium
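Damn, beat me to it. +. I used
for lineup in soup.select('.lineup:not(.is-ad, .is-tools)'):
and the maxsplit argument to split temp and wind.
@QHarr I spent some time on it too :-) I would like to see your approach as well (unless it is the same).
The difference isn't big enough. I like that you added who is playing; I was undecided about that.
@QHarr You can still post it as an answer. I'll upvote it.
@ShawnSchreier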
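The comment above mentions using the maxsplit argument to separate temperature from wind. The commenter's code is not shown, so the following is only one possible reading of that idea: split the text that follows the rain label once, on the first run of whitespace. `split_temp_wind` is an illustrative name.

```python
def split_temp_wind(text):
    """Split the text after the rain label into (temperature, wind).

    Splitting once keeps the wind description ("Wind 8 mph In") in one piece.
    str.split() without a separator also treats \xa0 as whitespace.
    """
    parts = text.strip().split(maxsplit=1)
    temp = parts[0] if parts else ""
    wind = parts[1] if len(parts) > 1 else ""
    return temp, wind


outdoor = split_temp_wind('\r\n 66°\xa0\xa0Wind 8 mph In ')
dome = split_temp_wind(' In Domed Stadium\r\n ')
```

Compared with indexing into a full `split()`, this keeps the wind direction attached to the speed, at the cost of leaving the word "Wind" in the second value. Dome games still need separate handling, since their text has no temperature.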