How do I remove characters from a pandas DataFrame when web scraping with Python?
I am trying to scrape a table from this website into a .csv file using Python 3. The table starts out like this:
Revised Schedule Original Schedule
Date Time Game Net Time Game Net
Sun., 12/25/11 12 PM BOS (1) at NY (1) TNT 12 PM BOS (7) at NY (7) ESPN
Sun., 12/25/11 2:30 PM MIA (1) at DAL (1) ABC 2:30 PM MIA (8) at DAL (5) ABC
Sun., 12/25/11 5 PM CHI (1) at LAL (1) ABC 5 PM CHI (6) at LAL (9) ABC
Sun., 12/25/11 8 PM ORL (1) at OKC (1) ESPN no game no game no game
Sun., 12/25/11 10:30 PM LAC (1) at GS (1) ESPN no game no game no game
Tue., 12/27/11 8 PM BOS (2) at MIA (2) TNT no game no game no game
Tue., 12/27/11 10:30 PM UTA (1) at LAL (2) TNT no game no game no game
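I am only interested in the Revised Schedule, which is the first 4 columns. I want the output in the .csv file to look like this:
I am using these packages: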
import re
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from itertools import groupby
This is the code I wrote to match the desired format:
df = pd.read_html("https://www.sportsmediawatch.com/2011/12/revised-2011-12-nba-national-tv-schedule/", header=0)[0]
revisedCols = ['Date'] + [ col for col in df.columns if 'Revised' in col ]
df = df[revisedCols]
df.columns = df.iloc[0,:]
df = df.iloc[1:,:].reset_index(drop=True)
# Format Date to m/d/y
df['Date'] = np.where(df.Date.str.startswith(('10/', '11/', '12/')), df.Date + ' 11', df.Date + ' 12')
df['Date']=pd.to_datetime(df['Date'])
df['Date']=df['Date'].dt.strftime('%m/%d/%Y')
# Split the Game column
df[['Away','Home']] = df.Game.str.split('at',expand=True)
# Final dataframe with desired columns
df = df[['Date','Time','Away','Home','Net']]
df.columns = ['Date', 'Time', 'Away', 'Home', 'Network']
print(df)
Output:
Date Time Away Home Network
0 12/25/2011 12 PM BOS (1) NY (1) TNT
1 12/25/2011 2:30 PM MIA (1) DAL (1) ABC
2 12/25/2011 5 PM CHI (1) LAL (1) ABC
3 12/25/2011 8 PM ORL (1) OKC (1) ESPN
4 12/25/2011 10:30 PM LAC (1) GS (1) ESPN
5 12/27/2011 8 PM BOS (2) MIA (2) TNT
6 12/27/2011 10:30 PM UTA (1) LAL (2) TNT
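I noticed that in the Away and Home columns there is a (1), (2), etc. next to each team name. How can I remove the (1), (2), etc. next to each team name in the Away and Home columns?
You can use str.replace with a regular expression that matches the parentheses and the digits inside them, followed by str.strip, because there appears to be some whitespace at the start or end of the values: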
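# regex=True is needed on pandas 2.x, where str.replace treats the pattern as a literal string by default
df['Away'] = df['Away'].str.replace(r'\(\d*\)', '', regex=True).str.strip()
df['Home'] = df['Home'].str.replace(r'\(\d*\)', '', regex=True).str.strip()
print(df.head())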
Date Time Away Home Network
0 12/25/2011 12 PM BOS NY TNT
1 12/25/2011 2:30 PM MIA DAL ABC
2 12/25/2011 5 PM CHI LAL ABC
3 12/25/2011 8 PM ORL OKC ESPN
4 12/25/2011 10:30 PM LAC GS ESPN
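To try the regex approach without hitting the website, here is a minimal sketch on a hand-built frame. The sample values are copied from the output shown in the question, and the surrounding whitespace is an assumption about what the split on 'at' leaves behind. The pattern here uses \d+ rather than \d*, and also swallows whitespace before the parenthesis; that is a small variation, not the answer's exact pattern.

```python
import pandas as pd

# Sample values taken from the question's output; the stray spaces are assumed.
df = pd.DataFrame({
    "Away": ["BOS (1) ", "MIA (1) ", "UTA (1) "],
    "Home": [" NY (1)", " DAL (1)", " LAL (2)"],
})

for col in ["Away", "Home"]:
    # Remove "(digits)" plus any whitespace in front of it, then trim the ends.
    df[col] = df[col].str.replace(r"\s*\(\d+\)", "", regex=True).str.strip()

print(df)
```

Passing regex=True explicitly keeps the behaviour the same across pandas versions, since the default for that parameter changed in pandas 2.0.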
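You can add this code after splitting the Game column. Note that it cuts a fixed four characters off the end of each value, so it assumes the number in parentheses has a single digit: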
df['Away']=df['Away'].astype(str).str[0:-4]
df['Home']=df['Home'].astype(str).str[0:-4]
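The fixed-width slice works only while the parenthesised number has one digit. The following sketch compares it with a regex on two made-up values; the second value, with a two-digit number, is hypothetical and only there to show the difference.

```python
import pandas as pd

# Hypothetical values: the second one has a two-digit number in parentheses.
away = pd.Series(["BOS (1) ", "BOS (10) "])

# Fixed-width slice: drops the last four characters regardless of content.
sliced = away.str[:-4].str.strip()

# Regex: drops whatever "(digits)" group is present.
cleaned = away.str.replace(r"\(\d+\)", "", regex=True).str.strip()

print(sliced.tolist())
print(cleaned.tolist())
```

If the scraped table ever contains numbers of 10 or more, the slice would leave part of the parenthesis behind, while the regex would not.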
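Instead of splitting the Game column at 'at', call .split() without specifying a separator. It will then split on every run of whitespace, and you only need the values at index 0 and index 3. So just change one line of code:
from df[['Away','Home']] = df.Game.str.split('at', expand=True)
to df[['Away','Home']] = df.Game.str.split(expand=True)[[0,3]]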
import pandas as pd
import numpy as np
df = pd.read_html("https://www.sportsmediawatch.com/2011/12/revised-2011-12-nba-national-tv-schedule/", header=0)[0]
revisedCols = ['Date'] + [ col for col in df.columns if 'Revised' in col ]
df = df[revisedCols]
df.columns = df.iloc[0,:]
df = df.iloc[1:,:].reset_index(drop=True)
# Format Date to m/d/y
df['Date'] = np.where(df.Date.str.startswith(('10/', '11/', '12/')), df.Date + ' 11', df.Date + ' 12')
df['Date']=pd.to_datetime(df['Date'])
df['Date']=df['Date'].dt.strftime('%m/%d/%Y')
# Split the Game column
df[['Away','Home']] = df.Game.str.split(expand=True)[[0,3]]
# Final dataframe with desired columns
df = df[['Date','Time','Away','Home','Net']]
df.columns = ['Date', 'Time', 'Away', 'Home', 'Network']
print(df)
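The whitespace split can be checked in isolation. This is a small sketch on two Game strings copied from the table in the question; the variable names are illustrative only.

```python
import pandas as pd

# Two Game values as they appear in the question's table.
games = pd.Series(["BOS (1) at NY (1)", "UTA (1) at LAL (2)"], name="Game")

# With no separator, str.split breaks on whitespace, giving five tokens per row:
# away team, "(n)", "at", home team, "(n)".
parts = games.str.split(expand=True)

# Keep token 0 (away team) and token 3 (home team).
teams = parts[[0, 3]].rename(columns={0: "Away", 3: "Home"})
print(teams)
```

This relies on every Game value having exactly that five-token shape. A team name containing a space, or a row without the parenthesised number, would shift the positions, so it is worth checking the number of columns in parts before indexing.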
This replaces the column headers rather than the values. Here, the OP wants to replace the values inside the columns.