Python pandas: handling empty cells
I'm really struggling to scrape football players' details into a workable pandas table using BeautifulSoup. The problem is that some of the data I collect is "extra" and fills rows of my table with junk. For example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0"}
page = requests.get('https://www.transfermarkt.co.uk/manchester-united/startseite/verein/985', headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')
playerdata = soup.find_all(class_='posrela')
names = [';'.join(pt.findAll(text=True)) for pt in playerdata]
df = pd.DataFrame(names)
df = pd.DataFrame([sub.split(";") for sub in names])
print(df.replace('^$', np.nan, regex=True))
Result:
python testing5.py
0 1 2 3
0 David de Gea D. de Gea Keeper None
1 Sergio Romero S. Romero Keeper None
2 Joel Pereira J. Pereira Keeper None
3 Eric Bailly E. Bailly Centre-Back
4 Victor Lindelöf V. Lindelöf Centre-Back None
5 Marcos Rojo M. Rojo Centre-Back
6 Chris Smalling C. Smalling Centre-Back None
7 Phil Jones P. Jones Centre-Back
8 Daley Blind D. Blind Left-Back None
9 Luke Shaw Luke Shaw Left-Back None
10 Matteo Darmian M. Darmian Right-Back None
11 Antonio Valencia A. Valencia Right-Back None
12 Nemanja Matic N. Matic Defensive Midfield None
13 Michael Carrick M. Carrick Defensive Midfield
14 Paul Pogba P. Pogba Central Midfield None
15 Ander Herrera A. Herrera Central Midfield None
16 Marouane Fellaini M. Fellaini Central Midfield None
17 Ashley Young A. Young Left Midfield None
18 Henrikh Mkhitaryan H. Mkhitaryan Attacking Midfield None
19 Juan Mata Juan Mata Attacking Midfield None
20 Jesse Lingard J. Lingard Left Wing None
21 Romelu Lukaku R. Lukaku Centre-Forward None
22 Anthony Martial A. Martial . Centre-Forward
23 Marcus Rashford M. Rashford Centre-Forward None
24 Zlatan Ibrahimovic Z. Ibrahimovic Centre-Forward
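As you can see, where I have scraped empty data, it has pushed the data into the wrong cell. You might ask why I have column 4: I will be inserting more data into it, but for now I need to clean up column 3.
As you can also see, I have tried a regex to replace the blanks with NaN in the first instance. But whatever I do, I cannot seem to "select" the empty cells. I cannot find them.
When I tried to treat "names" as a list, the interpreter told me that it is not a list but a ResultSet.
I wonder whether anyone can help. I have made a lot of progress as a programmer, but I am stuck here.
You can use post-processing: where column 3 is not NaN, copy its value into column 2 using loc, then drop column 3: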
playerdata = soup.find_all(class_='posrela')
names = [list(pt.findAll(text=True)) for pt in playerdata]
df = pd.DataFrame(names)
df.loc[df[3].notnull(), 2] = df[3]
df = df.drop(3, axis=1)
print (df)
0 1 2
0 David de Gea D. de Gea Keeper
1 Sergio Romero S. Romero Keeper
2 Joel Pereira J. Pereira Keeper
3 Eric Bailly E. Bailly Centre-Back
4 Victor Lindelöf V. Lindelöf Centre-Back
5 Marcos Rojo M. Rojo Centre-Back
6 Chris Smalling C. Smalling Centre-Back
7 Phil Jones P. Jones Centre-Back
8 Daley Blind D. Blind Left-Back
9 Luke Shaw Luke Shaw Left-Back
10 Matteo Darmian M. Darmian Right-Back
11 Antonio Valencia A. Valencia Right-Back
12 Nemanja Matic N. Matic Defensive Midfield
13 Michael Carrick M. Carrick Defensive Midfield
14 Paul Pogba P. Pogba Central Midfield
15 Ander Herrera A. Herrera Central Midfield
16 Marouane Fellaini M. Fellaini Central Midfield
17 Ashley Young A. Young Left Midfield
18 Henrikh Mkhitaryan H. Mkhitaryan Attacking Midfield
19 Juan Mata Juan Mata Attacking Midfield
20 Jesse Lingard J. Lingard Left Wing
21 Romelu Lukaku R. Lukaku Centre-Forward
22 Anthony Martial A. Martial Centre-Forward
23 Marcus Rashford M. Rashford Centre-Forward
24 Zlatan Ibrahimovic Z. Ibrahimovic Centre-Forward
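On the regex attempt in the question: the pattern '^$' only matches a cell that is a truly empty string, so a cell holding a space or a non-breaking space is left untouched. Below is a minimal sketch of a whitespace-tolerant version. It assumes the stray cells contain only whitespace, which I have not verified against the live page, and the sample rows are made up to mirror the shape of the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the third item of the 4-item rows is whitespace only.
rows = [
    ["David de Gea", "D. de Gea", "Keeper"],
    ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
    ["Phil Jones", "P. Jones", "\xa0", "Centre-Back"],
]
df = pd.DataFrame(rows)

# r"^\s*$" matches empty strings and whitespace-only strings alike.
df = df.replace(r"^\s*$", np.nan, regex=True)

# Fill the gaps in column 2 from column 3, then drop the helper column.
df[2] = df[2].fillna(df[3])
df = df.drop(columns=3)
print(df)
```

If the stray cells turn out to hold something other than whitespace, the pattern would need to be adjusted to whatever they actually contain.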
I also tried to improve your solution a bit, fixing each row while the list is being built:
playerdata = soup.find_all(class_='posrela')
names = []
for pt in playerdata:
    L = list(pt.findAll(text=True))
    # check the length of the list
    if len(L) == 4:
        # assign the 4th value to the 3rd
        L[2] = L[3]
    # append the first 3 values of the list
    names.append(L[:3])
df = pd.DataFrame(names)
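To try the row-fixing logic without hitting the website, the same idea can be sketched on hard-coded lists. The helper name normalize_row and the sample data are hypothetical; the only assumption carried over is that a 4-item row has a stray third item and the real position in the fourth.

```python
def normalize_row(texts):
    """Return [name, short name, position] from one player's text nodes."""
    row = list(texts)
    if len(row) == 4:
        # The third item is the stray node; the real position is the fourth.
        row[2] = row[3]
    # Keep only the first three values.
    return row[:3]


# Hypothetical sample mirroring what findAll(text=True) appears to return.
sample = [
    ["David de Gea", "D. de Gea", "Keeper"],
    ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
]
names = [normalize_row(texts) for texts in sample]
print(names)
```

The resulting list of 3-item rows can be passed to pd.DataFrame exactly as in the code above.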
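If you are going to extract more data, I suggest extracting it all in an order that drops easily into a data frame. Unless the data is extracted in the right format, you will have to keep running unnecessary cleanup operations.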
playerdata = soup.find_all(class_='inline-table')
names = [[x.find('img')['title'],
          x.find_all(class_='spielprofil_tooltip')[-1].renderContents(),
          x.find_all('tr')[-1].find('td').renderContents()] for x in playerdata]
df = pd.DataFrame(names, columns=['Name', 'Short', 'Position'])
Name Short Position
0 David de Gea D. de Gea Keeper
1 Sergio Romero S. Romero Keeper
2 Joel Pereira J. Pereira Keeper
3 Eric Bailly E. Bailly Centre-Back
4 Victor Lindelöf V. Lindelöf Centre-Back
5 Marcos Rojo M. Rojo Centre-Back
6 Chris Smalling C. Smalling Centre-Back
7 Phil Jones P. Jones Centre-Back
8 Daley Blind D. Blind Left-Back
9 Luke Shaw Luke Shaw Left-Back
10 Matteo Darmian M. Darmian Right-Back
11 Antonio Valencia A. Valencia Right-Back
12 Nemanja Matic N. Matic Defensive Midfield
13 Michael Carrick M. Carrick Defensive Midfield
14 Paul Pogba P. Pogba Central Midfield
15 Ander Herrera A. Herrera Central Midfield
16 Marouane Fellaini M. Fellaini Central Midfield
17 Ashley Young A. Young Left Midfield
18 Henrikh Mkhitaryan H. Mkhitaryan Attacking Midfield
19 Juan Mata Juan Mata Attacking Midfield
20 Jesse Lingard J. Lingard Left Wing
21 Romelu Lukaku R. Lukaku Centre-Forward
22 Anthony Martial A. Martial Centre-Forward
23 Marcus Rashford M. Rashford Centre-Forward
24 Zlatan Ibrahimovic Z. Ibrahimovic Centre-Forward
25 Romelu Lukaku Romelu Lukaku Centre-Forward
26 Paul Pogba Paul Pogba Central Midfield
27 Anthony Martial Anthony Martial Centre-Forward
28 Marcus Rashford Marcus Rashford Centre-Forward
29 Eric Bailly Eric Bailly Centre-Back
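Rows 25 to 29 in the output above repeat players that already appear earlier, with the full name in the Short column. Presumably the 'inline-table' class also matches a second table on the page, though that is an assumption. A minimal sketch of dropping such repeats by name, on a made-up sample with the same columns:

```python
import pandas as pd

# Hypothetical sample mirroring the shape of the output above.
df = pd.DataFrame(
    [
        ["Paul Pogba", "P. Pogba", "Central Midfield"],
        ["Romelu Lukaku", "R. Lukaku", "Centre-Forward"],
        ["Romelu Lukaku", "Romelu Lukaku", "Centre-Forward"],
        ["Paul Pogba", "Paul Pogba", "Central Midfield"],
    ],
    columns=["Name", "Short", "Position"],
)

# Keep the first row seen for each player name and renumber the index.
deduped = df.drop_duplicates(subset="Name", keep="first").reset_index(drop=True)
print(deduped)
```

Keeping the first occurrence preserves the abbreviated short names, since in the output above the repeats come last.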
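Thanks, this works perfectly. Now let me think it through. I have used loc before to pick out "cells", but the rest needs some thought. **Just noticed your edit, thanks again.
Glad I could help. I also improved your solution a bit and added another one. Have a nice weekend!
Great answer. I did indeed struggle with BeautifulSoup (as you rightly suggested) to get the right source data in the first place. Clearly I did not do it very well! Not a great selection. However, I suspect you are 100% right that getting the right source first is a more efficient way of doing things. Thanks.
@charliedontsurf, glad to help! I believe you will have to scrape more data. If you are using Chrome, I like right-click > Inspect; it is the best tool for this kind of work. You can move down the tree and highlight the matching location on the page. Then try to filter the elements level by level with bs4 :)
I have one more question (I have made good progress!): some of the items my scrape returns are sometimes prefixed with b' and sometimes come back as [b'…']. I have no idea why! I would like to clean up the results before continuing. I guess it has something to do with the type of data I am scraping...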
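On the b'...' prefix: in Python 3 that is how a bytes object is printed, and as far as I know renderContents() in bs4 returns bytes rather than str, which would explain it. A [b'...'] result is then simply a list holding one bytes object. Below is a minimal sketch of a cleanup helper; the name to_text is hypothetical and UTF-8 is an assumed encoding.

```python
def to_text(value, encoding="utf-8"):
    """Return a plain str for bytes, str, or a list/tuple of either."""
    if isinstance(value, (list, tuple)):
        # [b'...'] style results: decode each item and join with spaces.
        return " ".join(to_text(item, encoding) for item in value)
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


print(to_text(b"Centre-Back"))
print(to_text([b"Victor Lindel\xc3\xb6f"]))
```

Asking BeautifulSoup for text directly, for example with get_text() on the tag, should avoid the bytes in the first place.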