Python pandas - handling empty cells


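I'm really struggling to scrape football player details into a workable pandas table using beautifulsoup.

The problem is that some of the data I collect is "extra" and fills the rows of my table with junk. For example: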

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0"}

page = requests.get('https://www.transfermarkt.co.uk/manchester-united/startseite/verein/985', headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')

playerdata = soup.find_all(class_='posrela')
names = [';'.join(pt.findAll(text=True)) for pt in playerdata]

df = pd.DataFrame(names)
df = pd.DataFrame([sub.split(";") for sub in names])

print(df.replace('^$', np.nan, regex=True))
Result:

 python testing5.py
                     0               1                   2                   3
0         David de Gea       D. de Gea              Keeper                None
1        Sergio Romero       S. Romero              Keeper                None
2         Joel Pereira      J. Pereira              Keeper                None
3          Eric Bailly       E. Bailly                             Centre-Back
4      Victor Lindelöf     V. Lindelöf         Centre-Back                None
5          Marcos Rojo         M. Rojo                             Centre-Back
6       Chris Smalling     C. Smalling         Centre-Back                None
7           Phil Jones        P. Jones                             Centre-Back
8          Daley Blind        D. Blind           Left-Back                None
9            Luke Shaw       Luke Shaw           Left-Back                None
10      Matteo Darmian      M. Darmian          Right-Back                None
11    Antonio Valencia     A. Valencia          Right-Back                None
12       Nemanja Matic        N. Matic  Defensive Midfield                None
13     Michael Carrick      M. Carrick                      Defensive Midfield
14          Paul Pogba        P. Pogba    Central Midfield                None
15       Ander Herrera      A. Herrera    Central Midfield                None
16   Marouane Fellaini     M. Fellaini    Central Midfield                None
17        Ashley Young        A. Young       Left Midfield                None
18  Henrikh Mkhitaryan   H. Mkhitaryan  Attacking Midfield                None
19           Juan Mata       Juan Mata  Attacking Midfield                None
20       Jesse Lingard      J. Lingard           Left Wing                None
21       Romelu Lukaku       R. Lukaku      Centre-Forward                None
22     Anthony Martial      A. Martial                   .      Centre-Forward
23     Marcus Rashford     M. Rashford      Centre-Forward                None
24  Zlatan Ibrahimovic  Z. Ibrahimovic                          Centre-Forward
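As you can see, where I scrape empty data, it pushes the data into the wrong cell. You may ask why I have a 4th column: I will be inserting more data into it, but for now I need to clean up the 3rd column.

As you can see, I have tried a regex to replace the blanks with NaN in the first instance. But whatever I do, I can't seem to "select" the empty cells. I can't find them.

When I try to treat "names" as a list, the interpreter tells me it is not a list but a ResultSet.

I wonder if anyone can help. As a programming newcomer I have made good progress, but I have hit a wall.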
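
On the regex attempt in the question: one likely reason it finds nothing is that `^$` only matches a zero-length string, while "blank" text scraped from HTML usually consists of spaces or tabs. The sketch below uses made-up rows; the whitespace cells are an assumption about what the scrape returned, since the raw strings are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the "blank" cells are assumed to hold whitespace, not ''.
rows = [
    ["David de Gea", "D. de Gea", "Keeper", None],
    ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
    ["Marcos Rojo", "M. Rojo", "\t ", "Centre-Back"],
]
df = pd.DataFrame(rows)

# '^$' matches only zero-length strings, so nothing is replaced here.
strict = df.replace(r"^$", np.nan, regex=True)

# r'^\s*$' also matches cells made up entirely of whitespace.
loose = df.replace(r"^\s*$", np.nan, regex=True)

print(strict[2].isna().sum())
print(loose[2].isna().sum())
```

Printing `repr()` of a suspicious cell, for example `repr(df.loc[1, 2])`, is a quick way to check what it really contains before choosing a pattern.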

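You can use post-processing: with loc, copy the non-NaN values from column 3 into column 2, then drop column 3.

I also tried to improve your solution a little, so that each list is fixed before the DataFrame is built.

The loc version first: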

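playerdata = soup.find_all(class_='posrela')
names = [list(pt.findAll(text=True)) for pt in playerdata]
df = pd.DataFrame(names)
# where column 3 is filled, the position landed one cell too far right: copy it back
df.loc[df[3].notnull(), 2] = df[3]
# column 3 is no longer needed
df = df.drop(3, axis=1)
print(df)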

                     0               1                   2
0         David de Gea       D. de Gea              Keeper
1        Sergio Romero       S. Romero              Keeper
2         Joel Pereira      J. Pereira              Keeper
3          Eric Bailly       E. Bailly         Centre-Back
4      Victor Lindelöf     V. Lindelöf         Centre-Back
5          Marcos Rojo         M. Rojo         Centre-Back
6       Chris Smalling     C. Smalling         Centre-Back
7           Phil Jones        P. Jones         Centre-Back
8          Daley Blind        D. Blind           Left-Back
9            Luke Shaw       Luke Shaw           Left-Back
10      Matteo Darmian      M. Darmian          Right-Back
11    Antonio Valencia     A. Valencia          Right-Back
12       Nemanja Matic        N. Matic  Defensive Midfield
13     Michael Carrick      M. Carrick  Defensive Midfield
14          Paul Pogba        P. Pogba    Central Midfield
15       Ander Herrera      A. Herrera    Central Midfield
16   Marouane Fellaini     M. Fellaini    Central Midfield
17        Ashley Young        A. Young       Left Midfield
18  Henrikh Mkhitaryan   H. Mkhitaryan  Attacking Midfield
19           Juan Mata       Juan Mata  Attacking Midfield
20       Jesse Lingard      J. Lingard           Left Wing
21       Romelu Lukaku       R. Lukaku      Centre-Forward
22     Anthony Martial      A. Martial      Centre-Forward
23     Marcus Rashford     M. Rashford      Centre-Forward
24  Zlatan Ibrahimovic  Z. Ibrahimovic      Centre-Forward
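
The loc step can also be tried without hitting the website. This is a minimal sketch on hypothetical lists that stand in for the scraped data (the stray `" "` entries are an assumption about what shifts the position). It also shows that pandas pads the shorter rows with a missing value in column 3, which is what `notnull()` relies on.

```python
import pandas as pd

# Hypothetical stand-in for the scraped lists: some rows carry a stray
# whitespace entry that pushes the position into a fourth slot.
names = [
    ["David de Gea", "D. de Gea", "Keeper"],
    ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
    ["Paul Pogba", "P. Pogba", "Central Midfield"],
    ["Phil Jones", "P. Jones", " ", "Centre-Back"],
]
df = pd.DataFrame(names)  # three-item rows get a missing value in column 3

mask = df[3].notnull()             # True where the position was shifted right
df.loc[mask, 2] = df.loc[mask, 3]  # move the position back into column 2
df = df.drop(columns=3)

positions = df[2].tolist()
print(positions)
```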

The improved loop, which fixes each list before the DataFrame is built:

playerdata = soup.find_all(class_='posrela')

names = []
for pt in playerdata:
    L = list(pt.findAll(text=True))
    # check the length of the list
    if len(L) == 4:
        # move the 4th value into the 3rd position
        L[2] = L[3]
    # append the first 3 values of the list
    names.append(L[:3])

df = pd.DataFrame(names)
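
A possible variant of the same idea, shown only as a sketch, is to drop whitespace-only strings from each list rather than checking its length. The helper name `drop_blank_strings` and the sample lists are hypothetical.

```python
def drop_blank_strings(parts):
    """Return the stripped strings in `parts` that are not whitespace-only."""
    return [p.strip() for p in parts if p.strip()]

# Hypothetical stand-in for the text found in each 'posrela' element.
raw = [
    ["David de Gea", "D. de Gea", "Keeper"],
    ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
    ["Zlatan Ibrahimovic", "Z. Ibrahimovic", "\n", "Centre-Forward"],
]
names = [drop_blank_strings(parts) for parts in raw]
print(names)
```

A stray non-blank entry, such as the "." in row 22 of the question's output, would survive this filter, so for this page the length check is probably the more robust of the two.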


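If you are going to extract more data, I suggest extracting all of it in an order that is easy to put into a DataFrame. Unless the data is extracted in the right format, you will keep having to run unnecessary cleanup operations.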

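playerdata = soup.find_all(class_='inline-table')

# one [name, short name, position] list per player table
# note: under Python 3, renderContents() returns bytes, so decode before use
names = [[x.find('img')['title'],
          x.find_all(class_='spielprofil_tooltip')[-1].renderContents().decode('utf-8'),
          x.find_all('tr')[-1].find('td').renderContents().decode('utf-8')] for x in playerdata]

df = pd.DataFrame(names, columns=['Name', 'Short', 'Position'])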


                  Name            Short            Position
0         David de Gea        D. de Gea              Keeper
1        Sergio Romero        S. Romero              Keeper
2         Joel Pereira       J. Pereira              Keeper
3          Eric Bailly        E. Bailly         Centre-Back
4      Victor Lindelöf      V. Lindelöf         Centre-Back
5          Marcos Rojo          M. Rojo         Centre-Back
6       Chris Smalling      C. Smalling         Centre-Back
7           Phil Jones         P. Jones         Centre-Back
8          Daley Blind         D. Blind           Left-Back
9            Luke Shaw        Luke Shaw           Left-Back
10      Matteo Darmian       M. Darmian          Right-Back
11    Antonio Valencia      A. Valencia          Right-Back
12       Nemanja Matic         N. Matic  Defensive Midfield
13     Michael Carrick       M. Carrick  Defensive Midfield
14          Paul Pogba         P. Pogba    Central Midfield
15       Ander Herrera       A. Herrera    Central Midfield
16   Marouane Fellaini      M. Fellaini    Central Midfield
17        Ashley Young         A. Young       Left Midfield
18  Henrikh Mkhitaryan    H. Mkhitaryan  Attacking Midfield
19           Juan Mata        Juan Mata  Attacking Midfield
20       Jesse Lingard       J. Lingard           Left Wing
21       Romelu Lukaku        R. Lukaku      Centre-Forward
22     Anthony Martial       A. Martial      Centre-Forward
23     Marcus Rashford      M. Rashford      Centre-Forward
24  Zlatan Ibrahimovic   Z. Ibrahimovic      Centre-Forward
25       Romelu Lukaku    Romelu Lukaku      Centre-Forward
26          Paul Pogba       Paul Pogba    Central Midfield
27     Anthony Martial  Anthony Martial      Centre-Forward
28     Marcus Rashford  Marcus Rashford      Centre-Forward
29         Eric Bailly      Eric Bailly         Centre-Back
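
The same "extract in the right shape first" idea can be tried offline. The sketch below runs on a small hypothetical HTML fragment that only loosely imitates the class names used above. The real page markup may differ, and `get_text(strip=True)` is swapped in for `renderContents()` so the values come back as plain strings.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup, loosely modelled on the class names used above.
html = """
<table class="inline-table">
  <tr>
    <td rowspan="2"><img title="David de Gea" src="p1.jpg"></td>
    <td><a class="spielprofil_tooltip" href="#">David de Gea</a>
        <a class="spielprofil_tooltip" href="#">D. de Gea</a></td>
  </tr>
  <tr><td>Keeper</td></tr>
</table>
<table class="inline-table">
  <tr>
    <td rowspan="2"><img title="Eric Bailly" src="p2.jpg"></td>
    <td><a class="spielprofil_tooltip" href="#">Eric Bailly</a>
        <a class="spielprofil_tooltip" href="#">E. Bailly</a></td>
  </tr>
  <tr><td>Centre-Back</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Build one record per player table, with named fields from the start.
records = []
for table in soup.find_all(class_="inline-table"):
    records.append({
        "Name": table.find("img")["title"],
        "Short": table.find_all(class_="spielprofil_tooltip")[-1].get_text(strip=True),
        "Position": table.find_all("tr")[-1].find("td").get_text(strip=True),
    })

df = pd.DataFrame(records)
print(df)
```

Using a list of dicts means the column names travel with the data, so adding a field later does not require renumbering columns.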

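Thanks, this works perfectly. Now let me think it through. I have used loc before to pick out "cells", but the rest needs some thought. **Just noticed your edit, thanks again.

Glad I could help. I also improved your solution a little and added another one. Have a nice weekend!

Great answer. I did struggle with beautifulsoup (as you rightly suggested) to get the right source data in the first place. Clearly I didn't do very well! Not a great choice of selector. Still, I suspect you are 100% right that getting the right source first is the more efficient way to do things, thanks.

@charliedontsurf, glad to help! I expect you will have to scrape more data. If you use Chrome, I like right-click > Inspect; it is the best tool for web scraping. You can move down the tree and highlight the matching places on the page. Then try filtering them level by level with bs4 :)

I have another question (I've made good progress!): some of the items my scrape returns sometimes have a b' prefix and sometimes look like [b'...'], and I don't know why! I'd like to clean up the results before continuing. I guess it has something to do with the type of data I'm scraping...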
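
On the `b'...'` prefix in the last comment: a likely source is `renderContents()`, which under Python 3 returns bytes, not str. The `[b'...']` form is presumably a one-element list of bytes, which is an assumption because the scraping code behind it is not shown. A minimal cleanup sketch, with a hypothetical helper `to_text`:

```python
def to_text(value, encoding="utf-8"):
    """Turn bytes, or a one-element list of bytes, into a plain str."""
    if isinstance(value, (list, tuple)) and len(value) == 1:
        value = value[0]          # unwrap [b'...'] to b'...'
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value                  # already a str (or something else): leave it

# Hypothetical mix of values like those described in the comment.
scraped = ["David de Gea", b"D. de Gea", [b"Keeper"]]
cleaned = [to_text(item) for item in scraped]
print(cleaned)
```

Where possible, it is simpler to avoid the bytes altogether by reading text with `get_text()` on the tag.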