Python 3.x 使用BeautifulSoup和Requests和Pandas从中的中刮取数据

Python 3.x 使用BeautifulSoup和Requests和Pandas从中的中刮取数据,python-3.x,pandas,beautifulsoup,python-requests-html,Python 3.x,Pandas,Beautifulsoup,Python Requests Html,我试图从这段HTML代码中的中间提取T和0-0以及2 OT。我开始写下面的代码,但太多的新手,无法理解它。谢谢你的帮助 <div class ="sidearm-schedule-game-details flex item-1 columns"> == $0 <div class="sidearm-schedule-game-result text-italic"> == $0 <span></span

我试图从这段HTML代码中的中间提取T和0-0以及2 OT。我开始写下面的代码,但太多的新手,无法理解它。谢谢你的帮助


    <div class ="sidearm-schedule-game-details flex item-1 columns"> == $0
        <div class="sidearm-schedule-game-result text-italic"> == $0
            <span></span>
            <span>T,</span>
            <span>0-0</span>
            <span>(2 OT)</span>
        </div>


我想你在寻找类似的东西:

import requests
import pandas as pd
from pandas import ExcelWriter
from bs4 import BeautifulSoup


url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school,'lxml')

rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

sheet = pd.DataFrame()
for row in rows:
    result = row.find('div',class_="sidearm-schedule-game-result").text.strip().replace('\n', ', ')
    df = pd.DataFrame([[result]], columns=['result'])
    sheet = sheet.append(df).reset_index(drop=True)
这将导致工作表的内容如下所示:

           result
0          L, 1-2
1     L, 1-2 (OT)
2          W, 1-0
3          W, 1-0
4          L, 1-2
5   W, 1-0 (2 OT)
6   T, 0-0 (2 OT)
7          W, 3-0
8     L, 2-3 (OT)
9     W, 2-1 (OT)
10         W, 1-0
11         W, 1-0
12         L, 0-1
13  T, 0-0 (2 OT)
14         L, 0-1
15         W, 1-0
16         L, 0-1
17         W, 3-1
18         L, 1-2

我想你在寻找类似的东西:

import requests
import pandas as pd
from pandas import ExcelWriter
from bs4 import BeautifulSoup


url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school,'lxml')

rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

sheet = pd.DataFrame()
for row in rows:
    result = row.find('div',class_="sidearm-schedule-game-result").text.strip().replace('\n', ', ')
    df = pd.DataFrame([[result]], columns=['result'])
    sheet = sheet.append(df).reset_index(drop=True)
这将导致工作表的内容如下所示:

           result
0          L, 1-2
1     L, 1-2 (OT)
2          W, 1-0
3          W, 1-0
4          L, 1-2
5   W, 1-0 (2 OT)
6   T, 0-0 (2 OT)
7          W, 3-0
8     L, 2-3 (OT)
9     W, 2-1 (OT)
10         W, 1-0
11         W, 1-0
12         L, 0-1
13  T, 0-0 (2 OT)
14         L, 0-1
15         W, 1-0
16         L, 0-1
17         W, 3-1
18         L, 1-2

仅使用xpath,我将执行以下操作:

    a = html.xpath('//div[@class, "sidearm-schedule-game-result"]')
    #select all nodes that start with a <div> and have "sidearm-schedule-game-result" in the class.
    for each in a:
         b = each.xpath('.//span/text()')
         #the './/' will only look at subelements of what you selected earlier and text() will extract the text from that field.
         print(b)

仅使用xpath,我将执行以下操作:

    a = html.xpath('//div[@class, "sidearm-schedule-game-result"]')
    #select all nodes that start with a <div> and have "sidearm-schedule-game-result" in the class.
    for each in a:
         b = each.xpath('.//span/text()')
         #the './/' will only look at subelements of what you selected earlier and text() will extract the text from that field.
         print(b)
您可以使用re模块解析s中的文本,并将每个信息存储在单独的列Result、Score、OT中

例如:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school,'lxml')

rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

data = []
for row in rows:
    opponent = row.select_one('.sidearm-schedule-game-opponent-logo img')['alt'].rsplit(maxsplit=1)[0]
    name_date = row.select_one('.sidearm-schedule-game-opponent-name a')['aria-label']

    result = re.findall(r'([A-Z]),\s+([\d-]+)\s*(.*)', row.select_one('.sidearm-schedule-game-result').get_text(strip=True, separator=' '))[0]

    data.append([opponent, *result, name_date])

df = pd.DataFrame(data, columns=['Name', 'Result', 'Score', 'OT', 'Info'])
print(df)
印刷品:

                            Name Result Score      OT                                             Info
0      University of Connecticut      L   1-2                                UConn on August 24 7 p.m.
1              Drexel University      L   1-2    (OT)                       Drexel on August 27 7 p.m.
2   George Washington University      W   1-0                  George Washington on September 1 4 p.m.
3          St. John's University      W   1-0                      St. John's on September 4 7:30 p.m.
4          Binghamton University      L   1-2                         Binghamton on September 7 8 p.m.
5               Rider University      W   1-0  (2 OT)                     Rider on September 11 7 p.m.
6     University of Pennsylvania      T   0-0  (2 OT)                      Penn on September 15 6 p.m.
7                           Army      W   3-0                              Army on September 22 7 p.m.
8             Cornell University      L   2-3    (OT)                   Cornell on September 25 7 p.m.
9              Boston University      W   2-1    (OT)                  Boston U on September 29 4 p.m.
10            Colgate University      W   1-0                              Colgate on October 3 7 p.m.
11   United States Naval Academy      W   1-0                                 Navy on October 6 6 p.m.
12             Lafayette College      L   0-1                          Lafayette on October 13 12 p.m.
13             Dartmouth College      T   0-0  (2 OT)                   Dartmouth on October 16 6 p.m.
14           American University      L   0-1                            American on October 20 6 p.m.
15           Bucknell University      W   1-0                            Bucknell on October 24 7 p.m.
16       Loyola University (Md.)      L   0-1                        Loyola (Md.) on October 27 3 p.m.
17                    Holy Cross      W   3-1                          Holy Cross on November 3 6 p.m.
18            Colgate University      L   1-2          No. 3 Colgate (Semifinals) on November 9 7 p.m.
您可以使用re模块解析s中的文本,并将每个信息存储在单独的列Result、Score、OT中

例如:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school,'lxml')

rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

data = []
for row in rows:
    opponent = row.select_one('.sidearm-schedule-game-opponent-logo img')['alt'].rsplit(maxsplit=1)[0]
    name_date = row.select_one('.sidearm-schedule-game-opponent-name a')['aria-label']

    result = re.findall(r'([A-Z]),\s+([\d-]+)\s*(.*)', row.select_one('.sidearm-schedule-game-result').get_text(strip=True, separator=' '))[0]

    data.append([opponent, *result, name_date])

df = pd.DataFrame(data, columns=['Name', 'Result', 'Score', 'OT', 'Info'])
print(df)
印刷品:

                            Name Result Score      OT                                             Info
0      University of Connecticut      L   1-2                                UConn on August 24 7 p.m.
1              Drexel University      L   1-2    (OT)                       Drexel on August 27 7 p.m.
2   George Washington University      W   1-0                  George Washington on September 1 4 p.m.
3          St. John's University      W   1-0                      St. John's on September 4 7:30 p.m.
4          Binghamton University      L   1-2                         Binghamton on September 7 8 p.m.
5               Rider University      W   1-0  (2 OT)                     Rider on September 11 7 p.m.
6     University of Pennsylvania      T   0-0  (2 OT)                      Penn on September 15 6 p.m.
7                           Army      W   3-0                              Army on September 22 7 p.m.
8             Cornell University      L   2-3    (OT)                   Cornell on September 25 7 p.m.
9              Boston University      W   2-1    (OT)                  Boston U on September 29 4 p.m.
10            Colgate University      W   1-0                              Colgate on October 3 7 p.m.
11   United States Naval Academy      W   1-0                                 Navy on October 6 6 p.m.
12             Lafayette College      L   0-1                          Lafayette on October 13 12 p.m.
13             Dartmouth College      T   0-0  (2 OT)                   Dartmouth on October 16 6 p.m.
14           American University      L   0-1                            American on October 20 6 p.m.
15           Bucknell University      W   1-0                            Bucknell on October 24 7 p.m.
16       Loyola University (Md.)      L   0-1                        Loyola (Md.) on October 27 3 p.m.
17                    Holy Cross      W   3-1                          Holy Cross on November 3 6 p.m.
18            Colgate University      L   1-2          No. 3 Colgate (Semifinals) on November 9 7 p.m.