Web scraping: BeautifulSoup cannot parse the contents of a table


I am trying to extract data from the table at the link below.

This is what I have tried, but the result is always empty:

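from bs4 import BeautifulSoup
from selenium import webdriver

# chrome_options is configured elsewhere (not shown in the question)
wd = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
wd.get("https://www.chp.ca.gov/traffic")
html = wd.page_source
soup = BeautifulSoup(html, "lxml")
l = []
div = soup.find("div" , {"id": "pnlIncidents"})
table = div.find("table", {"id":"gvIncidents"})

# collect the text of every table row
for row in table.find_all("tr"):
    l.append(row.text)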
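HTML (as rendered):

Details  No.    Time      Type                    Location                                 Location Desc.                  Area
Details  00082  9:35 AM   Hit and Run w/Injuries  Nb Sr99 Jno Merle Haggard Dr             NB SR99 JNO Merle Haggard Dr    Bakersfield
Details  00002  12:00 AM  Traffic Advisory        Bakersfield Traffic Advisories           Bakersfield Traffic Advisories  BF
Details  00091  11:02 AM  CLOSURE of a Road       Cerro Noroeste Rd / Klipstein Canyon Rd                                  Fort Tejon
Details  00074  10:15 AM  CLOSURE of a Road       Klipstein Canyon Rd / Sr166                                              Buttonwillow
Details  00073  10:14 AM  CLOSURE of a Road       Mil Potrero Hwy / Cerro Noroeste Rd                                      Fort Tejon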

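I have left the requests code in as comments; you can uncomment it to fetch the data directly from the website. Because the site is not accessible from my location, I used the HTML you provided instead, as shown below: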

#import requests
import pandas as pd

html = ''' 
<div id="pnlIncidents" style="overflow-y:scroll;">


                    <div>
        <table tabindex="1" cellspacing="0" rules="rows" border="1" id="gvIncidents" style="border-collapse:collapse;">
            <tbody><tr class="gvHeader" style="white-space:nowrap;">
                <th tabindex="1" scope="col">Details</th><th tabindex="1" scope="col">No.</th><th tabindex="1" scope="col" style="white-space:nowrap;">Time</th><th tabindex="1" scope="col">Type</th><th tabindex="1" scope="col">Location</th><th tabindex="1" scope="col">Location Desc.</th><th tabindex="1" scope="col">Area</th>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$0')">Details</a></td><td>00082</td><td style="white-space:nowrap;">9:35 AM</td><td>Hit and Run w/Injuries</td><td>Nb Sr99 Jno Merle Haggard Dr</td><td>NB SR99 JNO Merle Haggard Dr</td><td>Bakersfield</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$1')">Details</a></td><td>00002</td><td style="white-space:nowrap;">12:00 AM</td><td>Traffic Advisory</td><td>Bakersfield Traffic Advisories</td><td>Bakersfield Traffic Advisories</td><td>BF</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$2')">Details</a></td><td>00091</td><td style="white-space:nowrap;">11:02 AM</td><td>CLOSURE of a Road</td><td>Cerro Noroeste Rd / Klipstein Canyon Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$3')">Details</a></td><td>00074</td><td style="white-space:nowrap;">10:15 AM</td><td>CLOSURE of a Road</td><td>Klipstein Canyon Rd / Sr166</td><td>&nbsp;</td><td>Buttonwillow</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$4')">Details</a></td><td>00073</td><td style="white-space:nowrap;">10:14 AM</td><td>CLOSURE of a Road</td><td>Mil Potrero Hwy / Cerro Noroeste Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr>
        </tbody></table>
    </div>


</div>
'''
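# To fetch the live page instead, uncomment the requests import above and these two lines:
#url = 'Enter your URL'
#html = requests.get(url).text

# read_html returns one DataFrame per <table> it finds.
# Wrapping the string in StringIO avoids the pandas >= 2.1 deprecation warning for literal HTML input.
from io import StringIO
df_list = pd.read_html(StringIO(html))
df = df_list[-1]
print(df)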
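One thing to watch for with read_html on this table: the incident numbers (for example 00082) are typically inferred as integers, which drops the leading zeros. The following is only a sketch, assuming pandas plus an HTML parser such as lxml is installed, and using a shortened copy of the table with two data rows. `attrs` and `converters` are existing read_html parameters; the shortened markup itself is an assumption made for illustration.

```python
from io import StringIO

import pandas as pd

# Shortened copy of the table from the question (two data rows only).
html = """
<table id="gvIncidents">
  <tr><th>Details</th><th>No.</th><th>Time</th><th>Type</th><th>Area</th></tr>
  <tr><td>Details</td><td>00082</td><td>9:35 AM</td><td>Hit and Run w/Injuries</td><td>Bakersfield</td></tr>
  <tr><td>Details</td><td>00002</td><td>12:00 AM</td><td>Traffic Advisory</td><td>BF</td></tr>
</table>
"""

dfs = pd.read_html(
    StringIO(html),
    attrs={"id": "gvIncidents"},   # only match the table with this id
    header=0,                      # first row holds the column names
    converters={"No.": str},       # keep incident numbers as text
)
df = dfs[0]
print(df[["No.", "Time", "Type", "Area"]])
```

Selecting the table by id also avoids relying on its position in the list returned by read_html, which can change if the page gains or loses other tables.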
Edit: using BeautifulSoup

from bs4 import BeautifulSoup
html = ''' 
<div id="pnlIncidents" style="overflow-y:scroll;">


                    <div>
        <table tabindex="1" cellspacing="0" rules="rows" border="1" id="gvIncidents" style="border-collapse:collapse;">
            <tbody><tr class="gvHeader" style="white-space:nowrap;">
                <th tabindex="1" scope="col">Details</th><th tabindex="1" scope="col">No.</th><th tabindex="1" scope="col" style="white-space:nowrap;">Time</th><th tabindex="1" scope="col">Type</th><th tabindex="1" scope="col">Location</th><th tabindex="1" scope="col">Location Desc.</th><th tabindex="1" scope="col">Area</th>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$0')">Details</a></td><td>00082</td><td style="white-space:nowrap;">9:35 AM</td><td>Hit and Run w/Injuries</td><td>Nb Sr99 Jno Merle Haggard Dr</td><td>NB SR99 JNO Merle Haggard Dr</td><td>Bakersfield</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$1')">Details</a></td><td>00002</td><td style="white-space:nowrap;">12:00 AM</td><td>Traffic Advisory</td><td>Bakersfield Traffic Advisories</td><td>Bakersfield Traffic Advisories</td><td>BF</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$2')">Details</a></td><td>00091</td><td style="white-space:nowrap;">11:02 AM</td><td>CLOSURE of a Road</td><td>Cerro Noroeste Rd / Klipstein Canyon Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$3')">Details</a></td><td>00074</td><td style="white-space:nowrap;">10:15 AM</td><td>CLOSURE of a Road</td><td>Klipstein Canyon Rd / Sr166</td><td>&nbsp;</td><td>Buttonwillow</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$4')">Details</a></td><td>00073</td><td style="white-space:nowrap;">10:14 AM</td><td>CLOSURE of a Road</td><td>Mil Potrero Hwy / Cerro Noroeste Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr>
        </tbody></table>
    </div>


</div>
'''



soup = BeautifulSoup(html, "html.parser")
table = soup.find('table')
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    cells = tr.find_all('td')
    # keep only non-empty cell texts; the header row has no <td>, so it yields an empty list
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)
print(res)
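The loop above drops empty cells, so rows with a blank "Location Desc." end up one item shorter than the others. If the columns need to stay aligned, one possible variation is to keep every cell and pair it with its header. This is a sketch only: `table_to_records` is a hypothetical helper name, and the markup is a shortened copy of the table with simplified attributes.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (two data rows only).
html = """
<table id="gvIncidents">
  <tr class="gvHeader"><th>Details</th><th>No.</th><th>Time</th><th>Type</th><th>Location</th><th>Location Desc.</th><th>Area</th></tr>
  <tr class="gvRow"><td><a href="#">Details</a></td><td>00082</td><td>9:35 AM</td><td>Hit and Run w/Injuries</td><td>Nb Sr99 Jno Merle Haggard Dr</td><td>NB SR99 JNO Merle Haggard Dr</td><td>Bakersfield</td></tr>
  <tr class="gvRow"><td><a href="#">Details</a></td><td>00091</td><td>11:02 AM</td><td>CLOSURE of a Road</td><td>Cerro Noroeste Rd / Klipstein Canyon Rd</td><td>&nbsp;</td><td>Fort Tejon</td></tr>
</table>
"""


def table_to_records(markup, table_id):
    """Return one dict per data row, keyed by the table's header texts."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        # The table is not in the markup that was actually received.
        return []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        # Keep empty cells (as "") so every row has one value per header.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> and is skipped
            records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(html, "gvIncidents")
for record in records:
    print(record["No."], record["Type"], record["Area"])
```

Returning an empty list when the table is missing also makes it easier to tell "the table was not in page_source" apart from "the table was there but had no rows".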

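Comments:

The website you mentioned is currently down. Do you have the HTML?

@PrakharJhudele, I have updated the question; it now includes the HTML.

Thanks! However, I need to use BeautifulSoup to scrape the data.

If you feel your problem has been solved, please click the check mark to the left of my answer to accept it. Feel free to upvote the answer as well.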