Python: picking text out of a table on a web page with BeautifulSoup

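I want to use BeautifulSoup to get the "Model Type" value from a company web page. The value comes from the markup below, which forms two tables displayed side by side on the page.

Updated with the page's source code: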

<TR class=tableheader>
<TD width="12%">&nbsp;</TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%">&nbsp;</TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>

How can I get it? I am using Python 2.7.

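You are looking for the next row, and then the cell at the same position in that row. The latter is the tricky part; we can assume it is always the third column:

import re

header_cell = soup.find(text=re.compile("Model Type"))
value = header_cell.find_next('tr').select('td:nth-of-type(3)')[0].get_text()

If you just ask for the next td, you get the Design Year column instead.

There may well be better ways to reach your cell; for example, if we assume that only one tr row has the class row1, the following gets your value in a single step:

value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()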
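Both snippets above hard-code the third column. If the column order might change, one option is to derive the index from the header row first. The following is only a sketch under some assumptions: it is written for Python 3 with bs4, the helper name `cell_below_header` is made up for illustration, and the question's markup is wrapped in a `<table>` so that it parses on its own.

```python
from bs4 import BeautifulSoup

# Markup from the question, wrapped in a <table> so the sample is self-contained.
HTML = """
<TABLE class=tableforms><TBODY>
<TR class=tableheader>
<TD width="12%">&nbsp;</TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%">&nbsp;</TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE>
"""


def cell_below_header(soup, header):
    """Return the text of the cell directly below the header cell `header`.

    Hypothetical helper: returns None when the header or the cell is missing.
    """
    for row in soup.find_all("tr"):
        cells = row.find_all("td", recursive=False)
        labels = [cell.get_text(strip=True) for cell in cells]
        if header in labels:
            index = labels.index(header)
            next_row = row.find_next_sibling("tr")
            if next_row is None:
                return None
            below = next_row.find_all("td", recursive=False)
            if index < len(below):
                return below[index].get_text(strip=True)
            return None
    return None


soup = BeautifulSoup(HTML, "html.parser")
model_type = cell_below_header(soup, "Model Type")
print(model_type)
```

The lookup matches on the stripped header text, so the trailing space in "Model Type " does not matter, and it does not account for cells that use colspan.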

I think you can do the following:

from bs4 import BeautifulSoup

html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%">&nbsp;</TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%">&nbsp;</TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>

<TR class=tableheader>
<TD width="12%">&nbsp;</TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%">&nbsp;</TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""

soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})

dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()

print dico['Model Type']

It prints:

VIP QB662FG
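The same header-to-value pairing can be written with `zip`, which also extends naturally to tables with more than one data row. This is a minimal sketch rather than a definitive version: it assumes Python 3, that the first row of the table is the header row, and that "(Registered)" only ever appears as a trailing suffix. The names `table_to_records` and `SUFFIX` are illustrative only.

```python
import re

from bs4 import BeautifulSoup

# Markup from the question, reduced to the header row and the data row.
HTML = """
<TABLE class=tableforms><TBODY>
<TR class=tableheader>
<TD width="12%">&nbsp;</TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%">&nbsp;</TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE>
"""

# Assumption: the marker is always a trailing "(Registered)".
SUFFIX = re.compile(r"\s*\(Registered\)\s*$")


def table_to_records(table):
    """Turn a table into a list of {header: value} dicts, one per data row."""
    rows = table.find_all("tr")
    headers = [td.get_text(strip=True) for td in rows[0].find_all("td")]
    records = []
    for row in rows[1:]:
        values = [SUFFIX.sub("", td.get_text(strip=True)) for td in row.find_all("td")]
        # Skip columns whose header is empty (the &nbsp; spacer column).
        records.append({h: v for h, v in zip(headers, values) if h})
    return records


soup = BeautifulSoup(HTML, "html.parser")
records = table_to_records(soup.find("table", class_="tableforms"))
print(records[0]["Model Type"])
```

Because `zip` stops at the shorter sequence, a data row with fewer cells than the header simply yields fewer keys instead of raising an IndexError.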

Find every tr and print its third cell, skipping the first (header) row:

import bs4
data = """
<TR class=tableheader>
<TD width="12%">&nbsp;</TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%">&nbsp;</TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data, "html.parser")
# This sample has no enclosing <table>, so collect the rows directly;
# on the full page, use soup.find('table', {'class': 'tableforms'}).find_all('tr').
rows = soup.find_all('tr')
for i, tr in enumerate(rows):
    if i > 0:  # skip the header row
        cells = tr.find_all('td')
        if len(cells) > 2:  # guard against short rows
            print(cells[2].get_text().replace('(Registered)', '').strip())
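Skipping the header by position stops working once the table has more than one header row, as on the full page, where a "Remarks" row precedes the column headers. A possible alternative is to exclude header rows by class with a CSS selector. This sketch assumes Python 3 and a bs4 version whose `select()` supports `:not()` and `:nth-of-type()` (via the soupsieve package), and it assumes every header row carries the class tableheader.

```python
from bs4 import BeautifulSoup

# Markup from the thread: two header rows followed by one data row.
HTML = """
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%">&nbsp;</TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%">&nbsp;</TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Third cell of every row that is not marked as a header row.
cells = soup.select("tr:not(.tableheader) > td:nth-of-type(3)")
models = [
    cell.get_text(strip=True).replace("(Registered)", "").strip()
    for cell in cells
]
print(models)
```

The child combinator `>` keeps the match to cells that belong directly to the row, which matters when tables are nested as they are on the original page.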
