Parsing table tags in Python
I am trying to extract data from an HTML file using Python (BeautifulSoup). Specifically, I want to extract the contents of a table. Here is the table's HTML:
<table class="radiobutton" id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay" onclick="return false;">
<tbody>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_0" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="1" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_0">Fitting</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_1" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="2" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_1">Material</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_2" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="4" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_2">Appliance</label>
</td>
</tr>
<tr>
<td>
<input checked="checked" id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_3" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="8" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_3">Apparatus</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_4" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="16" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_4">Other procedures</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_5" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="32" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_5">Alternative fuel oils</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_6" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="64" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_6">Other compliance method:</label>
</td>
</tr>
</tbody>
</table>
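How can I print each radio button's label together with its checked attribute?
Example: for the label below, it should print: Fitting
and "checked" for the input tag shown below: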
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_0">Fitting</label>
<input checked="checked" id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_3" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="8"/>
The following code works, but I am looking for a better solution:
from bs4 import BeautifulSoup
from pyparsing import makeHTMLTags

# Raw string so the backslash in the Windows path is not treated as an escape
with open(r'.\ABC.html', 'r') as read_file:
    data = read_file.read()

soup = BeautifulSoup(data, 'html.parser')
table = soup.find("table", attrs={"id": "ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay"})
spotterTag, spotterEndTag = makeHTMLTags("input")
# pyparsing scans text, so pass the table's markup as a string
for spotter in spotterTag.searchString(str(table)):
    if spotter.checked == 'checked':
        label = soup.find("label", attrs={"for": spotter.id})
        print(str(label)[str(label).find('>')+1:str(label).find('<', 2)])
        print(spotter.checked)
I'm not sure I understood you correctly, but do you want to pair up the inputs and the labels? If so, you can use the zip() function. For example (data is the HTML string):
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

print('{:^25} {:^15} {:^15}'.format('Text', 'Value', 'Checked'))
for inp, lbl in zip(soup.select('table#ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay input'),
                    soup.select('table#ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay label')):
    print('{:<25} {:^15} {:^15}'.format(lbl.text, inp['value'], 'checked' if 'checked' in inp.attrs else '-'))
This prints:

          Text                 Value          Checked
Fitting                          1               -
Material                         2               -
Appliance                        4               -
Apparatus                        8            checked
Other procedures                16               -
Alternative fuel oils           32               -
Other compliance method:        64               -
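
The zip() approach pairs inputs and labels by position, so it assumes both lists come back in the same order and have the same length. If that is a concern, one possible variation is to match each input to its label through the label's for attribute, as the code in the question does. The sketch below is only an illustration: the names SAMPLE_HTML, TABLE_ID and radio_states are made up for this example, and the HTML is a shortened two-row copy of the table from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a shortened two-row copy of the table from the question.
SAMPLE_HTML = """
<table class="radiobutton" id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay">
<tr><td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_0" type="radio" value="1" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_0">Fitting</label>
</td></tr>
<tr><td>
<input checked="checked" id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_3" type="radio" value="8" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_3">Apparatus</label>
</td></tr>
</table>
"""

TABLE_ID = "ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay"


def radio_states(html, table_id=TABLE_ID):
    """Return (label text, value, is_checked) for each radio input in the table.

    Hypothetical helper written for this example, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []

    # Map each label's "for" attribute to the label's text.
    labels = {
        lbl["for"]: lbl.get_text(strip=True)
        for lbl in table.find_all("label", attrs={"for": True})
    }

    rows = []
    for inp in table.find_all("input", attrs={"type": "radio"}):
        # Look the label up by the input's id instead of relying on order.
        text = labels.get(inp.get("id"), "")
        rows.append((text, inp.get("value"), inp.has_attr("checked")))
    return rows


rows = radio_states(SAMPLE_HTML)
for text, value, checked in rows:
    print(text, value, "checked" if checked else "-")
```

To print only the selected option, as the question asks, filter the result on the third element of each tuple, for example [text for text, value, checked in rows if checked].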