Python 3 - How to extract all elements within row tags <tr> and append them as rows to a DataFrame?
I am trying to extract the rows from an HTML table and append them to a DataFrame, or write them directly to an Excel spreadsheet. I want to keep the table's original structure, because it maps the physical layout of a matrix system. For example, the data I am trying to extract looks like the table below.
<div id="FA_DSC"><p>Table_Title</p><table border="1" cellpadding="4"style="border: 1px solid #000000; border-collapse: collapse;">
<tr>
<td> </td>
<td> </td>
<td>X68</td>
<td>X20</td>
<td>X17</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>X80</td>
<td>X84</td>
<td>V28</td>
<td>X02</td>
<td>X12</td>
<td> </td>
</tr>
<tr>
<td>X22</td>
<td>X55</td>
<td>V57</td>
<td>U15</td>
<td>V29</td>
<td>X51</td>
<td>X40</td>
</tr>
</table></div>
Using BeautifulSoup, I can find all of the tables I want to extract with the following:
from bs4 import BeautifulSoup

with open(r'D:\yolo\frolo\dolo.html','r') as f:
    contents = f.read()
soup = BeautifulSoup(contents.encode("UTF8"),'lxml')
table = soup.find_all('div',{'id':'table'})
From there, I tried to extract the text:
for i in table:
    for k in i:
        text = i.get_text().split('\n')
        print(text)
But this returns output like the following:
['Table_Title']
['', '', ' ', ' ', 'X68', 'X20', 'X17', ' ', ' ',
'', '', ' ', 'X80', 'X84', 'V28', 'X02', 'X12', ' ',
'', '', 'X22', 'X55', 'V57', 'U15', 'V29', 'X51', 'X40',
'', '', 'X14', 'W05', 'T34', 'U36', 'T38', 'S75', 'X24',
'', '', 'X83', 'X57', 'U48', 'V10', 'T82', 'X04', 'X11',
'', '', ' ', 'X82', 'X59', 'T39', 'X03', 'X18', ' ', '',
'', ' ', ' ', 'X78', 'X15', 'X41', ' ', ' ', '', '']
I also tried:
table.find_all('td')
which returns:
AttributeError: ResultSet object has no attribute 'find_all'.
You're probably treating a list of items like a single item.
Did you call find_all() when you meant to call find()?
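For context, this error arises because find_all returns a ResultSet (a list-like collection of tags), and only individual tags have a find_all method. Below is a minimal sketch of the two usual ways around it. The shortened markup and the variable names are made up for illustration, and it uses the built-in "html.parser" rather than lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modelled on the table in the question.
html = """
<div id="FA_DSC"><p>Table_Title</p><table>
<tr><td> </td><td>X68</td><td>X20</td></tr>
<tr><td>X22</td><td>X55</td><td>V57</td></tr>
</table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: find_all returns a ResultSet, so loop over it and search each tag.
divs = soup.find_all("div", {"id": "FA_DSC"})
cells_via_loop = [td.get_text() for div in divs for td in div.find_all("td")]

# Option 2: find returns a single Tag (or None), which does have find_all.
div = soup.find("div", {"id": "FA_DSC"})
cells_via_find = [td.get_text() for td in div.find_all("td")]

print(cells_via_loop)
```

Both options collect the same cells here; which one fits depends on whether the page contains one matching div or several.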
The closest I have come to the data is by using
k.contents
I also tried a regular expression:
print(re.findall("<tr>(.*?)</tr>", "".join(k.contents)))
In summary, this is my initial code, and I would appreciate some guidance on where to go from here:
with open(r'D:\yolo\frolo\dolo.html','r') as f:
    contents = f.read()
soup = BeautifulSoup(contents.encode("UTF8"),'lxml')
table = soup.find_all('div',{'id':'table'})
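I am new to BeautifulSoup and HTML and hope someone can help with extracting these rows. Does BeautifulSoup have a function that can be used to extract a table row by row?
I hope I have conveyed this clearly, and I apologize for the lengthy post. I am just trying to provide enough information for everyone to help me solve the problem.
This will store each table's data in its own list, and the data of each row under that table in its own list: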
from bs4 import BeautifulSoup
html = """
<div id="FA_DSC"><p>Table_Title</p><table border="1" cellpadding="4"style="border: 1px solid #000000; border-collapse: collapse;">
<tr>
<td> </td>
<td> </td>
<td>X68</td>
<td>X20</td>
<td>X17</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>X80</td>
<td>X84</td>
<td>V28</td>
<td>X02</td>
<td>X12</td>
<td> </td>
</tr>
<tr>
<td>X22</td>
<td>X55</td>
<td>V57</td>
<td>U15</td>
<td>V29</td>
<td>X51</td>
<td>X40</td>
</tr>
</table></div>
"""
soup = BeautifulSoup(html, 'lxml')

data = []
for table in soup.select('table'):
    table_data = []  # one list per table
    data.append(table_data)
    for tr in table.select('tr'):
        row_data = []  # one list per row
        table_data.append(row_data)
        for td in tr.select('td'):
            row_data.append(td.get_text())

print(data)
Output:
[[[' ', ' ', 'X68', 'X20', 'X17', ' ', ' '], [' ', 'X80', 'X84', 'V28', 'X02', 'X12', ' '], ['X22', 'X55', 'V57', 'U15', 'V29', 'X51', 'X40']]]
You can also use pd.read_html, which parses HTML tables directly into DataFrames:
import pandas as pd
html="""<div id="FA_DSC"><p>Table_Title</p><table border="1" cellpadding="4"style="border: 1px solid #000000; border-collapse: collapse;">
<tr>
<td> </td>
<td> </td>
<td>X68</td>
<td>X20</td>
<td>X17</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>X80</td>
<td>X84</td>
<td>V28</td>
<td>X02</td>
<td>X12</td>
<td> </td>
</tr>
<tr>
<td>X22</td>
<td>X55</td>
<td>V57</td>
<td>U15</td>
<td>V29</td>
<td>X51</td>
<td>X40</td>
</tr>
</table></div>"""
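from io import StringIO  # recent pandas versions expect a file-like object rather than a literal HTML string

# read_html returns a list of DataFrames, one per <table>; take the first
df = pd.read_html(StringIO(html))[0]
print(df)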
     0    1    2    3    4    5    6
0  NaN  NaN  X68  X20  X17  NaN  NaN
1  NaN  X80  X84  V28  X02  X12  NaN
2  X22  X55  V57  U15  V29  X51  X40
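Since the question also mentions writing the rows to an Excel spreadsheet while keeping the table's layout, here is a rough sketch of one way to go from the `<tr>` rows to a DataFrame and on to a file. It assumes a single table per page, uses the built-in "html.parser" so that lxml is not required, and the output file name `matrix_layout.xlsx` is made up for illustration. Writing .xlsx files also needs an Excel engine such as openpyxl to be installed, so that line is left commented out.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<div id="FA_DSC"><p>Table_Title</p><table border="1" cellpadding="4">
<tr><td> </td><td> </td><td>X68</td><td>X20</td><td>X17</td><td> </td><td> </td></tr>
<tr><td> </td><td>X80</td><td>X84</td><td>V28</td><td>X02</td><td>X12</td><td> </td></tr>
<tr><td>X22</td><td>X55</td><td>V57</td><td>U15</td><td>V29</td><td>X51</td><td>X40</td></tr>
</table></div>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # assumption: only one table is of interest

# One inner list per <tr>, one entry per <td>. Blank cells become None so
# every row keeps all seven positions and the physical layout is preserved.
rows = [
    [td.get_text(strip=True) or None for td in tr.find_all("td")]
    for tr in table.find_all("tr")
]

df = pd.DataFrame(rows)
print(df)

# Hypothetical output path; header=False and index=False keep only the grid.
# df.to_excel("matrix_layout.xlsx", header=False, index=False)
```

Building the rows by hand keeps control over how blank cells are treated, whereas pd.read_html makes that decision for you by turning them into NaN.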