Python: extracting the text between <br/> tags with BeautifulSoup to split it into separate pandas columns

python html pandas beautifulsoup

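I have scraped an HTML data table (see the sample below) and am trying to save it into a pandas DataFrame (df). I can successfully extract each row and parse each HTML <td> column into a separate df column. The problem is that some columns contain multiple data items separated by <br> or <br/>. Each element separated by <br> or <br/> should go into its own df column (see the desired df columns below). For example, the current code outputs the date and time data as 09.10.201918:5020:25 instead of separating the date 09.10.2019, the departure time 18:50 and the arrival time 20:25 into their own df columns.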
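
For reference, the run-together output described above can be reproduced on a single cell. The sketch below assumes the cell text is read with BeautifulSoup's get_text(strip=True); the asker's own code is not shown, so that part is a guess. Passing a separator keeps the pieces apart. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# One cell copied from the sample row: a date and two times separated by <br/>
cell_html = '<td class="liste"><nobr>02.01.2018</nobr>\n<br/>08:45<br/>14:55 </td>'
cell = BeautifulSoup(cell_html, "html.parser").td

# Without a separator the stripped strings are joined back to back
joined = cell.get_text(strip=True)

# With a separator the <br/> boundaries survive and can be split on
parts = cell.get_text("|", strip=True).split("|")

print(joined)
print(parts)
```

The separator only works if the chosen character never occurs in the cell text itself.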

Sample HTML row

<tr valign="top"><td align="right" class="liste_gross">548<br/></td><td class="liste"><nobr>02.01.2018</nobr>
<br/>08:45<br/>14:55 </td><td class="liste_gross"><b>MEL</b><br/></td><td class="liste"><b>Melbourne</b><br/>
Australia<br/>Tullamarine</td><td class="liste_gross"><b>HKG</b><br/></td><td class="liste"><b>Hong Kong</b><br/>
China<br/>International</td><th align="right" class="liste_gross"><table border="0" cellpadding="0" cellspacing="0">
<tr><td align="right">7,420 </td><td>km</td></tr><tr><td align="right">9:25 </td><td>h</td></tr></table>
</th><td class="liste">Cathay Pacific<br/>CX34</td><td class="liste">A350-900<br/>B-LRR</td>
<td class="liste">32A/Window<br/><small>EconomyPlus<br/>Passenger<br/>Personal</small></td><td class="liste">
<br/><select onchange="if (this.value != 'NIL') location.href=this.value;" style="width:60px;">
<option value="NIL">Flight</option><option value="?go=flugdaten_edit&amp;id=14619399&amp;dbpos=0">edit</option>
<option value="NIL">----------</option><option value="?go=flugdaten_loeschen&amp;id=14619399&amp;dbpos=0">delete
</option></select></td></tr>

Desired list of column names

dftable = pd.DataFrame(data, columns = ['flightno', 'date', 'timedept', 'timearr', 'codedept',
                                'citydept', 'countrydept', 'namedept', 'codearr', 'cityarr', 'countryarr',
                                'namearr', 'dist', 'distunits', 'time', 'timeunits', 'airline', 'flightno',
                                'manuf', 'type', 'rego', 'seat', 'loc', 'class', 'pass', 'reason', 'inputcol'])

Here is a regex + SimplifiedDoc solution

import re
from simplified_scrapy.simplified_doc import SimplifiedDoc 
html='''<table cellspacing="2"><tr valign="top"><td align="right" class="liste_gross">548<br/></td><td class="liste"><nobr>02.01.2018</nobr>
<br/>08:45<br/>14:55 </td><td class="liste_gross"><b>MEL</b><br/></td><td class="liste"><b>Melbourne</b><br/>
Australia<br/>Tullamarine</td><td class="liste_gross"><b>HKG</b><br/></td><td class="liste"><b>Hong Kong</b><br/>
China<br/>International</td><th align="right" class="liste_gross"><table border="0" cellpadding="0" cellspacing="0">
<tr><td align="right">7,420 </td><td>km</td></tr><tr><td align="right">9:25 </td><td>h</td></tr></table>
</th><td class="liste">Cathay Pacific<br/>CX34</td><td class="liste">A350-900<br/>B-LRR</td>
<td class="liste">32A/Window<br/><small>EconomyPlus<br/>Passenger<br/>Personal</small></td><td class="liste">
<br/><select onchange="if (this.value != 'NIL') location.href=this.value;" style="width:60px;">
<option value="NIL">Flight</option><option value="?go=flugdaten_edit&amp;id=14619399&amp;dbpos=0">edit</option>
<option value="NIL">----------</option><option value="?go=flugdaten_loeschen&amp;id=14619399&amp;dbpos=0">delete
</option></select></td></tr></table>
'''
doc = SimplifiedDoc(html)
table = doc.getElement('table',attr="cellspacing",value="2")
rows = table.trs # get all rows
data = []
for row in rows:
  arr = []
  # cols = row.tds # get all tds
  cols = row.children # td and th
  i = 0
  while i<len(cols):
    if i==1: # for example
      items = re.split(r'<br\s*/>',cols[i].html)  # raw string avoids the invalid-escape warning for \s
      for item in items:
        arr.append(doc.removeHtml(item))
    elif cols[i].tag=='th': # deal it by yourself
      tds = cols[i].tds
      print (tds)
    else:
      arr.append(cols[i].text)
    i+=1
  data.append(arr)
print (data) # [['548', '02.01.2018', '08:45', '14:55', 'MEL', 'MelbourneAustraliaTullamarine', 'HKG', 'Hong KongChinaInternational', 'Cathay PacificCX34', 'A350-900B-LRR', '32A/WindowEconomyPlusPassengerPersonal', 'Flightedit----------delete']]
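
The same per-cell split can also be sketched with BeautifulSoup alone, which is what the question title asks about. This is an alternative to the answer above, not part of it: it walks the direct td/th children of the row and collects each cell's stripped_strings, so every piece between <br/> tags becomes its own list item, including the distance and duration values inside the nested table. The HTML is a shortened copy of the sample row, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample row (seat and <select> cells left out)
html = """<table><tr valign="top"><td class="liste_gross">548<br/></td><td class="liste"><nobr>02.01.2018</nobr>
<br/>08:45<br/>14:55 </td><td class="liste_gross"><b>MEL</b><br/></td><td class="liste"><b>Melbourne</b><br/>
Australia<br/>Tullamarine</td><th class="liste_gross"><table>
<tr><td>7,420 </td><td>km</td></tr><tr><td>9:25 </td><td>h</td></tr></table>
</th><td class="liste">Cathay Pacific<br/>CX34</td></tr></table>"""

soup = BeautifulSoup(html, "html.parser")
outer_row = soup.find("tr")  # first <tr> in document order is the outer row

row = []
# recursive=False keeps the loop on the outer cells; the nested table's
# cells are still reached through the <th> cell's stripped_strings
for cell in outer_row.find_all(["td", "th"], recursive=False):
    row.extend(cell.stripped_strings)

data = [row]
print(data)
```

A cell that is empty in some rows contributes no items at all, so rows can end up with different lengths. If the <select> cell is included, each of its option labels also becomes a separate item.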

Maybe this can help you. You can parse the HTML table with pd.read_html, as follows:
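from io import StringIO
import re

import pandas as pd
from bs4 import BeautifulSoup

# Parse the saved page; the context manager closes the file afterwards
with open("table.html") as f:
    soup = BeautifulSoup(f, "lxml")
# Replace each <br/> by | so the pieces inside a cell stay separable
s = re.sub(r'<br\s*/>', '|', str(soup))

# read_html returns a list of DataFrames; recent pandas versions expect a file-like object
df_table = pd.read_html(StringIO(s))
# To dataframe (the outer table comes first)
df_table = df_table[0]
# The fifth column holds the arrival airport code (HKG in the sample)
df_table.columns = ['flightno', 'fulldate', 'codedept', 'full_dept', 'codearr', 'full_arr', 'KM', 'date_plane', 'date_plane_2', 'date_pass', 'inputcol']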

#Split columns using value |
df_table[['date','timedept','timearr']] = df_table['fulldate'].str.split('|', expand=True)
df_table[['citydept','countrydept','namedept']] = df_table['full_dept'].str.split('|', expand=True)
df_table[['cityarr','countryarr','namearr']] = df_table['full_arr'].str.split('|', expand=True)
df_table[['airline','flightno']] = df_table['date_plane'].str.split('|', expand=True)
df_table[['manuf','type']] = df_table['date_plane_2'].str.split('|', expand=True)
df_table[['full_seat','class','pass','reason']] = df_table['date_pass'].str.split('|', expand=True)
df_table[['seat', 'loc']] = df_table['full_seat'].str.split('/', expand=True)
#Drop columns not necessary
df_table.drop(['fulldate','full_dept','full_arr','date_plane','date_plane_2','date_pass','full_seat'], axis=1, inplace=True)
#print(df_table)
df_table.to_csv('table_to_csv.csv')


table.html contains:
<!DOCTYPE html>
<html>
<body>
<table  border='1'>
<tr valign="top"><td align="right" class="liste_gross">548<br/></td><td class="liste"><nobr>02.01.2018</nobr>
<br/>08:45<br/>14:55 </td><td class="liste_gross"><b>MEL</b><br/></td><td class="liste"><b>Melbourne</b><br/>
Australia<br/>Tullamarine</td><td class="liste_gross"><b>HKG</b><br/></td><td class="liste"><b>Hong Kong</b><br/>
China<br/>International</td><th align="right" class="liste_gross"><table border="0" cellpadding="0" cellspacing="0">
<tr><td align="right">7,420 </td><td>km</td></tr><tr><td align="right">9:25 </td><td>h</td></tr></table>
</th><td class="liste">Cathay Pacific<br/>CX34</td><td class="liste">A350-900<br/>B-LRR</td>
<td class="liste">32A/Window<br/><small>EconomyPlus<br/>Passenger<br/>Personal</small></td><td class="liste">
<br/><select onchange="if (this.value != 'NIL') location.href=this.value;" style="width:60px;">
<option value="NIL">Flight</option><option value="?go=flugdaten_edit&amp;id=14619399&amp;dbpos=0">edit</option>
<option value="NIL">----------</option><option value="?go=flugdaten_loeschen&amp;id=14619399&amp;dbpos=0">delete
</option></select></td></tr></table>
</body>
</html>


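Can't pandas read the HTML table directly? Can you post your desired output?
Thanks, this got me there. I then added some error checking, because the original data contains artifacts that are not present in the sample.
I was just offering another way.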
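
The error checking mentioned in the comments is not shown in the thread. One possible form, sketched here under the assumption that the artifacts are rows with missing or extra |-separated pieces, is a helper that always returns a fixed number of columns. The helper name split_fixed and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical data: the second row has no arrival time
df = pd.DataFrame({"fulldate": ["02.01.2018 |08:45|14:55 ", "03.01.2018|09:10"]})

def split_fixed(series, names, sep="|"):
    """Split a string column into exactly len(names) columns, padding with missing values."""
    # n caps the number of splits, so surplus pieces stay in the last column
    parts = series.str.split(sep, n=len(names) - 1, expand=True)
    # reindex adds any column that no row was long enough to produce
    parts = parts.reindex(columns=range(len(names)))
    # cast to the string dtype so .str.strip() also works on all-missing columns
    parts = parts.apply(lambda col: col.astype("string").str.strip())
    parts.columns = names
    return parts

df = df.join(split_fixed(df["fulldate"], ["date", "timedept", "timearr"]))
print(df)
```

Assigning the result to a fixed list of column names then no longer fails when a row is short; the missing pieces simply show up as missing values that can be checked afterwards.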