Python: how to split a DataFrame at header rows within the table

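I'm scraping a page where most tables have the format title -- info. I can loop over most of the tables with pandas.read_html and create a separate DataFrame for each piece of information.

However, in some cases the information has been combined into one table with sub-headers. I would like each sub-header's block to become a separate DataFrame, with the text of that row as its title (appended to a list).

Is there a simple way to split this DataFrame? It will always be a header followed by its associated rows, then a new header followed by its associated rows.

For example, the result should be: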

thing
1   1   1
2   2   2

thing2
4   1   2
5   2   3
6   3   4
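It would be nice if people only built web pages whose structure made sense for the data, but that isn't the case here.

I tried iterrows, but couldn't find a good way to produce what I want.

Any help is much appreciated.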

<div class="ranking">
    <h6><a href="javascript:;">Sprint</a></h6>
    <table>
    <tbody>
    
    
    </tbody>
    <tbody>
    <tr>
     <td class="title" colspan="8">Canneto - km 137</td>
    </tr>
    
    <tr>
    <td class="rank"><span title="Rank">1</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">21</span></td>
    <td class="name">
    <a class="10010085859" href="javascript:;">
    <abbr title="Young rider">*</abbr>
    BAGIOLI Nicola
    </a>
    
    </td>
    <td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">5</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">2</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">54</span></td>
    <td class="name">
    <a class="10008688453" href="javascript:;">
    ORSINI Umberto
    </a>
    
    </td>
    <td class="team"><img alt="BARDIANI CSF FAIZANE'" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BCF.png" title="BARDIANI CSF FAIZANE'"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">3</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">3</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">247</span></td>
    <td class="name">
    <a class="10005658114" href="javascript:;">
    ZARDINI Edoardo
    </a>
    
    </td>
    <td class="team"><img alt="VINI ZABU' KTM" src="/Content/images/event/2020/tirreno-adriatico/jerseys/THR.png" title="VINI ZABU' KTM"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">2</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">4</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">63</span></td>
    <td class="name">
    <a class="10003349312" href="javascript:;">
    BODNAR Maciej
    </a>
    
    </td>
    <td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
    
    <td class="noc"><img alt="POL" src="/Content/images/flags/POL.png" title="POL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">1</td>
    
    </tr>
    
    </tbody>
    <tbody>
    <tr>
     <td class="title" colspan="8">Follonica - km 190</td>
    </tr>
    
    <tr>
    <td class="rank"><span title="Rank">1</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">62</span></td>
    <td class="name">
    <a class="10007738055" href="javascript:;">
    ACKERMANN Pascal
    </a>
    
    </td>
    <td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
    
    <td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">12</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">2</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">231</span></td>
    <td class="name">
    <a class="10008656828" href="javascript:;">
    GAVIRIA RENDON Fernando
    </a>
    
    </td>
    <td class="team"><img alt="UAE TEAM EMIRATES" src="/Content/images/event/2020/tirreno-adriatico/jerseys/UAD.png" title="UAE TEAM EMIRATES"/></td>
    
    <td class="noc"><img alt="COL" src="/Content/images/flags/COL.png" title="COL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">10</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">3</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">137</span></td>
    <td class="name">
    <a class="10007506366" href="javascript:;">
    ZABEL Rick
    </a>
    
    </td>
    <td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
    
    <td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">8</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">4</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">91</span></td>
    <td class="name">
    <a class="10008661777" href="javascript:;">
    BALLERINI Davide
    </a>
    
    </td>
    <td class="team"><img alt="DECEUNINCK  -  QUICK - STEP " src="/Content/images/event/2020/tirreno-adriatico/jerseys/DQT.png" title="DECEUNINCK  -  QUICK - STEP "/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">7</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">5</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">12</span></td>
    <td class="name">
    <a class="10007096239" href="javascript:;">
    MERLIER Tim
    </a>
    
    </td>
    <td class="team"><img alt="ALPECIN - FENIX" src="/Content/images/event/2020/tirreno-adriatico/jerseys/AFC.png" title="ALPECIN - FENIX"/></td>
    
    <td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">6</td>
    
    </tr>
    <tr>
    <td class="more" colspan="8"><a href="javascript:;">More...</a></td>
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">6</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">133</span></td>
    <td class="name">
    <a class="10028417041" href="javascript:;">
    CIMOLAI Davide
    </a>
    
    </td>
    <td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">5</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">7</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">213</span></td>
    <td class="name">
    <a class="10007216275" href="javascript:;">
    MANZIN Lorrenzo
    </a>
    
    </td>
    <td class="team"><img alt="TOTAL DIRECT ENERGIE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TDE.png" title="TOTAL DIRECT ENERGIE"/></td>
    
    <td class="noc"><img alt="FRA" src="/Content/images/flags/FRA.png" title="FRA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">4</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">8</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">23</span></td>
    <td class="name">
    <a class="10007744624" href="javascript:;">
    PACIONI Luca
    </a>
    
    </td>
    <td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">3</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">9</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">147</span></td>
    <td class="name">
    <a class="10010946028" href="javascript:;">
    <abbr title="Young rider">*</abbr>
    VERMEERSCH Florian
    </a>
    
    </td>
    <td class="team"><img alt="LOTTO SOUDAL" src="/Content/images/event/2020/tirreno-adriatico/jerseys/LTS.png" title="LOTTO SOUDAL"/></td>
    
    <td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">2</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">10</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">195</span></td>
    <td class="name">
    <a class="10006631548" href="javascript:;">
    TEUNISSEN Mike
    </a>
    
    </td>
    <td class="team"><img alt="JUMBO - VISMA" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TJV.png" title="JUMBO - VISMA"/></td>
    
    <td class="noc"><img alt="NED" src="/Content/images/flags/NED.png" title="NED"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">1</td>
    
    </tr>
    
    </tbody>
    </table>
</div>

Canneto - km 137
Rank  Bib  Points
1     21   5
2     54   3
3     247  2
4     63   1

Follonica - km 190
Rank  Bib  Points
1     62   12
2     231  10
3     137  8
4     91   7
5     12   6
6     133  5
7     213  4
8     23   3
9     147  2
10    195  1
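In the markup above, each sub-header is a td with class "title" that opens its own tbody, so one option is to build the separate DataFrames directly from the HTML instead of splitting a combined one. The following is only a rough sketch under stated assumptions: the HTML string is a shortened stand-in for the real page, SectionParser is a hypothetical helper name, and the standard-library html.parser is used purely to keep the example self-contained (a dedicated HTML library would work just as well).

```python
from html.parser import HTMLParser

import pandas as pd

# Shortened stand-in for the markup in the question (assumption: same layout,
# fewer columns and rows).
HTML = """
<table>
<tbody>
<tr><td class="title" colspan="3">Canneto - km 137</td></tr>
<tr><td class="rank"><span title="Rank">1</span></td><td class="bib">21</td><td class="points">5</td></tr>
<tr><td class="rank"><span title="Rank">2</span></td><td class="bib">54</td><td class="points">3</td></tr>
</tbody>
<tbody>
<tr><td class="title" colspan="3">Follonica - km 190</td></tr>
<tr><td class="rank"><span title="Rank">1</span></td><td class="bib">62</td><td class="points">12</td></tr>
<tr><td class="more" colspan="3"><a href="javascript:;">More...</a></td></tr>
<tr style="display: none;"><td class="rank">6</td><td class="bib">133</td><td class="points">5</td></tr>
</tbody>
</table>
"""


class SectionParser(HTMLParser):
    """Collect table rows per section; a td.title cell starts a new section."""

    def __init__(self):
        super().__init__()
        self.sections = {}   # title text -> list of row dicts
        self.title = None    # title of the section currently being read
        self.row = None      # dict for the <tr> currently being read
        self.cell = None     # class name of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = {}
        elif tag == "td" and self.row is not None:
            self.cell = dict(attrs).get("class") or ""
            self.row[self.cell] = ""

    def handle_data(self, data):
        # Text inside nested tags (span, a, ...) still belongs to the open cell.
        if self.cell is not None and self.row is not None:
            self.row[self.cell] = (self.row[self.cell] + " " + data.strip()).strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self.cell = None
        elif tag == "tr" and self.row is not None:
            if "title" in self.row:
                # A title row opens a new section.
                self.title = self.row["title"]
                self.sections[self.title] = []
            elif self.row and "more" not in self.row and self.title is not None:
                # Ordinary data row; the "More..." toggle row is skipped.
                self.sections[self.title].append(self.row)
            self.row = None


parser = SectionParser()
parser.feed(HTML)

# One DataFrame per sub-header, keyed by the sub-header text.
tables = {title: pd.DataFrame(rows) for title, rows in parser.sections.items()}

for title, table in tables.items():
    print(title)
    print(table)
```

Cells that contain only an image (such as the team and country cells on the real page) would come out empty with this sketch, because it reads text only; their alt or title attributes would need separate handling.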

Identify the header rows, use cumsum() on that mask to build a groupby key, then append each group to a list.

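import pandas as pd

df = pd.DataFrame({'Col1': {0: 'thing', 1: '1', 2: '2', 3: 'thing2', 4: '1', 5: '2', 6: '3'},
                   'Col2': {0: '', 1: 2, 2: 3, 3: '', 4: 2, 5: 3, 6: 4}})

# Header rows have an empty Col2, so the negated boolean cast is True exactly there;
# the running sum then gives every header-plus-rows block its own group id.
gb = df.groupby((~df.Col2.astype(bool)).cumsum())

# Collect one DataFrame per block
dfs = []
for k, g in gb:
    dfs.append(g.copy())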
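
The question also asks for the header text to serve as the title of each piece. A small variation on the groupby idea, offered as a sketch rather than as part of the original answer, keys the pieces by that text in a dict. It assumes, as in the sample data, that a header row is any row whose Col2 is the empty string and that the table starts with a header row; the names is_header and tables are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["thing", "1", "2", "thing2", "1", "2", "3"],
    "Col2": ["", 2, 3, "", 2, 3, 4],
})

# Assumption: a header row is any row whose Col2 is the empty string.
is_header = df["Col2"].eq("")

tables = {}
for _, block in df.groupby(is_header.cumsum()):
    title = block["Col1"].iloc[0]                         # header text of this block
    tables[title] = block.iloc[1:].reset_index(drop=True)  # rows below the header

for title, table in tables.items():
    print(title)
    print(table)
```

If the table could start with data rows before the first header, those rows would form a group of their own and its first data row would be mistaken for a title, so that case would need an extra check.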

You can use np.split().

Output:

  thing   
1     1  2
2     2  3
  thing2   
1      1  2
2      2  3
3      3  4

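Please provide the code used to generate the first DataFrame.

How do you identify the start of a new header? In this sample data it is a string; if the table data are also strings, how would the header be recognized?

The code that generates the first DataFrame is just a web scrape, which is why it is so messy. @deadshot - the header will be text, but everything else in the table is an integer. Of course, because there is text in there, the columns are all strings. I also made a mistake: the other columns are blank and the header is in column 1.

This question is fairly broad. Which part are you having trouble with: identifying the rows that hold headers, or splitting once they are identified?

You should probably include the HTML (a minimal example) as part of your question. It may be worthwhile to create separate DataFrames from the markup rather than trying to split a combined DataFrame.

Did either of these help you or solve your problem?
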
In [42]: dfs[0]
Out[42]: 
    Col1  Col2
0  thing     
1      1     2
2      2     3

In [43]: dfs[1]
Out[43]: 
     Col1  Col2
3  thing2     
4       1     2
5       2     3
6       3     4
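import numpy as np

# Row positions where any cell is the empty string, i.e. the header rows
header_rows = np.where(df.eq(''))[0]
# Split before each header row and discard the empty leading piece
res = [x.reset_index(drop=True) for x in np.split(df, header_rows) if not x.empty]
for x in res:
    # Promote the first row of each piece to column labels, then drop that row
    x = x.rename(columns=x.iloc[0]).drop(x.index[0])
    print(x)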
  thing   
1     1  2
2     2  3
  thing2   
1      1  2
2      2  3
3      3  4
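
Depending on the installed pandas and NumPy versions, passing a DataFrame straight to np.split may emit deprecation warnings. The same split can be done with plain positional slicing. The following is a sketch, not the answerer's code; it assumes the same sample df and the same rule that a header row is one containing an empty-string cell, and the names starts and bounds are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": ["thing", "1", "2", "thing2", "1", "2", "3"],
    "Col2": ["", 2, 3, "", 2, 3, 4],
})

# Positions of rows that contain an empty-string cell (assumed to be headers).
starts = np.flatnonzero(df.eq("").any(axis=1).to_numpy())
# Each block runs from one header position up to the next (or to the end).
bounds = list(starts) + [len(df)]

res = []
for lo, hi in zip(bounds[:-1], bounds[1:]):
    block = df.iloc[lo:hi]
    body = block.iloc[1:].reset_index(drop=True)  # rows below the header
    body.columns = block.iloc[0].tolist()         # header row becomes the labels
    res.append(body)

for table in res:
    print(table)
```

Any rows that appear before the first header are ignored by this version.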