Python: How to split a DataFrame at the header rows inside it
I'm scraping a page where most tables follow a title--info format. I can loop over most of the tables with pandas.read_html and create a separate DataFrame for each piece of information.
However, in some cases the site merges the information into a single table with subheadings. I would like each subheading's block to become its own DataFrame, with the text of the subheading row as the title (appended to a list).
Is there a simple way to split such a DataFrame? It will always be a header followed by its associated rows, then a new header followed by its new associated rows.
For example, the result should be:
thing
1 1 1
2 2 2
thing2
4 1 2
5 2 3
6 3 4
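It would be nice if people only built web pages that made sense as data, but that is not the case here.
I tried iterrows, but couldn't find a good way to build what I want.
Any help is much appreciated.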
<div class="ranking">
<h6><a href="javascript:;">Sprint</a></h6>
<table>
<tbody>
</tbody>
<tbody>
<tr>
<td class="title" colspan="8">Canneto - km 137</td>
</tr>
<tr>
<td class="rank"><span title="Rank">1</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">21</span></td>
<td class="name">
<a class="10010085859" href="javascript:;">
<abbr title="Young rider">*</abbr>
BAGIOLI Nicola
</a>
</td>
<td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">5</td>
</tr>
<tr>
<td class="rank"><span title="Rank">2</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">54</span></td>
<td class="name">
<a class="10008688453" href="javascript:;">
ORSINI Umberto
</a>
</td>
<td class="team"><img alt="BARDIANI CSF FAIZANE'" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BCF.png" title="BARDIANI CSF FAIZANE'"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">3</td>
</tr>
<tr>
<td class="rank"><span title="Rank">3</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">247</span></td>
<td class="name">
<a class="10005658114" href="javascript:;">
ZARDINI Edoardo
</a>
</td>
<td class="team"><img alt="VINI ZABU' KTM" src="/Content/images/event/2020/tirreno-adriatico/jerseys/THR.png" title="VINI ZABU' KTM"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">2</td>
</tr>
<tr>
<td class="rank"><span title="Rank">4</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">63</span></td>
<td class="name">
<a class="10003349312" href="javascript:;">
BODNAR Maciej
</a>
</td>
<td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
<td class="noc"><img alt="POL" src="/Content/images/flags/POL.png" title="POL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">1</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="title" colspan="8">Follonica - km 190</td>
</tr>
<tr>
<td class="rank"><span title="Rank">1</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">62</span></td>
<td class="name">
<a class="10007738055" href="javascript:;">
ACKERMANN Pascal
</a>
</td>
<td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
<td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">12</td>
</tr>
<tr>
<td class="rank"><span title="Rank">2</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">231</span></td>
<td class="name">
<a class="10008656828" href="javascript:;">
GAVIRIA RENDON Fernando
</a>
</td>
<td class="team"><img alt="UAE TEAM EMIRATES" src="/Content/images/event/2020/tirreno-adriatico/jerseys/UAD.png" title="UAE TEAM EMIRATES"/></td>
<td class="noc"><img alt="COL" src="/Content/images/flags/COL.png" title="COL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">10</td>
</tr>
<tr>
<td class="rank"><span title="Rank">3</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">137</span></td>
<td class="name">
<a class="10007506366" href="javascript:;">
ZABEL Rick
</a>
</td>
<td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
<td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">8</td>
</tr>
<tr>
<td class="rank"><span title="Rank">4</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">91</span></td>
<td class="name">
<a class="10008661777" href="javascript:;">
BALLERINI Davide
</a>
</td>
<td class="team"><img alt="DECEUNINCK - QUICK - STEP " src="/Content/images/event/2020/tirreno-adriatico/jerseys/DQT.png" title="DECEUNINCK - QUICK - STEP "/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">7</td>
</tr>
<tr>
<td class="rank"><span title="Rank">5</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">12</span></td>
<td class="name">
<a class="10007096239" href="javascript:;">
MERLIER Tim
</a>
</td>
<td class="team"><img alt="ALPECIN - FENIX" src="/Content/images/event/2020/tirreno-adriatico/jerseys/AFC.png" title="ALPECIN - FENIX"/></td>
<td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">6</td>
</tr>
<tr>
<td class="more" colspan="8"><a href="javascript:;">More...</a></td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">6</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">133</span></td>
<td class="name">
<a class="10028417041" href="javascript:;">
CIMOLAI Davide
</a>
</td>
<td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">5</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">7</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">213</span></td>
<td class="name">
<a class="10007216275" href="javascript:;">
MANZIN Lorrenzo
</a>
</td>
<td class="team"><img alt="TOTAL DIRECT ENERGIE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TDE.png" title="TOTAL DIRECT ENERGIE"/></td>
<td class="noc"><img alt="FRA" src="/Content/images/flags/FRA.png" title="FRA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">4</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">8</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">23</span></td>
<td class="name">
<a class="10007744624" href="javascript:;">
PACIONI Luca
</a>
</td>
<td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">3</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">9</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">147</span></td>
<td class="name">
<a class="10010946028" href="javascript:;">
<abbr title="Young rider">*</abbr>
VERMEERSCH Florian
</a>
</td>
<td class="team"><img alt="LOTTO SOUDAL" src="/Content/images/event/2020/tirreno-adriatico/jerseys/LTS.png" title="LOTTO SOUDAL"/></td>
<td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">2</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">10</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">195</span></td>
<td class="name">
<a class="10006631548" href="javascript:;">
TEUNISSEN Mike
</a>
</td>
<td class="team"><img alt="JUMBO - VISMA" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TJV.png" title="JUMBO - VISMA"/></td>
<td class="noc"><img alt="NED" src="/Content/images/flags/NED.png" title="NED"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">1</td>
</tr>
</tbody>
</table>
</div>
Identify the header rows and use cumsum() to build the key for a groupby, then append each group to a list:
import pandas as pd

df = pd.DataFrame({'Col1': {0: 'thing', 1: '1', 2: '2', 3: 'thing2', 4: '1', 5: '2', 6: '3'},
                   'Col2': {0: '', 1: 2, 2: 3, 3: '', 4: 2, 5: 3, 6: 4}})

# Header rows have an empty Col2; the running count of headers labels each block
gb = df.groupby((~df.Col2.astype(bool)).cumsum())

dfs = []
for k, g in gb:
    dfs.append(g.copy())
You can also use np.split() to do this.
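Please provide the code used to generate the first DataFrame.
How do you identify the start of a new header? In this sample data it is a string; if the table data are also strings, how would you identify the header?
The code that generates the first DataFrame is just a scrape of a web page, which is why it is so messy.
@deadshot - the headers will be text, but everything else in the table is integers. Of course, because there is text in there, the columns are all strings. I also made a mistake: the other columns are all blank, and the header is in column 1.
This question is fairly broad. Which part are you having trouble with: identifying the header rows, or splitting the rows once they are identified?
You should probably include the html (a simple example) as part of your question. It may be worthwhile to create separate DataFrames from the markup rather than trying to split a composite DataFrame.
So, did either answer help, or solve your problem?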
In [42]: dfs[0]
Out[42]:
    Col1 Col2
0  thing
1      1    2
2      2    3

In [43]: dfs[1]
Out[43]:
     Col1 Col2
3  thing2
4       1    2
5       2    3
6       3    4
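In the groups shown above, each block still carries its header as the first row. Since the question asks for the header text to serve as the title of each frame, here is a minimal sketch of one way to do that. It assumes, as stated in the comments, that header rows hold text in Col1 while data rows hold integers, and that the frame starts with a header row. The names `is_header`, `group_id`, and `sections` are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Sample frame in the shape described in the question: header text in Col1,
# the other column blank on header rows.
df = pd.DataFrame({
    "Col1": ["thing", "1", "2", "thing2", "1", "2", "3"],
    "Col2": ["", 2, 3, "", 2, 3, 4],
})

# Assumption: a row is a header when Col1 cannot be parsed as a number.
is_header = pd.to_numeric(df["Col1"], errors="coerce").isna()

# Each header starts a new block, so the running count of headers is the group key.
group_id = is_header.cumsum()

sections = {}  # hypothetical name: maps header text -> DataFrame of its rows
for _, block in df.groupby(group_id):
    title = block["Col1"].iloc[0]                            # first row is the header
    sections[title] = block.iloc[1:].reset_index(drop=True)  # remaining rows are data

for title, frame in sections.items():
    print(title)
    print(frame)
```

A dict keyed by title is only one option. If two subheadings can share the same text, a list of (title, frame) tuples would avoid one block overwriting another.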
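import numpy as np

# Split before every row that contains an empty cell (the header rows).
# Note: applymap is deprecated since pandas 2.1 in favour of DataFrame.map.
res = [x.reset_index(drop=True) for x in np.split(df, np.where(df.applymap(lambda x: x == ''))[0]) if not x.empty]
for x in res:
    # Promote the header row to column names, then drop it from the data
    x = x.rename(columns=x.iloc[0]).drop(x.index[0])
    print(x)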
thing
1 1 2
2 2 3
thing2
1 1 2
2 2 3
3 3 4
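One of the comments suggests building the separate frames straight from the markup instead of splitting a combined DataFrame. The following is a rough sketch of that idea using only the standard-library `html.parser` plus pandas. The sample string is a trimmed-down version of the HTML in the question, and the `SectionParser` class, its attribute names, and the choice to keep only the rank, bib, and points cells are assumptions made for illustration, not something taken from the answers.

```python
from html.parser import HTMLParser

import pandas as pd

# Trimmed-down sample in the same shape as the markup in the question.
HTML = """
<table>
<tbody>
<tr><td class="title" colspan="8">Canneto - km 137</td></tr>
<tr><td class="rank"><span title="Rank">1</span></td>
    <td class="bib"><span title="Bib">21</span></td>
    <td class="points" title="Points">5</td></tr>
<tr><td class="rank"><span title="Rank">2</span></td>
    <td class="bib"><span title="Bib">54</span></td>
    <td class="points" title="Points">3</td></tr>
</tbody>
<tbody>
<tr><td class="title" colspan="8">Follonica - km 190</td></tr>
<tr><td class="rank"><span title="Rank">1</span></td>
    <td class="bib"><span title="Bib">62</span></td>
    <td class="points" title="Points">12</td></tr>
<tr><td class="rank"><span title="Rank">2</span></td>
    <td class="bib"><span title="Bib">231</span></td>
    <td class="points" title="Points">10</td></tr>
</tbody>
</table>
"""


class SectionParser(HTMLParser):
    """Collects rank/bib/points cells, starting a new section at each td.title."""

    WANTED = {"rank", "bib", "points"}

    def __init__(self):
        super().__init__()
        self.sections = {}   # title text -> list of row dicts
        self.current = None  # row list of the section being filled
        self.row = None      # dict for the <tr> being read
        self.cell = None     # class attribute of the <td> being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = {}
        elif tag == "td":
            self.cell = dict(attrs).get("class")

    def handle_data(self, data):
        text = data.strip()
        if not text or self.cell is None:
            return
        if self.cell == "title":
            # A title cell opens a new section keyed by its text.
            self.current = self.sections.setdefault(text, [])
        elif self.cell in self.WANTED and self.row is not None:
            self.row[self.cell] = int(text)

    def handle_endtag(self, tag):
        if tag == "td":
            self.cell = None
        elif tag == "tr":
            # Title rows and "More..." rows leave the dict empty and are skipped.
            if self.row and self.current is not None:
                self.current.append(self.row)
            self.row = None


parser = SectionParser()
parser.feed(HTML)

# One DataFrame per subheading, keyed by the subheading text.
frames = {title: pd.DataFrame(rows) for title, rows in parser.sections.items()}

for title, frame in frames.items():
    print(title)
    print(frame)
```

Working from the markup keeps the title and its rows together from the start, so no header detection on the DataFrame is needed. The rider name, team, and country cells are left out here to keep the sketch short; they would need their own handling because the values sit in nested tags or in `alt` attributes.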