Html: How to scrape a table from a web page and exclude specific <td> tags inside the table
Tags: html, python-3.x, web-scraping, beautifulsoup
I want to scrape a table from a particular web page. The problem is that some of the table's td cells contain a nested span tag with the class tooltip-icon, which carries another nested table. I have included a small sample of the table below. When scraping the whole table, how can I exclude the content belonging to these span tags?
<tr style="font-size:12px;">
<td align="left">Abhanpur</td>
<td align="center">53</td>
<td align="left">
<table>
<tbody>
<tr>
<td>DHANENDRA SAHU</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Assembly Election Result 2013</h3>
<table>
<tbody>
<tr>
<td>Party</td>
<td>:</td>
<td>Indian National Congress</td>
</tr>
<tr>
<td>Result</td>
<td>:</td>
<td>WON</td>
</tr>
<tr>
<td>Margin</td>
<td>:</td>
<td>8354</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td align="left">
<table>
<tbody>
<tr>
<td>Indian National Congress</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Current Assembly Election Result</h3>
<table>
<tbody>
<tr>
<td>Leading In</td>
<td>:</td>
<td>0</td>
</tr>
<tr>
<td>Won In</td>
<td>:</td>
<td>68</td>
</tr>
<tr>
<td>Trailing In</td>
<td>:</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td align="left">CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA</td>
<td align="left">
<table>
<tbody>
<tr>
<td>Bharatiya Janata Party</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Current Assembly Election Result</h3>
<table>
<tbody>
<tr>
<td>Leading In</td>
<td>:</td>
<td>0</td>
</tr>
<tr>
<td>Won In</td>
<td>:</td>
<td>15</td>
</tr>
<tr>
<td>Trailing In</td>
<td>:</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td align="right">23471 </td>
<td align="center">Result Declared</td>
<td align="center" style="background-color: lightgray;">DHANENDRA SAHU</td>
<td align="center" style="background-color: lightgray;">Indian National Congress</td>
<td align="center" style="background-color: lightgray;">8354</td>
The expected output is the scraped table without the span tags and their nested tables. For example:
Abhanpur, 53 , DHANENDRA SAHU, Indian National Congress, CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA, Bharatiya Janata Party , 23471, Result Declared
Any help with this would be greatly appreciated. Thanks.
You can do this with pandas:
import pandas as pd
page = pd.read_html('http://eciresults.nic.in/Statewises26.htm')
my_table = page[5]
I believe this gives you a DataFrame containing the table you are interested in. If you try:
my_table.iloc[[7]]
The output is:
7 Abhanpur 53 DHANENDRA SAHUiAssembly Election Result 2013Pa... Indian National CongressiCurrent Assembly Elec... CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA Bharatiya Janata PartyiCurrent Assembly Electi... 23471 Result Declared DHANENDRA SAHU Indian National Congress 8354 NaN NaN
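If this is what you want, you can clean up the table with standard pandas methods. It is only my preference, but whenever I see <table> tags I parse them with pandas and then manipulate the DataFrame as needed. It also lets you write to a file in one line: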
import pandas as pd
# Collect one cleaned DataFrame per page; DataFrame.append was removed in pandas 2.0,
# so the frames are combined with pd.concat at the end.
frames = []
url_list = [1,2,3,4,5,6,7,8]
# First page: find the pagination row; the row just above it holds the column names.
url = 'http://eciresults.nic.in/Statewises26.htm'
dfs = pd.read_html(url)
df = dfs[0]
idx = df[df[0] == '1\xa02\xa03\xa04\xa05\xa06\xa07\xa08\xa09\xa0Next >>'].index[0]
cols = list(df.iloc[idx-1,:])
df.columns = cols
# Keep only data rows (numeric constituency number) and drop all-empty columns.
df = df[df['Const. No.'].notnull()]
df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
df = df.dropna(axis=1,how='all')
# Cut off the tooltip text that read_html glues onto each cell (it starts with the icon letter 'i').
df['Leading Candidate'] = df['Leading Candidate'].str.split('i',expand=True)[0]
df['Leading Party'] = df['Leading Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly',expand=True)[0]
frames.append(df)
# Remaining pages: same cleaning steps, reusing the column names found on the first page.
for x in url_list:
    url = 'http://eciresults.nic.in/Statewises26%s.htm' %x
    print ('Processed %s' %url)
    dfs = pd.read_html(url)
    df = dfs[0]
    df.columns = cols
    df = df[df['Const. No.'].notnull()]
    df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
    df = df.dropna(axis=1,how='all')
    df['Leading Candidate'] = df['Leading Candidate'].str.split('i',expand=True)[0]
    df['Leading Party'] = df['Leading Party'].str.split('iCurrent',expand=True)[0]
    df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent',expand=True)[0]
    df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly',expand=True)[0]
    frames.append(df)
results_df = pd.concat(frames).reset_index(drop=True)
results_df.to_csv('Chhattisgarh_cand.csv', index=False)
Output:
print (df.to_string())
Constituency Const. No. Leading Candidate Leading Party Trailing Candidate Trailing Party Margin Status Winning Candidate Winning Party Margin
0 Abhanpur 53 DHANENDRA SAHU Indian National Congress CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA Bharatiya Janata Party 23471 Result Declared DHANENDRA SAHU Indian National Congress 8354
1 Ahiwara 67 GURU RUDRA KUMAR Indian National Congress RAJMAHANT SANWLA RAM DAHRE Bharatiya Janata Party 31687 Result Declared RAJMAHNT SANWLA RAM DAHRE Bharatiya Janata Party 31676
2 Akaltara 33 SAURABH SINGH Bharatiya Janata Party RICHA JOGI Bahujan Samaj Party 1854 Result Declared CHUNNILAL SAHU Indian National Congress 21693
3 Ambikapur 10 T.S. BABA Indian National Congress ANURAG SINGH DEO Bharatiya Janata Party 39624 Result Declared T.S.BABA Indian National Congress 19558
4 Antagarh 79 ANOOP NAG Indian National Congress VIKRAM USENDI Bharatiya Janata Party 13414 Result Declared VIKRAM USENDI Bharatiya Janata Party 5171
5 Arang 52 DR. SHIVKUMAR DAHARIYA Indian National Congress SANJAY DHIDHI Bharatiya Janata Party 25077 Result Declared NAVEEN MARKANDEY Bharatiya Janata Party 13774
6 Baikunthpur 3 AMBICA SINGH DEO Indian National Congress BHAIYALAL RAJWADE Bharatiya Janata Party 5339 Result Declared BHAIYALAL RAJWADE Bharatiya Janata Party 1069
7 Balodabazar 45 PRAMOD KUMAR SHARMA Janta Congress Chhattisgarh (J) JANAK RAM VERMA Indian National Congress 2129 Result Declared JANAK RAM VERMA Indian National Congress 9977
8 Basna 40 DEVENDRA BAHADUR SINGH Indian National Congress SAMPAT AGRAWAL Independent 17508 Result Declared RUPKUMARI CHOUDHARY Bharatiya Janata Party 6239
9 Bastar 85 BAGHEL LAKHESHWAR Indian National Congress DR. SUBHAU KASHYAP Bharatiya Janata Party 33471 Result Declared BAGHEL LAKHESHWAR Indian National Congress 19168
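The string splits above depend on the tooltip text always starting with a lowercase "i", which would also truncate any name that happens to contain that letter. As an alternative, here is a minimal sketch that removes the tooltip elements with BeautifulSoup before reading the cell text. It assumes the real page uses the same markup as the sample in the question (span.tooltip-icon followed by div.tooltip); SAMPLE_HTML is a shortened copy of that sample and row_values is a hypothetical helper name, not part of any library.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question, wrapped in an outer table.
SAMPLE_HTML = """
<table>
<tr style="font-size:12px;">
<td align="left">Abhanpur</td>
<td align="center">53</td>
<td align="left">
<table><tbody><tr>
<td>DHANENDRA SAHU</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip"><h3>Assembly Election Result 2013</h3>
<table><tbody><tr><td>Party</td><td>:</td><td>Indian National Congress</td></tr></tbody></table>
</div></td>
</tr></tbody></table>
</td>
<td align="right">23471 </td>
<td align="center">Result Declared</td>
</tr>
</table>
"""


def row_values(html):
    """Return the text of each top-level row, with tooltip content removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop the tooltip icon and the tooltip body (which holds the nested table).
    for node in soup.select("span.tooltip-icon, div.tooltip"):
        node.decompose()
    outer = soup.find("table")
    rows = []
    for tr in outer.find_all("tr"):
        # Skip rows that belong to a nested table rather than the outer one.
        if tr.find_parent("table") is not outer:
            continue
        cells = tr.find_all("td", recursive=False)
        rows.append([td.get_text(" ", strip=True) for td in cells])
    return rows


rows = row_values(SAMPLE_HTML)
print(", ".join(rows[0]))
```

Because the tooltip nodes are removed before any text is read, the candidate and party cells come out without the extra text and no string splitting is needed afterwards.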
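When running the code, I got an error: Traceback (most recent call last): File "C:/Users/vigneshkumark/PycharmProjects/untitled1/new.py", line 8, in <module> idx = df[df[0] == '1 2 3 4 5 6 7 8 9 Next >>'].index[0] File "C:\Users\vigneshkumark\PycharmProjects\untitled1\venv\lib\site-packages\pandas\core\indexes\base.py", line 3958, in __getitem__ return getitem(key) IndexError: index 0 is out of bounds for axis 0 with size 0. Also, in the trailing candidate column of the output I only need the candidate's name; for the 4th row the expected o/p is "ANURAG SINGH DEO". It would be very helpful if you could make the necessary corrections. Thanks @chitown88
What version of pandas do you have installed? pandas 0.24.2 is the version I used.
@VigneshKumar Try again.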
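The IndexError in the comment above is what `.index[0]` raises when the exact-match filter on the pagination string returns no rows. Why it did not match on the commenter's machine is not known from the thread. One way to make the step less brittle, sketched here under assumptions, is to look for the header row by its "Const. No." label instead of by the pagination text. The helper name promote_header and the small stand-in frame are made up for illustration; the real frame would come from pd.read_html.

```python
import pandas as pd


def promote_header(df, marker="Const. No."):
    """Use the first row containing `marker` as column names and keep only data rows."""
    # True for every row in which some cell equals the marker text.
    mask = df.eq(marker).any(axis=1).to_numpy()
    if not mask.any():
        raise ValueError(f"no header row containing {marker!r} found")
    header_pos = int(mask.argmax())  # position of the first matching row
    out = df.iloc[header_pos + 1:].copy()
    out.columns = list(df.iloc[header_pos])
    # Data rows are those whose constituency number consists of digits only.
    is_data = out[marker].map(lambda v: isinstance(v, str) and v.isdigit()).astype(bool)
    return out[is_data].reset_index(drop=True)


# Stand-in for one frame returned by pd.read_html (layout is an assumption).
raw = pd.DataFrame([
    ["Chhattisgarh Result Status", None, None],
    ["Constituency", "Const. No.", "Leading Candidate"],
    ["Abhanpur", "53", "DHANENDRA SAHU"],
    ["Ahiwara", "67", "GURU RUDRA KUMAR"],
    ["1\xa02\xa03 Next >>", None, None],
])
clean = promote_header(raw)
print(clean)
```

If the marker is missing, the helper fails with a message naming what it looked for, which is easier to diagnose than an out-of-bounds index.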