How to scrape a table from a web page and exclude particular <td> tags in the table

Tags: html, python-3.x, web-scraping, beautifulsoup

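I want to scrape a table from a particular web page. The problem is that some of the table's td cells contain a nested span tag, which comes with another nested table.

I want to scrape the table from the following web page.

I have included a small sample of the table I want to scrape, with the nested tables that come with the span tags of class tooltip-icon. When scraping the whole table, how can I exclude the content inside these particular span tags?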

<tr style="font-size:12px;">
<td align="left">Abhanpur</td>
<td align="center">53</td>
<td align="left">
    <table>
        <tbody>
            <tr>
                <td>DHANENDRA SAHU</td>
                <td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
                    <div class="tooltip">
                        <h3>Assembly Election Result 2013</h3>
                        <table>
                            <tbody>
                                <tr>
                                    <td>Party</td>
                                    <td>:</td>
                                    <td>Indian National Congress</td>
                                </tr>
                                <tr>
                                    <td>Result</td>
                                    <td>:</td>
                                    <td>WON</td>
                                </tr>
                                <tr>
                                    <td>Margin</td>
                                    <td>:</td>
                                    <td>8354</td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</td>
<td align="left">
    <table>
        <tbody>
            <tr>
                <td>Indian National Congress</td>
                <td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
                    <div class="tooltip">
                        <h3>Current Assembly Election Result</h3>
                        <table>
                            <tbody>
                                <tr>
                                    <td>Leading In</td>
                                    <td>:</td>
                                    <td>0</td>
                                </tr>
                                <tr>
                                    <td>Won In</td>
                                    <td>:</td>
                                    <td>68</td>
                                </tr>
                                <tr>
                                    <td>Trailing In</td>
                                    <td>:</td>
                                    <td>0</td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</td>
<td align="left">CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA</td>
<td align="left">
    <table>
        <tbody>
            <tr>
                <td>Bharatiya Janata Party</td>
                <td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
                    <div class="tooltip">
                        <h3>Current Assembly Election Result</h3>
                        <table>
                            <tbody>
                                <tr>
                                    <td>Leading In</td>
                                    <td>:</td>
                                    <td>0</td>
                                </tr>
                                <tr>
                                    <td>Won In</td>
                                    <td>:</td>
                                    <td>15</td>
                                </tr>
                                <tr>
                                    <td>Trailing In</td>
                                    <td>:</td>
                                    <td>0</td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</td>
<td align="right">23471 </td>
<td align="center">Result Declared</td>
<td align="center" style="background-color: lightgray;">DHANENDRA SAHU</td>
<td align="center" style="background-color: lightgray;">Indian National Congress</td>
<td align="center" style="background-color: lightgray;">8354</td>
</tr>
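The expected output is the scraped table without the span tags and their nested tables. For example:

Abhanpur, 53, DHANENDRA SAHU, Indian National Congress, CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA, Bharatiya Janata Party, 23471, Result Declared

Any help with this would be much appreciated. Thanks.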
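One way to get this with BeautifulSoup alone is to delete the tooltip elements from the parsed tree before reading the cell text. The following is only a minimal sketch: `SAMPLE_ROW` is a shortened, hypothetical copy of the sample row above (wrapped in a `<table>` so it parses on its own), `row_values` is an illustrative helper name, and the class names `tooltip-icon` and `tooltip` are taken from that sample, so they are assumed to hold for the rest of the page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the sample row from the question.
SAMPLE_ROW = """
<table><tr style="font-size:12px;">
<td align="left">Abhanpur</td>
<td align="center">53</td>
<td align="left">
  <table><tbody><tr>
    <td>DHANENDRA SAHU</td>
    <td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
      <div class="tooltip">
        <h3>Assembly Election Result 2013</h3>
        <table><tbody>
          <tr><td>Party</td><td>:</td><td>Indian National Congress</td></tr>
        </tbody></table>
      </div>
    </td>
  </tr></tbody></table>
</td>
<td align="right">23471 </td>
<td align="center">Result Declared</td>
</tr></table>
"""


def row_values(html):
    """Return the text of the outer row's cells with the tooltip content removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Delete the "i" icon and the tooltip body, including its nested table.
    for tag in soup.find_all(class_=["tooltip-icon", "tooltip"]):
        tag.decompose()

    row = soup.find("tr")
    # recursive=False keeps only the row's own cells, not those of the
    # small wrapper tables nested inside some of them.
    cells = row.find_all("td", recursive=False)
    return [cell.get_text(" ", strip=True) for cell in cells]


values = row_values(SAMPLE_ROW)
print(", ".join(values))
```

For the full page, the same two steps (remove the tooltip elements, then read only the direct `td` children of each row) would be applied to every row of the results table.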

You can do this with pandas:

import pandas as pd
page = pd.read_html('http://eciresults.nic.in/Statewises26.htm')
my_table = page[5]
I believe this gives you a DataFrame containing the table you are interested in. If you try:

my_table.iloc[[7]]
The output is:

7   Abhanpur    53  DHANENDRA SAHUiAssembly Election Result 2013Pa...   Indian National CongressiCurrent Assembly Elec...   CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA    Bharatiya Janata PartyiCurrent Assembly Electi...   23471   Result Declared     DHANENDRA SAHU  Indian National Congress    8354    NaN     NaN

If this is what you want, you can clean up your table using standard pandas methods.
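As a rough illustration of that clean-up, the tooltip text that ends up glued to each cell can be cut off with the `.str` accessor. This is a sketch under assumptions: the small frame below is hypothetical and only imitates the merged cells shown in the output above, and the marker strings are the tooltip headings from the sample row in the question, which may not cover every row of the real table.

```python
import pandas as pd

# Hypothetical frame imitating cells where the tooltip text was merged in.
df = pd.DataFrame({
    "Leading Candidate": [
        "DHANENDRA SAHUiAssembly Election Result 2013Party:Indian National Congress",
        "GURU RUDRA KUMAR",
    ],
    "Leading Party": [
        "Indian National CongressiCurrent Assembly Election ResultLeading In:0",
        "Indian National Congress",
    ],
})

# The "i" icon followed by the tooltip heading marks where the unwanted text starts.
markers = {
    "Leading Candidate": "iAssembly Election Result",
    "Leading Party": "iCurrent Assembly Election Result",
}

for column, marker in markers.items():
    # Keep only the part before the first occurrence of the marker.
    df[column] = df[column].str.split(marker, n=1).str[0].str.strip()

print(df.to_string())
```

Cells without a tooltip are left unchanged, because splitting a string that does not contain the marker returns the whole string as the first piece.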

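This is just my preference, but whenever I see <table> tags, I parse them with pandas and then manipulate the DataFrame as needed. It also lets you write to a file in one line: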

import pandas as pd

BASE_URL = 'http://eciresults.nic.in/Statewises26%s.htm'
# Text of the pager row; on the first page the header row sits directly above it.
PAGER_TEXT = '1\xa02\xa03\xa04\xa05\xa06\xa07\xa08\xa09\xa0Next >>'


def clean_page(raw, cols):
    """Keep the constituency rows and strip the tooltip text from the cells."""
    page = raw.copy()
    page.columns = cols

    # Constituency rows are the ones whose 'Const. No.' is a number.
    page = page[page['Const. No.'].notnull()]
    page = page.loc[page['Const. No.'].str.isdigit()].reset_index(drop=True)
    page = page.dropna(axis=1, how='all')

    # The tooltip text is glued to the cell text; keep only the part before it.
    # Splitting on a bare 'i' relies on candidate names being upper case.
    page['Leading Candidate'] = page['Leading Candidate'].str.split('i', expand=True)[0]
    page['Leading Party'] = page['Leading Party'].str.split('iCurrent', expand=True)[0]
    page['Trailing Party'] = page['Trailing Party'].str.split('iCurrent', expand=True)[0]
    page['Trailing Candidate'] = page['Trailing Candidate'].str.split('iAssembly', expand=True)[0]
    return page


# First page: read the column names from the row above the pager row.
raw = pd.read_html(BASE_URL % '')[0]
idx = raw[raw[0] == PAGER_TEXT].index[0]
cols = list(raw.iloc[idx - 1, :])

df = clean_page(raw, cols)
frames = [df]

# Remaining pages: Statewises261.htm ... Statewises268.htm
for x in range(1, 9):
    url = BASE_URL % x
    print('Processed %s' % url)
    frames.append(clean_page(pd.read_html(url)[0], cols))

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
results_df = pd.concat(frames, ignore_index=True)
results_df.to_csv('Chhattisgarh_cand.csv', index=False)
Output:

print (df.to_string())
  Constituency Const. No.       Leading Candidate                    Leading Party                    Trailing Candidate            Trailing Party Margin           Status          Winning Candidate             Winning Party Margin
0     Abhanpur         53          DHANENDRA SAHU         Indian National Congress  CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA    Bharatiya Janata Party  23471  Result Declared             DHANENDRA SAHU  Indian National Congress   8354
1      Ahiwara         67        GURU RUDRA KUMAR         Indian National Congress            RAJMAHANT SANWLA RAM DAHRE    Bharatiya Janata Party  31687  Result Declared  RAJMAHNT SANWLA RAM DAHRE    Bharatiya Janata Party  31676
2     Akaltara         33           SAURABH SINGH           Bharatiya Janata Party                            RICHA JOGI       Bahujan Samaj Party   1854  Result Declared             CHUNNILAL SAHU  Indian National Congress  21693
3    Ambikapur         10               T.S. BABA         Indian National Congress                      ANURAG SINGH DEO    Bharatiya Janata Party  39624  Result Declared                   T.S.BABA  Indian National Congress  19558
4     Antagarh         79               ANOOP NAG         Indian National Congress                         VIKRAM USENDI    Bharatiya Janata Party  13414  Result Declared              VIKRAM USENDI    Bharatiya Janata Party   5171
5        Arang         52  DR. SHIVKUMAR DAHARIYA         Indian National Congress                         SANJAY DHIDHI    Bharatiya Janata Party  25077  Result Declared           NAVEEN MARKANDEY    Bharatiya Janata Party  13774
6  Baikunthpur          3        AMBICA SINGH DEO         Indian National Congress                     BHAIYALAL RAJWADE    Bharatiya Janata Party   5339  Result Declared          BHAIYALAL RAJWADE    Bharatiya Janata Party   1069
7  Balodabazar         45     PRAMOD KUMAR SHARMA  Janta Congress Chhattisgarh (J)                       JANAK RAM VERMA  Indian National Congress   2129  Result Declared            JANAK RAM VERMA  Indian National Congress   9977
8        Basna         40  DEVENDRA BAHADUR SINGH         Indian National Congress                        SAMPAT AGRAWAL               Independent  17508  Result Declared        RUPKUMARI CHOUDHARY    Bharatiya Janata Party   6239
9       Bastar         85       BAGHEL LAKHESHWAR         Indian National Congress                    DR. SUBHAU KASHYAP    Bharatiya Janata Party  33471  Result Declared          BAGHEL LAKHESHWAR  Indian National Congress  19168

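When I run the code, I get an error:

Traceback (most recent call last):
  File "C:/Users/vigneshkumark/PycharmProjects/untitled1/new.py", line 8, in <module>
    idx = df[df[0] == '1 2 3 4 5 6 7 8 9 Next >>'].index[0]
  File "C:\Users\vigneshkumark\PycharmProjects\untitled1\venv\lib\site-packages\pandas\core\indexes\base.py", line 3958, in __getitem__
    return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0

Also, in the output for the trailing candidate column I need only the candidate's name; for the 4th row the expected output is "ANURAG SINGH DEO". It would be very helpful if you could make the necessary corrections. Thanks @chitown88

What version of pandas do you have installed? pandas 0.24.2 is the version I used

@VigneshKumar try again.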
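The IndexError reported in the comments means the comparison against the pager text matched no rows, so `.index[0]` was called on an empty index. A possible way to make the header lookup less fragile is to search for a known header label instead of the pager row. The sketch below assumes the header row contains the cell 'Const. No.' (the column name used in the answer's code); `find_header_row` and the miniature frame are hypothetical and for illustration only.

```python
import pandas as pd


def find_header_row(frame, label="Const. No."):
    """Return the index of the first row containing `label`, or None if absent."""
    mask = frame.eq(label).any(axis=1)
    return mask.idxmax() if mask.any() else None


# Hypothetical miniature of a frame read without a usable header.
raw = pd.DataFrame([
    ["Chhattisgarh Result Status", None, None],
    ["Constituency", "Const. No.", "Leading Candidate"],
    ["Abhanpur", "53", "DHANENDRA SAHU"],
])

header_idx = find_header_row(raw)
if header_idx is None:
    # Fail with a clear message instead of an IndexError on an empty index.
    raise ValueError("header row not found; the page layout may have changed")

# Use the header row as column names and keep the rows below it.
table = raw.iloc[header_idx + 1:].reset_index(drop=True)
table.columns = list(raw.iloc[header_idx])
print(table.to_string())
```

Checking for a missing match explicitly turns a confusing traceback into an error that says what went wrong.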