Python code to get the HTML data of a table in the source page

python python-2.7

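I'm new to Python and I'm trying to scrape a website. I can log in to the site and fetch an HTML page, but I don't need the whole page. I only need the hyperlinks inside one specific table.

I wrote the code below, but it returns every hyperlink on the page: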

soup = BeautifulSoup(the_page)
for table in soup.findAll('table',{'id':'ctl00_Main_lvMyAccount_Table1'} ):
        for link in soup.findAll('a'):
                print link.get('href')

Can anyone help me figure out where I went wrong?

Here is the HTML of the table:

<table id="ctl00_Main_lvMyAccount_Table1" width="680px">
 <tr id="ctl00_Main_lvMyAccount_Tr1">
    <td id="ctl00_Main_lvMyAccount_Td1">
                        <table id="ctl00_Main_lvMyAccount_itemPlaceholderContainer" border="1" cellspacing="0" cellpadding="3">
        <tr id="ctl00_Main_lvMyAccount_Tr2" style="background-color:#0090dd;">
            <th id="ctl00_Main_lvMyAccount_Th1"></th>
            <th id="ctl00_Main_lvMyAccount_Th2">

                                    <a id="ctl00_Main_lvMyAccount_SortByAcctNum" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')">
                                        <font color=white>
                                            <span id="ctl00_Main_lvMyAccount_AcctNum">Account number</span>
                                        </font>

                                        </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th4">
                                    <a id="ctl00_Main_lvMyAccount_SortByServAdd" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')">
                                    <font color=white>
                                        <span id="ctl00_Main_lvMyAccount_ServiceAddress">Service address</span>
                                    </font>
                                    </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th5">
                                    <a id="ctl00_Main_lvMyAccount_SortByAcctName" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')">
                                    <font color=white>
                                        <span id="ctl00_Main_lvMyAccount_AcctName">Name</span>
                                    </font>
                                    </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th6">
                                    <a id="ctl00_Main_lvMyAccount_SortByStatus" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')">
                                    <font color=white>
                                        <span id="ctl00_Main_lvMyAccount_AcctStatus">Account status</span>
                                    </font>
                                    </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th3"></th>
        </tr>


            <tr>
                <td>

Thanks in advance.

Well, this is the correct way:

soup = BeautifulSoup(the_page)
for table in soup.findAll('table',{'id':'ctl00_Main_lvMyAccount_Table1'}):
    for link in table.findAll('a'): #search for links only in the table
        print link['href'] #get the href attribute

You can also skip the outer loop, because the given id has only one match:

soup = BeautifulSoup(the_page)
table = soup.find('table',{'id':'ctl00_Main_lvMyAccount_Table1'})
for link in table.findAll('a'): #search for links only in the table
    print link['href'] #get the href attribute

Update: noted what @DSM said. Fixed the missing quote in the table assignment.
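
For readers who want to try the scoped search on a small input, here is a minimal sketch. It assumes the bs4 package (BeautifulSoup 4), where find_all is the newer name for findAll, instead of the BeautifulSoup 3 style used in this thread, and it is written so it also runs on Python 3. The sample HTML and the helper name links_in_table are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down page: one link outside the table, two inside it.
HTML = """
<a href="/outside">Outside</a>
<table id="ctl00_Main_lvMyAccount_Table1">
  <tr><th><a href="/sort-by-number">Account number</a></th></tr>
  <tr><td><a href="/select/1">Select</a></td></tr>
</table>
"""


def links_in_table(html, table_id):
    """Return the href of every <a> inside the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        # No table with that id, so there is nothing to search.
        return []
    # Search from the table, not from soup, so outside links are skipped.
    # href=True keeps only anchors that actually carry an href attribute.
    return [a["href"] for a in table.find_all("a", href=True)]


print(links_in_table(HTML, "ctl00_Main_lvMyAccount_Table1"))
```

Returning an empty list when the table is missing is one possible choice; without that check, a wrong id would make soup.find return None and the following call on it would fail.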

Make sure the for loop searches within the table's HTML rather than the soup variable, which holds the HTML of the whole page.

Your nested loop, for link in soup.findAll('a'):, searches the entire HTML page. To search only the links in the table, change that line to:

for link in table.findAll('a'):

Which hyperlinks do you need?

All the href values of all the a anchor tags. I only pasted part of the HTML; there are many of them in the list.

Traceback (most recent call last): File "C:\MiamiDade_Scraping\latest1.py", line 59, in table = the_page.find('table', {'id':'ctl00_Main_lvMyAccount_Table1'}) TypeError: slice indices must be integers or None or have an __index__ method -- I'm getting this error.

What versions of Python and BeautifulSoup are you using?

Thanks, it works. But I still need only a few of the links. Right now I get all the links, but among them I only need the select links, not the delete links. The HTML code for the select link is ...

You're missing a quote before ctl.
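
A note on the TypeError quoted in the comments: the traceback line calls find on the_page, not on the parsed soup. If the_page holds the raw page text (an assumption; the thread does not say what type it is), the call resolves to the built-in str.find, whose second argument is a start index, and that would produce this kind of error. The sketch below reproduces the situation with a made-up one-line page string.

```python
# Hypothetical stand-in for the raw page text fetched after logging in.
the_page = '<table id="ctl00_Main_lvMyAccount_Table1"></table>'

try:
    # str.find(sub, start) expects an integer start index; the attribute
    # dict lands in that slot, so Python raises TypeError.
    the_page.find('table', {'id': 'ctl00_Main_lvMyAccount_Table1'})
    error = None
except TypeError as exc:
    error = exc

# Show which exception type was raised, if any.
print(type(error).__name__)
```

Under that assumption, calling find on the soup object built from the_page, as the answers do, would avoid the error.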