Python does not give a list of links

python html

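Because I need more detailed data, I have to dig deeper into the website's HTML. I wrote a script that returns a list of specific links to detail pages, but I can't get Python to go through every link in the list; it always stops at the first one. What am I doing wrong?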

from BeautifulSoup import BeautifulSoup
import re
import urllib2
from lxml import html
import requests

#Open site
html_page = urllib2.urlopen("http://www.sitetoscrape.ch/somesite.aspx")

#Inform BeautifulSoup
soup = BeautifulSoup(html_page)

#Search for the specific links
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    #print found links
    print link.get('href')
    #complete links
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    #print complete links
    print complete_links
#
#EVERYTHING WORKS FINE TO THIS POINT
#

page = requests.get(complete_links)
tree = html.fromstring(page.text)

#Details
name = tree.xpath('//dl[@class="services"]')

for i in name:
    print i.text_content()

Also: can you recommend a tutorial on how to write the output to a file and clean it up, give the values variable names, and so on?

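I think what you need is a list of links in complete_links rather than a single link. As @Pynchia and @lemonhead said, you are overwriting complete_links on every iteration of the first for loop.

You need to make two changes:

Append each link to a list, and use that list to loop over and scrape every link: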
# [...] Same code here

link_list = []  # accumulates every complete link
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    print link.get('href')
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    print complete_links
    link_list.append(complete_links)  # append new link to the list


Scrape each accumulated link in a second loop:
for link in link_list:
    page = requests.get(link)
    tree = html.fromstring(page.text)

    #Details
    name = tree.xpath('//dl[@class="services"]')

    for i in name:
        print i.text_content()
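
The difference between overwriting a variable and accumulating into a list can be seen in isolation, without any network access. The following is only a minimal sketch, written for Python 3 (the code in this thread is Python 2). The three hrefs are hypothetical stand-ins for what the `findAll` loop would yield, and `urljoin` is swapped in for the string concatenation used above.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.sitetoscrape.ch"  # host taken from the question

# Hypothetical hrefs standing in for the values the findAll loop yields.
hrefs = [
    "/d/part/of/thelink/ineed.aspx?id=1",
    "/d/part/of/thelink/ineed.aspx?id=2",
    "/d/part/of/thelink/ineed.aspx?id=3",
]

# Pitfall: a plain assignment is overwritten on every iteration,
# so only the last link is left once the loop has finished.
for href in hrefs:
    complete_link = urljoin(BASE_URL, href)

# Fix: collect every complete link in a list.
link_list = [urljoin(BASE_URL, href) for href in hrefs]

print(complete_link)   # a single URL, the last one built
print(len(link_list))  # number of URLs collected
```

Any code placed after the first loop only ever sees one URL, which is why the request in the question ran once. Iterating over `link_list` gives one request per detail page.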



PS: I recommend Scrapy for tasks like this.
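
On the side question about cleaning the output and putting it into a file, one possible approach is to normalise whitespace and write one CSV row per detail page. This is a hedged sketch for Python 3 using only the standard library. The helper names `clean_text` and `write_rows`, the column names, the sample text, and the file name `details.csv` are all assumptions made for illustration.

```python
import csv
import io


def clean_text(raw):
    """Collapse runs of whitespace (newlines, tabs, spaces) into single spaces."""
    return " ".join(raw.split())


def write_rows(rows, out_file):
    """Write (url, raw_text) pairs as CSV rows below a header row."""
    writer = csv.writer(out_file)
    writer.writerow(["url", "details"])
    for url, raw in rows:
        writer.writerow([url, clean_text(raw)])


# Hypothetical scraped result standing in for i.text_content().
rows = [
    ("http://www.sitetoscrape.ch/d/part/of/thelink/ineed.aspx?id=1",
     "\n  Service A\n\t  Mon-Fri  \n"),
]

buffer = io.StringIO()  # in-memory file, used here only for demonstration
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write to disk instead:
# with open("details.csv", "w", newline="", encoding="utf-8") as f:
#     write_rows(rows, f)
```

Keeping the cleaning step in its own function makes it easy to extend later, for example to strip labels or split the text into several named columns.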
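Your GET request with complete_links is not inside the for loop, so it runs only once, using the last value complete_links held before the loop exited.

You are overwriting complete_links on every iteration, right? Or do you want a list of links?

Is complete_links perhaps meant to be a list of values to check?

Why are you using both requests and urllib2? And why both BeautifulSoup and lxml? I find that confusing...

Thank you very much! I will definitely spend some time studying the Scrapy docs!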