Generating URLs in Python?


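I'm trying to get all the article links (they happen to be marked with the class 'title may-blank'). I'm trying to figure out why the code below prints a whole bunch of "href=" when I run it, instead of returning the actual URLs. After the 25 failed article URLs (all 'href='), I also get a pile of random text and links, and I'm not sure why that happens, since it should stop once it no longer finds the class 'title may-blank'. Can you help me figure out what's wrong?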

import urllib2

def get_page(page):

    response = urllib2.urlopen(page)
    html = response.read()
    p = str(html)
    return p

def get_next_target(page):
    start_link = page.find('title may-blank')
    start_quote = page.find('"', start_link + 4)
    end_quote = page.find ('"', start_quote + 1)
    aurl = page[start_quote+1:end_quote] # Gets Article URL
    return aurl, end_quote

def print_all_links(page):
    while True:
        aurl, endpos = get_next_target(page)
        if aurl:
            print("%s" % (aurl))
            print("")
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews'

print_all_links(get_page(reddit_url))

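The other answer is correct, but when I'm faced with a problem, I'd rather offer the best way to accomplish the task than a way to fix the existing code. You should use an HTML parser to parse web pages: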

from bs4 import BeautifulSoup
import urllib2

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    soup = BeautifulSoup(html)
    for a in soup.find_all('a', 'title may-blank ', href=True):
        print(a['href'])
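For readers who don't have BeautifulSoup installed, the same parser-based idea can be sketched with the standard-library `html.parser` module. This is a swapped-in technique, not the code from the answer: it assumes Python 3 (the snippets in this thread use Python 2's `urllib2`), and the `LinkCollector` class and the sample markup are hypothetical names made up for illustration. It runs on an inline string, so no network access is needed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that carry all of the wanted classes.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # The class attribute is a whitespace-separated list, so a trailing
        # space such as "title may-blank " does not matter after split().
        classes = set((attr_map.get('class') or '').split())
        href = attr_map.get('href')
        if href and self.wanted <= classes:
            self.links.append(href)


# Hypothetical sample markup, not fetched from any real page.
sample_html = (
    '<a class="title may-blank " href="http://example.com/a">A</a>'
    '<a class="other" href="http://example.com/skip">skip</a>'
    '<a class="title may-blank" href="http://example.com/b">B</a>'
)

collector = LinkCollector(['title', 'may-blank'])
collector.feed(sample_html)
print(collector.links)
```

Because the parser splits the class attribute into individual class names, this version matches regardless of spacing or class order, which is the main advantage over searching the raw text.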

If you are really allergic to HTML parsers, at least use a regular expression (even though you should stick with HTML parsing):

import urllib2
import re

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    for href in re.findall(r'

This is because
start_quote = page.find('"', start_link + 4)

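doesn't do what you think it does.
start_link is set to the index of 'title may-blank'. So if you call page.find starting at start_link + 4, you actually start searching at 'e may-blank'.
If you change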
start_quote = page.find('"', start_link + 4)

to

start_quote = page.find('"', start_link + len('title may-blank') + 1)

it will work.

Why not use something like BeautifulSoup() to scrape the links?

Thanks! I'll try it. I didn't know something like that existed.
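The string-search approach discussed in this thread can be sketched on a sample string as follows. This is one possible variant under stated assumptions, not the code from either answer: it assumes the `href` attribute comes after the class attribute inside the same tag, and it searches for `href="` directly instead of counting quotes. It also checks for `-1`, which is what `str.find` returns when nothing is found. Without that check, `-1 + 4` becomes index 3, so the search silently restarts near the beginning of the remaining text, which would explain the stray text and links printed after the last real match. The sample markup and the `get_all_links` helper are hypothetical.

```python
def get_next_target(page, marker='title may-blank', start=0):
    """Return (url, end_index) for the next marked link, or (None, -1)."""
    start_link = page.find(marker, start)
    if start_link == -1:  # str.find returns -1 when the marker is absent
        return None, -1
    href_key = 'href="'
    # Skip past the whole marker, then look for the href attribute itself.
    start_href = page.find(href_key, start_link + len(marker))
    if start_href == -1:
        return None, -1
    start_url = start_href + len(href_key)
    end_quote = page.find('"', start_url)
    if end_quote == -1:
        return None, -1
    return page[start_url:end_quote], end_quote


def get_all_links(page):
    """Collect every marked URL, stopping as soon as the marker is gone."""
    links = []
    pos = 0
    while True:
        url, end = get_next_target(page, start=pos)
        if url is None:
            break
        links.append(url)
        pos = end + 1  # continue searching after the URL just found
    return links


# Hypothetical sample markup, not fetched from any real page.
sample_page = (
    '<a class="title may-blank " href="http://example.com/a">A</a>'
    '<a class="other" href="http://example.com/skip">skip</a>'
    '<a class="title may-blank " href="http://example.com/b">B</a>'
    '<p>"trailing" text with "quotes"</p>'
)

for link in get_all_links(sample_page):
    print(link)
```

Passing a start offset to `find` instead of re-slicing the page on every iteration avoids copying the string repeatedly; the explicit `-1` checks are what make the loop terminate cleanly.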