Python-使用BeautifulSoup创建URL列表时出现问题_Python_List_Beautifulsoup

Python-使用BeautifulSoup创建URL列表时出现问题

python list

Python-使用BeautifulSoup创建URL列表时出现问题,python,list,beautifulsoup,Python,List,Beautifulsoup,我试图使用BeautifulSoup创建Python爬虫程序，但收到一个错误，即我试图将非字符串或其他字符缓冲区类型写入文件。通过检查程序输出，我发现我的列表中包含许多没有的项。除了没有，我还有很多图片和东西，它们不是链接，而是我列表中的图片链接。我怎样才能只将URL添加到我的列表中 import urllib from BeautifulSoup import * try: with open('url_file', 'r') as f:

我试图使用BeautifulSoup创建Python爬虫程序，但收到一个错误，即我试图将非字符串或其他字符缓冲区类型写入文件。通过检查程序输出，我发现我的列表中包含许多没有的项。除了没有，我还有很多图片和东西，它们不是链接，而是我列表中的图片链接。我怎样才能只将URL添加到我的列表中

    import urllib
    from BeautifulSoup import *

    try:
        with open('url_file', 'r') as f:
            url_list = [line.rstrip('\n') for line in f]
            f.close()
        with open('old_file', 'r') as x:
            old_list = [line.rstrip('\n') for line in f]
            f.close()
    except:
        url_list = list()
        old_list = list()
        #for Testing
        url_list.append("http://www.dinamalar.com/")


    count = 0


    for item in url_list:
        try:
            count = count + 1
            if count > 5:
                break

            html = urllib.urlopen(item).read()
            soup = BeautifulSoup(html)
            tags = soup('a')

            for tag in tags:

                if tag in old_list:
                    continue
                else:
                    url_list.append(tag.get('href', None))


            old_list.append(item)
            #for testing
            print url_list
        except:
            continue

    with open('url_file', 'w') as f:
        for s in url_list:
            f.write(s)
            f.write('\n')


    with open('old_file', 'w') as f:
        for s in old_list:
            f.write(s)

首先，不要使用不再维护的BeautifulSoup3，您的错误是因为并非所有锚点都有href，因此您尝试不写入任何导致错误的锚点，请使用find_all并设置href=True，以便只查找具有href属性的锚点标记：

soup = BeautifulSoup(html)
tags = soup.find_all("a", href=True)

此外，除了语句外，不要使用毯子，始终捕获您预期的错误，至少在错误发生时打印它们。就我而言，还有很多图像和非链接的东西，如果你想过滤某些链接，那么你必须更具体，或者寻找包含你感兴趣的标记（如果可能），使用regex

href=re.compile（“某些模式”）

或使用css选择器：

# hrefs starting with something
"a[href^=something]"

# hrefs that contain something
"a[href*=something]"

# hrefs ending with  something
"a[href$=something]"

只有你知道html的结构和你想要什么，所以你使用什么完全取决于你自己。

你想过滤掉所有不是字符串的内容吗？不，我想过滤掉所有不是真实URL的内容。非常感谢！此外，我不熟悉你所说的毛毯是什么意思，除了声明。这是否意味着我只是抓住了一些不特定的异常，而没有采取任何措施？@VishalVenkataraman我认为他的意思是，你应该排除一个特定的错误。例如

importorror

或

FileNotFoundError

，不仅仅是非常一般的

异常

。我知道如何在java中实现这一点，但如何在Python中捕获异常？@VishalVenkataraman，正如goosberry先生所说的捕获特定异常，使用

except FileNotFoundError作为e:print（e）

如果您只想要具有http或https方案的链接，请使用

“a[href^=http]”

但是如果路径是相对的，您可能会错过所需的内容。只有你知道你想要什么，答案中应该有足够的信息来找出如何得到它。