Mechanize and BeautifulSoup httplib.InvalidURL error: nonnumeric port: '' (Python)

I am going through a list of URLs and opening them from a script using Mechanize/BeautifulSoup.

However, I get this error:

File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 718, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''

It happens at this line of code:

page = mechanize.urlopen(req)

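My code is below. Can anyone spot what I am doing wrong? Many of the URLs work, and I only get this error when the script hits certain URLs, so I don't know why.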

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re, os
import shutil
import mechanize
import urllib2
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

mech = Browser()
linkfile = open ("links.txt")
urls = []
while 1:
    url = linkfile.readline()
    urls.append("%s" % linkfile.readline())
    if not url:
        break

for url in urls:
    if "http://" or "https://" not in url: 
        url = "http://" + url
    elif "..." in url:
    elif ".pdf" in url:
        #print "this is a pdf -- at some point we should save/log these"
        continue
    elif len (url) < 8:
        continue
    req = mechanize.Request(url)
    req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20100101 Firefox/17.0')
    req.add_header('Accept-Language', 'Accept-Language  en-US,en;q=0.5')
    try:
        page = mechanize.urlopen(req)
    except urllib2.HTTPError, e:
        print "there was an error opening the URL, logging it"
        print e.code
        logfile = open ("log/urlopenlog.txt", "a")
        logfile.write(url + "," + "couldn't open this page" + "\n")
        pass

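I think this piece of code
if "http://" or "https://" not in url: 

isn't doing what you want (or what you think it does).
Python parses it as "http://" or ("https://" not in url), and the non-empty string "http://" is always true, so every URL gets prefixed, including the ones that already start with a scheme. A double-prefixed URL such as http://http://example.com has the host part "http:" with an empty port after the colon, which is consistent with the nonnumeric port: '' error. Because the if branch always runs, the elif checks below it are never reached either.
You need to rewrite it so that each substring is tested separately, for example: if "http://" not in url and "https://" not in url: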
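To make the point above concrete, here is a minimal sketch. The names buggy_needs_prefix and ensure_scheme and the example.com URLs are made up for illustration and are not part of the question's script. It uses startswith instead of in, on the assumption that only a leading scheme should count. The import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2


def buggy_needs_prefix(url):
    # Parsed as: "http://" or ("https://" not in url).
    # A non-empty string is truthy, so this is True for every url.
    return bool("http://" or "https://" not in url)


def ensure_scheme(url):
    """Hypothetical helper: prepend http:// only when no scheme is present."""
    url = url.strip()  # readline() leaves a trailing newline on each URL
    if url.startswith(("http://", "https://")):
        return url
    return "http://" + url


# What the buggy check produces for a URL that already has a scheme:
double_prefixed = "http://" + "http://example.com/page"
# The host part ends in ':' with nothing after it, i.e. an empty port.
bad_host = urlsplit(double_prefixed).netloc
```

In the question's loop, url = ensure_scheme(url) could replace the first if branch, with the remaining checks turned into separate if statements so they are actually evaluated.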
Also, now that I have started testing your code, this part:
urls = []
while 1:
    url = linkfile.readline()
    urls.append("%s" % linkfile.readline())
    if not url:
        break

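effectively guarantees that your URL file is read incorrectly, keeping only every second line. You probably want something like this instead: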
urls = []
while 1:
    url = linkfile.readline()
    if not url:
        break
    urls.append("%s" % url)

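The reason is that you call linkfile.readline() twice per iteration, which forces it to read two lines and saves only every second one to the list.
You also want the if clause before the append, to prevent an empty entry at the end of the list.
That said, your specific example URL works for me. To say more, I might need your links file.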
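As an alternative to the while loop, the file object can be iterated directly, which reads each line exactly once. This is only a sketch: read_urls is a hypothetical helper name, and io.StringIO stands in for open("links.txt") with made-up contents so the snippet is self-contained.

```python
import io


def read_urls(fileobj):
    """Return the non-blank lines of fileobj with surrounding whitespace removed."""
    urls = []
    for line in fileobj:      # iterating reads each line exactly once
        line = line.strip()   # drop the trailing newline
        if line:              # skip blank lines
            urls.append(line)
    return urls


# Stand-in for open("links.txt"); the contents are made up.
sample = io.StringIO(u"example.com/a\nhttp://example.com/b\n\nexample.com/c\n")
urls = read_urls(sample)
```

Stripping the newline here also keeps it out of the URL that is later passed to mechanize.Request.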
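Works for me: http://blog.21ic.com/more.asp?id=27916
That's the one. Does anyone know of a way I can just have it keep going? It basically stops my whole script. It doesn't happen very often (maybe 1 in 25 URLs), but I'd like it to continue rather than break.
It's a bit strange, because I think the error from page = mechanize.urlopen(req) takes time to propagate. It doesn't raise the error until I'm further into the code.
I think you're right, but I'm not sure that is what is causing the error. The URLs are prefixed by the time the script tries to open them; I added a print statement to make sure of it.
See my edit. That particular URL works fine for me, so to help you I might need your links file.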
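On the question of letting the script keep going: httplib.InvalidURL is not a subclass of urllib2.HTTPError, so the except clause in the question never sees it. One possible approach, sketched below, is to catch it alongside the I/O errors and record the failure. The names open_all and fake_opener are hypothetical; opener is assumed to be any callable that takes a URL, for instance a small wrapper around mechanize.urlopen, and the fake one exists only so the sketch runs without network access.

```python
try:
    import httplib                    # Python 2
except ImportError:
    import http.client as httplib     # Python 3


def open_all(urls, opener):
    """Try every URL; return (pages, failures) instead of stopping at the first error."""
    pages, failures = [], []
    for url in urls:
        try:
            pages.append((url, opener(url)))
        except (httplib.InvalidURL, IOError) as exc:
            # InvalidURL covers the "nonnumeric port" case; in Python 2,
            # IOError is the base class of urllib2.URLError and HTTPError.
            failures.append((url, str(exc)))
            continue
    return pages, failures


def fake_opener(url):
    # Hypothetical stand-in for the real opener.
    if url.startswith("http://http:"):
        raise httplib.InvalidURL("nonnumeric port: ''")
    return "<html>ok</html>"


pages, failures = open_all(
    ["http://example.com", "http://http://example.com", "http://example.org"],
    fake_opener,
)
```

The failures list can then be written to the log file in one place after the loop, rather than reopening the log inside the except block.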