
Python InvalidSchema: No connection adapters were found

Tags: python, web-scraping, python-requests

I am using BeautifulSoup with the requests package in Python 2.7 to scrape web news. When I debug the code below, I get an error:

#encoding:utf-8

import re
import socket
import requests
import httplib
import urllib2
from bs4 import BeautifulSoup

#headers = ('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0')
response = requests.get('http://www.mhi.com.my/')

class Crawler(object):
    """Crawler"""
    def __init__(self, url):
        self.url = url

    def getNextUrls(self):
        urls = []
        request = urllib2.Request(self.url)
        request.add_header('User-Agent',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0')
        try:
            html = urllib2.urlopen(request)
        except socket.timeout, e:
            pass
        except urllib2.URLError,ee:
            pass
        except httplib.BadStatusLine:
            pass
            # analyse the txt have gotten
        soup = BeautifulSoup(response.text, 'lxml')  # select and return a list
        pattern = 'http://www\.mhi\.com\.my/.*\.html'
        links = soup.find_all('a', href=re.compile(pattern))
        for link in links:
            urls.append(link)
        return urls

def getNews(url):
    print url
    xinwen = ''
    request = requests.get(url)
    request.add_header('User-Agent',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0')
    try:
        html = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        print e.code

    soup = BeautifulSoup(html, 'html.parser')
    for news in soup.select('p.para'):
        xinwen += news.get_text().decode('utf-8')
    return xinwen

class News(object):
    """
    source:from where 
    title:title of news  
    time:published time of news
    content:content of news 
    type:type of news    
    """
    def __init__(self, title, time, content, type):
        self.title = title
        self.time = time
        self.content = content
        self.type = type

file = open('C:/MyFold/kiki.json', 'a')
url = "http://www.mhi.com.my"
print url
s = Crawler(url)
for newsUrl in s.getNextUrls():
    file.write(getNews(newsUrl))
    file.write("\n")
    print "---------------------------"

file.close()
This is the error that was returned:

C:\Python27\python.exe C:/MyFold/CodeTest/file1.py
http://www.mhi.com.my
Traceback (most recent call last):
  File "C:/MyFold/CodeTest/file1.py", line 74, in <module>
    file.write(getNews(newsUrl))
  File "C:/MyFold/CodeTest/file1.py", line 42, in getNews
    request = requests.get(url)
  File "C:\Python27\lib\site-packages\requests\api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 603, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 685, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '<a class="glow" href="http://www.mhi.com.my/akhbar2016.html" style="text-decoration: none;"></a>'
<a class="glow" href="http://www.mhi.com.my/akhbar2016.html" style="text-decoration: none;"></a>
C:\Python27\python.exe C:/MyFold/CodeTest/file1.py
http://www.mhi.com.my
回溯(最近一次呼叫最后一次):
文件“C:/MyFold/CodeTest/file1.py”,第74行,在
file.write(getNews(newsUrl))
getNews中第42行的文件“C:/MyFold/CodeTest/file1.py”
request=requests.get(url)
get中第70行的文件“C:\Python27\lib\site packages\requests\api.py”
返回请求('get',url,params=params,**kwargs)
文件“C:\Python27\lib\site packages\requests\api.py”,第56行,在请求中
return session.request(method=method,url=url,**kwargs)
文件“C:\Python27\lib\site packages\requests\sessions.py”,第488行,在请求中
resp=自我发送(准备,**发送)
文件“C:\Python27\lib\site packages\requests\sessions.py”,第603行,在send中
adapter=self.get\u适配器(url=request.url)
文件“C:\Python27\lib\site packages\requests\sessions.py”,第685行,在get\u适配器中
raise InvalidSchema(“未找到“%s”的连接适配器%url)
requests.exceptions.InvalidSchema:未找到“”的连接适配器
Is there something wrong with my loop? Can anyone help me?

In your Crawler class, the function getNextUrls() returns a list of <a> elements rather than URL strings:

[<a class="glow" href="http://www.mhi.com.my/akhbar2016.html" style="text-decoration: none;"></a>]

When one of these Tag objects is passed to requests.get(), requests sees the tag's HTML markup instead of a URL, which is why it raises InvalidSchema. Change urls.append(link) to:

urls.append(link.get('href'))

so that getNextUrls() returns a list of URLs instead of a list of elements:

['http://www.mhi.com.my/akhbar2016.html']

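For reference, here is a minimal, self-contained sketch of getNextUrls() with that one-line fix applied. It reuses the question's target URL, User-Agent string, and link regex, but the standalone function name get_next_urls, the use of requests instead of urllib2, and the timeout value are illustrative assumptions rather than the original author's exact code:

#encoding:utf-8
import re
import requests
from bs4 import BeautifulSoup

def get_next_urls(url):
    # Illustrative headers/timeout; only the User-Agent string comes from the question.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
                             'rv:52.0) Gecko/20100101 Firefox/52.0'}
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    # Same link pattern as in the question.
    pattern = re.compile(r'http://www\.mhi\.com\.my/.*\.html')
    # The fix: collect the href strings, not the whole <a> Tag objects.
    return [link.get('href') for link in soup.find_all('a', href=pattern)]

print(get_next_urls('http://www.mhi.com.my/'))
# Expected shape of the result, per the answer above:
# ['http://www.mhi.com.my/akhbar2016.html']

Once getNews() receives plain URL strings, requests.get(url) will find the http:// adapter. Note, though, that the question's getNews() also calls add_header() on the object returned by requests.get(); a requests Response has no such method, so passing a headers dict to requests.get(), as sketched above, avoids that follow-on error as well.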