Web scraping Remax.com in Python

I am trying to follow a tutorial to scrape data from Remax.com. For now I only want to get the square footage of one particular home, but I am getting this error:

Error during requests to https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html : HTTPSConnectionPool(host='www.remax.com', port=443): Max retries exceeded with url: /realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-28b8e2248942> in <module>()
      1 raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
----> 2 html = BeautifulSoup(raw_html, 'html.parser')
      3 for i, li in enumerate(html.select('li')):
      4         print(i, li.text)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, **kwargs)
    190         if hasattr(markup, 'read'):        # It's a file-type object.
    191             markup = markup.read()
--> 192         elif len(markup) <= 256 and (
    193                 (isinstance(markup, bytes) and not b'<' in markup)
    194                 or (isinstance(markup, str) and not '<' in markup)

TypeError: object of type 'NoneType' has no len()

I am very new to web scraping, so I don't know how to fix this. Any suggestions would be greatly appreciated.
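The underlying SSLError ("certificate verify failed") is what makes `simple_get()` return `None` in the first place. For local debugging only, certificate verification can be relaxed with the standard library's `ssl` module; this is a sketch (the request itself is left commented out, and disabling verification is insecure, so it should never be used in production):

```python
import ssl
import urllib.request

# Build an SSL context that skips certificate verification.
# WARNING: this disables a security check; use only for local debugging.
ctx = ssl.create_default_context()
ctx.check_hostname = False   # must be disabled before setting CERT_NONE
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# The context is passed to urlopen like this (not executed here):
# html = urllib.request.urlopen(req, context=ctx).read()
```

A better long-term fix is updating the local certificate bundle (e.g. the `certifi` package) rather than turning verification off.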

Not quite sure about your question, but if all you are interested in is the square footage of the house on that page, you could use:

    import urllib.request
    from bs4 import BeautifulSoup

    url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'

    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

    request = urllib.request.Request(url, headers=hdr)
    html = urllib.request.urlopen(request).read()

    soup = BeautifulSoup(html, 'html.parser')
    foot = soup.find('span', class_="listing-detail-sqft-val")
    foot.text.strip()
Output:

'7,604'
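The value comes back as a display string with a thousands separator, so if a number is needed for comparisons or arithmetic, the comma has to be stripped first. A small sketch:

```python
# The scraped square footage arrives as a display string such as '7,604'.
raw = '7,604'

# Remove the thousands separator and convert to an integer.
sqft = int(raw.replace(',', ''))
print(sqft)  # → 7604
```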

Your simple_get() function returns None if the request fails, so you should test for that before using the result. That could be done like this:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup


def log_error(e):
    """Print an error message (defined here so the example is self-contained)."""
    print(e)


def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()

    return (resp.status_code == 200 
            and content_type is not None 
            and content_type.find('html') > -1)


url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'
raw_html = simple_get(url)

if raw_html:
    html = BeautifulSoup(raw_html, 'html.parser')

    for i, li in enumerate(html.select('li')):
            print(i, li.text)
else:
    print(f"get failed for '{url}'")
So, to put it simply, the following will produce the same error message:

from bs4 import BeautifulSoup

html = BeautifulSoup(None, 'html.parser')
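Since is_good_response() only inspects status_code and the Content-Type header, it can be exercised without any network traffic by faking a response object; this is just a sketch, with SimpleNamespace standing in for a real requests.Response:

```python
from types import SimpleNamespace


def is_good_response(resp):
    """Returns True if the response seems to be HTML, False otherwise."""
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)


# Fake responses: one HTML page, one JSON payload.
ok = SimpleNamespace(status_code=200,
                     headers={'Content-Type': 'text/html; charset=utf-8'})
bad = SimpleNamespace(status_code=200,
                      headers={'Content-Type': 'application/json'})

print(is_good_response(ok))   # → True
print(is_good_response(bad))  # → False
```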

Comment: Are you sure you want to give up on the whole internet? I suppose you would rather scrape it. Your simple_get() function can return None, and if that happens and you don't test for it, raw_html will be None.

Comment: I did not get the same output; I just got the same error as described above.

Comment: I used different libraries than you did, although they should produce the same output. In any case, I have edited my answer to include the complete code.

Comment: How do you get hdr? Using your code exactly, I get the following:

    TypeError                                 Traceback (most recent call last)
    <ipython-input> in <module>()
         15
         16 soup = BeautifulSoup(html, 'html.parser')
    ---> 17 foot = html.find('span', class_="listing-detail-sqft-val")
         18 foot.text.strip()

    TypeError: find() takes no keyword arguments

Comment: You are right; there was a typo in the foot variable; it is fixed in the answer.
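The "find() takes no keyword arguments" error in the comment above comes from calling .find() on the raw bytes returned by urlopen(...).read() instead of on the BeautifulSoup object: bytes.find() is a plain substring search that rejects the class_ keyword. A minimal reproduction without any network access:

```python
# Raw HTML as bytes, as returned by urllib.request.urlopen(...).read()
html = b"<span class='listing-detail-sqft-val'>7,604</span>"

try:
    # bytes.find() is a substring search; it does not accept keyword arguments.
    html.find('span', class_="listing-detail-sqft-val")
except TypeError as e:
    error = e
    print(type(e).__name__)  # → TypeError
```

Parsing the bytes first (soup = BeautifulSoup(html, 'html.parser')) and calling soup.find(...) avoids the error, because BeautifulSoup's find() does accept class_.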