Browsing a website by following href references in Python

python recursion web-crawler

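I am using scrapy and I want to crawl www.rentler.com. I have gone to the site and searched for the city I'm interested in; here is the link to that search result:

https://www.rentler.com/search?Location=millcreek&MaxPrice=

Now, all the listings I'm interested in are contained on that page, and I want to step through them recursively, one by one.

Each listing is listed under:

<body>/<div id="wrap">/<div class="container search-res">/<ul class="search-results"><li class="result">

UPDATE *Thanks for your input. Here is what I have now; it seems to run, but it doesn't scrape:*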

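import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from KSL.items import KSLitem

class KSL(CrawlSpider):
    name = "ksl"
    allowed_domains = ["https://www.rentler.com"]
    start_urls = ["https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978"]
    regex_pattern = '
    rules = (Rule(SgmlLinkExtractor(allow="not sure what to insert here, but this is where I think I need to href appending", callback='parse_item', follow=true),)

If you need to extract data from HTML files, I recommend BeautifulSoup; it is very easy to install and use:
from bs4 import BeautifulSoup

# html is a string holding the HTML source of the page
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    # skip anchors that have no href attribute
    if link.has_attr('href'):
        print(link.attrs['href'])

This little script collects every href found in an <a> tag in the HTML.

Edit: fully functional script:
I tested it on my computer and the result was as expected. BeautifulSoup needs plain HTML, and you can scrape what you need out of it. Take a look at this code:
import requests
from bs4 import BeautifulSoup

# download the search results page as text
html = requests.get(
    'https://www.rentler.com/search?Location=millcreek&MaxPrice=').text
bs = BeautifulSoup(html, 'html.parser')
possible_links = bs.find_all('a')
for link in possible_links:
    if link.has_attr('href'):
        print(link.attrs['href'])

This only shows how to scrape the hrefs from the HTML page you are trying to scrape. Of course you can use it inside scrapy; as I told you, BeautifulSoup only needs plain HTML, which is why I use requests.get(url).text, and you can scrape from that. So I guess scrapy can pass that plain HTML to BeautifulSoup.
Edit 2
OK, look: I don't think you need scrapy at all. If the previous script gives you all the links you want to get data from, you only need to do something like this: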

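Assume I have a list of valid URLs from which I want to get specific data, such as price, acres, address... You can build that list with the previous script alone: instead of printing the URLs to the screen, append them to a list, and append only the ones that start with /listing/. That way you have a list of valid URLs.
for url in valid_urls:
    # fetch each listing page and parse it
    bs = BeautifulSoup(requests.get(url).text, 'html.parser')
    price = bs.find('span', {'class': 'amount'}).text
    print(price)

You only need to look at the page source to work out how to extract the data you need from each URL.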
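
The hrefs on the search page are site-relative (they start with /listing/), so they have to be joined with the site root before requests.get can fetch them. Here is a minimal sketch of building valid_urls that way. It uses only the standard library so it runs offline; the sample HTML, the listing IDs, and the LinkCollector helper are invented for illustration and are not taken from rentler.com:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.rentler.com/"

# Hypothetical stand-in for the downloaded search page.
SAMPLE_HTML = """
<ul class="search-results">
  <li class="result"><a href="/listing/111" class="search-result-link">A</a></li>
  <li class="result"><a href="/listing/222" class="search-result-link">B</a></li>
</ul>
<a href="/about">About</a>
<a name="top">no href here</a>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)

# Keep only listing links and turn them into absolute URLs.
valid_urls = [
    urljoin(BASE_URL, href)
    for href in collector.hrefs
    if href.startswith("/listing/")
]

for url in valid_urls:
    print(url)
```

The same filter-then-join step should work unchanged on the hrefs that BeautifulSoup returns in the script above.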
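You can use a regular expression to find all the rental home IDs in the links. From there, you can use the IDs you have and scrape each listing's page.
import re
# the capture group grabs whatever follows /listing/ in each result link
regex_pattern = r'<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, SOURCE_OF_THE_RENTLER_PAGE)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print(rental_id)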
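
To see what re.findall returns here, a small offline sketch may help: with a single capture group in the pattern, findall returns only the captured text (the part after /listing/), not the whole tag. The sample markup and IDs below are invented; if the real page orders the attributes differently, this pattern would not match.

```python
import re

# Invented sample standing in for SOURCE_OF_THE_RENTLER_PAGE.
sample_source = (
    '<li class="result">'
    '<a href="/listing/111" class="search-result-link">First</a></li>'
    '<li class="result">'
    '<a href="/listing/222" class="search-result-link">Second</a></li>'
    '<a href="/about" class="nav">About</a>'
)

regex_pattern = r'<a href="/listing/(.*?)" class="search-result-link">'

# One capture group -> findall returns a list of the captured strings.
rental_home_ids = re.findall(regex_pattern, sample_source)
print(rental_home_ids)
```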

Edit:
Here is a working version of the code. It prints all the link IDs. You can use it as is.
import re
import urllib.request

url_to_scrape = "https://www.rentler.com/search?Location=millcreek&MaxPrice="
# download the page and decode the bytes so the str pattern can search it
page_source = urllib.request.urlopen(url_to_scrape).read().decode("utf-8", errors="replace")
regex_pattern = r'<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, page_source)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print(rental_id)

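Thanks for your suggestion. I've added the code; it runs without errors, but it isn't scraping. Could you take a look? @GKBRK

I think I found the error, @benknighthorse. You put the link into re.findall(). Instead, you need to pass the page's source. I don't know how to do that with scrapy, but it probably isn't hard.

Thanks for the quick reply, @GKBRK. What is SOURCE_OF_THE_RENTLER_PAGE?

@benknighthorse It is the page's HTML source. You can see my edit…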
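
The mistake discussed in these comments is easy to reproduce: the URL string contains no <a> tags, so searching it yields an empty list, while searching the page's HTML yields the IDs. A sketch, with an invented one-line page standing in for the downloaded source:

```python
import re

regex_pattern = r'<a href="/listing/(.*?)" class="search-result-link">'
url_to_scrape = "https://www.rentler.com/search?Location=millcreek&MaxPrice="

# Invented HTML standing in for what downloading url_to_scrape would return.
page_source = '<a href="/listing/111" class="search-result-link">Home</a>'

ids_from_url = re.findall(regex_pattern, url_to_scrape)    # wrong input: the URL itself
ids_from_source = re.findall(regex_pattern, page_source)   # right input: the page's HTML

print(ids_from_url)
print(ids_from_source)
```

No exception is raised in the first case, which is why the code "runs without errors" yet scrapes nothing.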