Python 2.7: How to crawl multiple domains with a single crawler?

python-2.7 web-scraping web-crawler

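How can I crawl data from multiple domains with a single crawler? I have already crawled individual sites with BeautifulSoup, but I don't know how to create a generic one.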

The question is flawed: for a single crawler to work, the websites you want to scrape must have something in common.

from bs4 import BeautifulSoup
import urllib2  # Python 2.7; on Python 3 use urllib.request instead

for counter in range(0, 10):
    # site = input("Type the name of your website: ")  # Python 3+
    site = raw_input("Type the name of your website: ")
    # Request the URL stored in `site` and read the response body
    html = urllib2.urlopen(site).read()
    # Parse the HTML with BeautifulSoup's built-in html.parser
    soup = BeautifulSoup(html, "html.parser")
    # Print the target of every link on the page;
    # href=True skips <a> tags that have no href attribute
    for link in soup.find_all('a', href=True):
        print link['href']

As noted above, each site has its own unique set of selectors. A generic crawler cannot simply visit a URL and intuitively know what to scrape.
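One way to live with per-site selectors is to keep them in a lookup table keyed by domain and let a single extraction function pick the right set. The following is only a minimal sketch: `SITE_RULES`, `extract`, the domains, and the CSS selectors are all hypothetical placeholders, not taken from any real site.

```python
from bs4 import BeautifulSoup

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7

# Hypothetical per-domain rules: domain -> {field name: CSS selector}.
# The domains and selectors are placeholders for illustration only.
SITE_RULES = {
    "blog.example.com": {"title": "h1.post-title", "body": "div.post-body"},
    "news.example.org": {"title": "h2.headline", "body": "article p"},
}


def extract(url, html):
    """Return the configured fields for one page, or None for unknown domains."""
    rules = SITE_RULES.get(urlparse(url).netloc)
    if rules is None:
        return None  # no rules for this domain, so we cannot know what to scrape
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in rules.items():
        # Join the text of every element matching this site's selector
        matches = soup.select(selector)
        record[field] = " ".join(el.get_text(strip=True) for el in matches)
    return record
```

In Python 2.7 the `html` argument could come from `urllib2.urlopen(url).read()`. Supporting a new site then means adding one entry to the table rather than writing another crawler.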

BeautifulSoup may not be the best choice for this kind of task. Scrapy is another web crawling library, and it is more robust than BS4.

Similar question on Stack Overflow:

Incomplete documentation: