Python 2.7: trouble getting URLs with a Python web crawler that uses BeautifulSoup


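I am trying to build a dynamic web crawler that collects every URL reachable from a starting link. So far I can get the links to all the chapters, but when I try to follow each chapter and collect its section links, nothing is printed.

The code I am using: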

#########################Chapters#######################

import requests
from bs4 import BeautifulSoup, SoupStrainer
import re


base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"

for title in range (1,4): 
url = base_url.format(title=title)
r = requests.get(url)

 for link in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
  if link.has_attr('href'):
    if 'chapt' in link['href']:
        href = "http://law.justia.com" + link['href']
        leveltwo(href)

#########################Sections#######################

def leveltwo(item_url):
 r = requests.get(item_url)
 soup = BeautifulSoup((r.content),"html.parser")
 section = soup.find('div', {'class': 'primary-content' })
 for sublinks in section.find_all('a'):
        sectionlinks = sublinks.get('href')
        print (sectionlinks)

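With a few small changes I was able to get your code to run and print the sections. Mainly, you need to fix the indentation and define the function before you call it.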

import requests
from bs4 import BeautifulSoup, SoupStrainer

#########################Sections#######################

# Defined first so it already exists when the loop below calls it.
def leveltwo(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.content, "html.parser")
    section = soup.find('div', {'class': 'primary-content'})
    for sublinks in section.find_all('a'):
        sectionlinks = sublinks.get('href')
        print(sectionlinks)

#########################Chapters#######################

base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"

for title in range(1, 4):
    url = base_url.format(title=title)
    r = requests.get(url)

    # Nested inside the title loop so every title is scanned, not only the last one.
    for link in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        try:
            if 'chapt' in link['href']:
                href = "http://law.justia.com" + link['href']
                leveltwo(href)
        except KeyError:
            # Anchor without an href attribute; skip it.
            continue

Output (excerpt):

/codes/alabama/2015/title-3/chapter-1/section-3-1-1/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-2/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-3/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-4/index.html
(and so on)
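The point about definition order can be checked in isolation with a small sketch. `leveltwo_demo` is a hypothetical stand-in for `leveltwo` and makes no network request; it only shows that Python runs a script top to bottom, so a call placed above the `def` fails.

```python
# Python executes a script from top to bottom, so a function must be
# defined before the line that calls it actually runs.
try:
    leveltwo_demo("/chapter-1/")  # the def below has not been executed yet
except NameError as exc:
    early_call_error = str(exc)

def leveltwo_demo(item_url):
    # Hypothetical stand-in for leveltwo(); no request is made here.
    return "visited " + item_url

# The same call succeeds once the definition has run.
late_call_result = leveltwo_demo("/chapter-1/")

print(early_call_error)
print(late_call_result)
```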

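You don't need any try/except block. You can pass href=True to find or find_all to select only anchor tags that have an href, or use the CSS selector a[href], as shown below. The chapter links are in the first ul inside the article tag with the id #maincontent, so you don't need to filter at all: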

import requests
from bs4 import BeautifulSoup

base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"

def leveltwo(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.content, "html.parser")
    # a[href] matches only anchors that have an href, so a["href"] is safe.
    section_links = [a["href"] for a in soup.select('div.primary-content a[href]')]
    print(section_links)

for title in range(1, 4):
    url = base_url.format(title=title)
    r = requests.get(url)
    # The chapter links sit in the first ul inside #maincontent.
    for link in BeautifulSoup(r.content, "html.parser").select("#maincontent ul:nth-of-type(1) a[href]"):
        href = "http://law.justia.com" + link['href']
        leveltwo(href)

If you want to use find_all instead, just pass find_all(..., href=True) to filter the anchor tags so that only those with an href are selected.
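Here is a small offline sketch of the three filters side by side. The HTML is made up for illustration and the real Justia markup may differ; only the BeautifulSoup calls themselves are taken from the answer.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page structure is an assumption.
SAMPLE_HTML = """
<article id="maincontent">
  <ul>
    <li><a href="/codes/alabama/2015/title-1/chapter-1/">Chapter 1</a></li>
    <li><a name="anchor-without-href">No href here</a></li>
  </ul>
  <ul>
    <li><a href="/somewhere-else/">Unrelated link</a></li>
  </ul>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only <a> tags that carry an href attribute,
# so a["href"] cannot raise KeyError.
with_find_all = [a["href"] for a in soup.find_all("a", href=True)]

# The CSS attribute selector a[href] expresses the same filter.
with_select = [a["href"] for a in soup.select("a[href]")]

# Restricting the search to the first <ul> inside #maincontent
# drops the unrelated link in the second list.
first_list_only = [
    a["href"] for a in soup.select("#maincontent ul:nth-of-type(1) a[href]")
]

print(with_find_all)
print(with_select)
print(first_list_only)
```

In this sample the anchor without an href is skipped by all three queries, which is why no try/except is needed.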

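Thanks! I see what you did. If I define one function for the chapters in the first part and another function for the sections in the second part that refers to the first, will that work?

Yes, if I understand you correctly. It would also read better if both "parts" were wrapped in functions ;)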
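The idea from the comments, one function per part, could look roughly like the sketch below. The names `chapter_links`, `section_links`, `crawl` and the `fetch` parameter are hypothetical and not from the thread, and the pages in `FAKE_PAGES` are invented so the example runs without touching the network.

```python
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

SITE = "http://law.justia.com"
BASE_URL = SITE + "/codes/alabama/2015/title-{title:01d}/"

def chapter_links(title_html):
    """Return absolute chapter URLs found in a title page's HTML."""
    soup = BeautifulSoup(title_html, "html.parser")
    anchors = soup.select("#maincontent ul:nth-of-type(1) a[href]")
    return [urljoin(SITE, a["href"]) for a in anchors]

def section_links(chapter_html):
    """Return absolute section URLs found in a chapter page's HTML."""
    soup = BeautifulSoup(chapter_html, "html.parser")
    anchors = soup.select("div.primary-content a[href]")
    return [urljoin(SITE, a["href"]) for a in anchors]

def crawl(fetch, titles=range(1, 4)):
    """Collect section URLs; fetch is any callable mapping a URL to HTML."""
    found = []
    for title in titles:
        title_html = fetch(BASE_URL.format(title=title))
        for chapter_url in chapter_links(title_html):
            found.extend(section_links(fetch(chapter_url)))
    return found

# Offline demo: invented pages standing in for the real site.
FAKE_PAGES = {
    BASE_URL.format(title=1):
        '<article id="maincontent"><ul><li>'
        '<a href="/codes/alabama/2015/title-1/chapter-1/">Chapter 1</a>'
        '</li></ul></article>',
    SITE + "/codes/alabama/2015/title-1/chapter-1/":
        '<div class="primary-content">'
        '<a href="/codes/alabama/2015/title-1/chapter-1/section-1-1-1/index.html">Section</a>'
        '</div>',
}

demo_result = crawl(FAKE_PAGES.get, titles=[1])
print(demo_result)
```

Against the live site you would presumably pass something like `lambda url: requests.get(url).content` as `fetch`. Keeping the download separate from the parsing makes each function easy to test on its own.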