Error mentions 'strip' when I'm not using it? (Python)

python web-scraping


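I'm using BeautifulSoup for a scraping assignment in Python and I'm running into a strange error. It mentions strip, which I'm not using, but I'm guessing it has something to do with BeautifulSoup's internals.

In the assignment I'm trying to go to the original URL, find the 18th link, follow that link 7 times, and then return the name from the 18th link on the 7th page. I'm trying to use a function to get the href from the 18th link and then update a global variable so that each pass uses a different URL. Any advice on what I'm missing would be really helpful. Here are the code and the error: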

from bs4 import BeautifulSoup
import urllib
import re

nameList = []
urlToUse = "http://python-data.dr-chuck.net/known_by_Basile.html"

def linkOpen():
    global urlToUse
    html = urllib.urlopen(urlToUse)
    soup = BeautifulSoup(html, "lxml")
    tags = soup("li")
    count = 0
    for tag in tags:
        if count == 17:
            tagUrl = re.findall('href="([^ ]+)"', str(tag))
            nameList.append(tagUrl)
            urlToUse = tagUrl
            count = count + 1
        else:
            count = count + 1
            continue

bigCount = 0
while bigCount < 9:
    linkOpen()
    bigCount = bigCount + 1

print nameList[8]


Error:

Traceback (most recent call last):
  File "assignmentLinkScrape.py", line 26, in <module>
    linkOpen()
  File "assignmentLinkScrape.py", line 10, in linkOpen
    html = urllib.urlopen(urlToUse)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 185, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1075, in unwrap
    url = url.strip()
AttributeError: 'list' object has no attribute 'strip'


re.findall() returns a list of matches. So urlToUse is a list, and you are trying to pass it to urlopen(), which expects a URL string.
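To make the point concrete, here is a minimal sketch of what re.findall() hands back and one way to pull a single string out of it. The markup in tag_html and the name url_to_use are made up for illustration; they stand in for str(tag) and urlToUse from the question.

```python
import re

# Hypothetical markup standing in for str(tag) from the question.
tag_html = '<li><a href="http://example.com/known_by_Someone.html">Someone</a></li>'

# findall() always returns a list, even when there is only one match.
matches = re.findall(r'href="([^ ]+)"', tag_html)
print(type(matches).__name__)

# Take the first match (if any) so that urlopen() would receive a plain string.
url_to_use = matches[0] if matches else None
print(url_to_use)
```

Indexing the list (or guarding against an empty one, as above) is probably the smallest change that would get past the AttributeError.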


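Alexe has already explained your error, but you don't need a regex at all. You only need to get the 18th li tag and extract the href from the anchor tag inside it. You can use find with find_all: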
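A minimal sketch of that find / find_all approach, assuming the links sit in li items inside a single ul. The 20-item page built here and its example.com URLs are invented so the snippet runs without a network connection.

```python
from bs4 import BeautifulSoup

# Made-up page with 20 list items, standing in for the real one.
items = "".join(
    '<li><a href="http://example.com/page{0}.html">Name{0}</a></li>'.format(i)
    for i in range(1, 21)
)
soup = BeautifulSoup("<ul>{}</ul>".format(items), "html.parser")

# find() locates the first <ul>; find_all("li") returns its items in document order.
eighteenth = soup.find("ul").find_all("li")[17]  # index 17 is the 18th item

# .a is the first anchor inside the item; ["href"] reads its attribute as a string.
url = eighteenth.a["href"]
print(url)
```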

Or use a CSS selector:

url = soup.select_one("ul li:nth-of-type(18) a")["href"]

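So, to get the name after visiting the URL seven times, put the logic in a function: request the initial URL, then follow and extract the anchor seven times, and on the last visit take only the text from the anchor: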

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://python-data.dr-chuck.net/known_by_Basile.html").content,"lxml")

def get_nth(n, soup):
    return soup.select_one("ul li:nth-of-type({}) a".format(n))

start = get_nth(18, soup)
for _ in range(7):
    soup = BeautifulSoup(requests.get(start["href"]).content,"html.parser")
    start = get_nth(18, soup)
print(start.text)


I've switched to re.search but I still get an error. When I use str(tag), I also get an error about a missing strip attribute: AttributeError: '_sre.SRE_Match' object has no attribute 'strip'

@McLeodx re.search does not return a string; it returns a MatchObject. Please read the documentation.
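As a rough illustration of the re.search point, the matched text has to be pulled out of the match object with group(). The markup and variable names below are hypothetical and only stand in for str(tag) and urlToUse from the question.

```python
import re

# Hypothetical markup standing in for str(tag) from the question.
tag_html = '<li><a href="http://example.com/known_by_Someone.html">Someone</a></li>'

# search() returns a match object, or None when nothing matches.
match = re.search(r'href="([^"]+)"', tag_html)

# group(1) is the text captured by the first pair of parentheses: a plain string.
url_to_use = match.group(1) if match else None
print(url_to_use)
```

Checking for None first avoids a second AttributeError on pages where the pattern does not match.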