Python can't find a search term in a string parsed by BeautifulSoup
In Python 3, when I want to return only the strings that contain a term I'm interested in, I can do this:
    phrases = ["1. The cat was sleeping",
               "2. The dog jumped over the cat",
               "3. The cat was startled"]
    for phrase in phrases:
        if "dog" in phrase:
            print(phrase)
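The above, of course, prints "2. The dog jumped over the cat".
What I'm trying to do now is make the same idea work on strings parsed by BeautifulSoup. For example, Craigslist has lots of A tags, but we're only interested in the A tags that also have "hdrlnk" in them. So I wrote: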
    import requests
    from bs4 import BeautifulSoup
    url = "https://chicago.craigslist.org/search/apa"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.find_all("a")
    for link in links:
        if "hdrlnk" in link:
            print(link)
The problem is that instead of printing all the A tags that have "hdrlnk" inside them, Python prints nothing. I'm not sure what is going wrong.
Try:
    for link in links:
        # .get avoids a KeyError on anchors that have no href attribute
        if "hdrlnk" in link.get("href", ""):
            print(link)
"hdrlnk" is a class attribute on the link. Since you say you are only interested in those links, just find the links by class, like this:
    import requests
    from bs4 import BeautifulSoup
    url = "https://chicago.craigslist.org/search/apa"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.find_all("a", {"class": "hdrlnk"})
    for link in links:
        print(link)
Output:
<a class="result-title hdrlnk" data-id="6293679332" href="/chc/apa/d/high-rise-2-bedroom-heated/6293679332.html">High-Rise 2 Bedroom Heated Pool Indoor Parking Fire Pit Pet Friendly!</a>
<a class="result-title hdrlnk" data-id="6285069993" href="/chc/apa/d/new-beautiful-studio-in/6285069993.html">NEW-Beautiful Studio in Uptown/free heat</a>
<a class="result-title hdrlnk" data-id="6293694090" href="/chc/apa/d/albany-park-2-bed-1-bath/6293694090.html">Albany Park 2 Bed 1 Bath Dishwasher W/D & Heat + Parking Incl Pets ok</a>
<a class="result-title hdrlnk" data-id="6282289498" href="/chc/apa/d/north-center-2-bed-1-bath/6282289498.html">NORTH CENTER: 2 BED 1 BATH HDWD AC UNITS PROVIDE W/D ON SITE PRK INCLU</a>
<a class="result-title hdrlnk" data-id="6266583119" href="/chc/apa/d/beautiful-2bed-1bath-in-the/6266583119.html">Beautiful 2bed/1bath in the heart of Wrigleyville</a>
<a class="result-title hdrlnk" data-id="6286352598" href="/chc/apa/d/newly-rehabbed-2-bedroom-unit/6286352598.html">Newly Rehabbed 2 Bedroom Unit! Section 8 OK! Pets OK! (NHQ)</a>
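To illustrate why the original `"hdrlnk" in link` test printed nothing, here is a minimal sketch that runs offline. The HTML snippet is hypothetical, modeled on the output above, and is not fetched from Craigslist. It relies on two BeautifulSoup behaviors: `in` on a tag tests membership among the tag's children rather than its markup, and the multi-valued `class` attribute is stored as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the listing output above (not live data).
html = """
<a class="result-title hdrlnk" href="/chc/apa/d/example-one/1.html">Example one</a>
<a class="other" href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a")

# `in` on a Tag looks through the tag's children (here, one text node),
# so the class name is never seen.
in_tag = "hdrlnk" in first

# Converting the tag to a string gives its markup, which does contain the name.
in_markup = "hdrlnk" in str(first)

# The class attribute is a list of class names.
classes = first.get("class", [])
in_classes = "hdrlnk" in classes

# Matching on class directly, as the answer above does.
matches = soup.find_all("a", class_="hdrlnk")

print(in_tag, in_markup, in_classes, len(matches))
```

Testing against `str(link)` would also work, but it matches the text anywhere in the markup, so filtering by class is usually the more precise choice.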
Just search for the term in the link's contents; otherwise your code looks fine:
    import requests
    from bs4 import BeautifulSoup
    url = "https://chicago.craigslist.org/search/apa"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.find_all("a")
    for link in links:
        # Skip empty anchors, where contents[0] would raise an IndexError
        if link.contents and "hdrlnk" in link.contents[0]:
            print(link)
Alternatively, if you want to search within the href or the title, use link['href'] and link['title'].
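One caveat with subscript access: `link['href']` or `link['title']` raises a `KeyError` when a tag lacks that attribute, which is common for `title`. The sketch below uses `Tag.get` with a default instead. The markup and the helper name `links_mentioning` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second anchor has neither href nor title.
html = """
<a href="/chc/apa/d/example-one/1.html" title="Sunny studio">one</a>
<a name="top">anchor without href or title</a>
<a href="/search/apa?dog=1">two</a>
"""

def links_mentioning(soup, term):
    """Return the <a> tags whose href or title contains `term`."""
    found = []
    for link in soup.find_all("a"):
        # .get returns the default instead of raising KeyError
        href = link.get("href", "")
        title = link.get("title", "")
        if term in href or term in title:
            found.append(link)
    return found

soup = BeautifulSoup(html, "html.parser")
hits = links_mentioning(soup, "dog")
for link in hits:
    print(link)
```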
To get the links you want, you can use a selector in your script, which makes the scraper more robust and concise:
    import requests
    from bs4 import BeautifulSoup
    base_link = "https://chicago.craigslist.org"
    res = requests.get("https://chicago.craigslist.org/search/apa").text
    soup = BeautifulSoup(res, "lxml")
    for link in soup.select(".hdrlnk"):
        print(base_link + link.get("href"))
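Plain string concatenation works as long as every href is site-relative, as in the output shown earlier. A possible alternative, sketched here with the standard library's `urllib.parse.urljoin`, also copes with an href that is already absolute. The second href in the list is an assumed case for illustration, not something observed on the page.

```python
from urllib.parse import urljoin

base_link = "https://chicago.craigslist.org"

hrefs = [
    # Relative href taken from the sample output above
    "/chc/apa/d/high-rise-2-bedroom-heated/6293679332.html",
    # Assumed case: an href that is already absolute
    "https://chicago.craigslist.org/chc/apa/d/new-beautiful-studio-in/6285069993.html",
]

# urljoin resolves relative hrefs against the base and leaves absolute URLs alone
absolute = [urljoin(base_link, href) for href in hrefs]
for url in absolute:
    print(url)
```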
I visited the link, but couldn't find any links containing the text "hdrlink".