Python BeautifulSoup 4: Extracting multiple titles and links from different p tags


python web-scraping web-crawler


HTML code:

<div>
    <p class="title">
       <a href="/news/123456">title_1</a> 
    </p>
</div>

<div>
    <p class="title">
       <a href="/news/789000">title_2</a> 
    </p>
</div>

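Hi everyone, I need some help. My task is to extract the titles and links from a web page. I can extract the titles, but not the links. When I try to scrape the links, only the first one is scraped correctly; every later link is ignored and replaced with that first link.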

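Your code has most of the pieces in place and is only slightly off. I think the simplest way to get both the titles and the links is the approach below.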

from bs4 import BeautifulSoup

site = """<div>
    <p class="title">
       <a href="/news/123456">title_1</a> 
    </p>
</div>

<div>
    <p class="title">
       <a href="/news/789000">title_2</a> 
    </p>
</div>"""

s = BeautifulSoup(site, "html.parser")

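# Each <p class="title"> holds one title and its link(s).
for title in s.find_all('p', {'class': 'title'}):
    # Collect the href of every <a> inside this <p>.
    links = [x['href'] for x in title.find_all('a', href=True)]
    # strip=True drops the whitespace surrounding the title text.
    line = title.get_text(strip=True)
    print(line)
    print(links)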


As you can see, links is a list. That is only there to cover the case where a single title has more than one link.
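To illustrate why a list can be useful here, the following is a minimal sketch using a hypothetical title paragraph that contains two links. The second `<a>` tag, its `/news/123456/comments` href, and the helper name `links_per_title` are assumptions made up for illustration; they do not appear in the original HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one title paragraph holding two links.
html = """<div>
    <p class="title">
       <a href="/news/123456">title_1</a>
       <a href="/news/123456/comments">comments</a>
    </p>
</div>"""

soup = BeautifulSoup(html, "html.parser")


def links_per_title(soup):
    """Return a list of (text, [hrefs]) pairs, one per <p class="title">."""
    results = []
    for p in soup.find_all("p", {"class": "title"}):
        # find_all returns every matching <a>, so no link is dropped.
        hrefs = [a["href"] for a in p.find_all("a", href=True)]
        # Join the text pieces with a space and strip surrounding whitespace.
        text = p.get_text(" ", strip=True)
        results.append((text, hrefs))
    return results


for text, hrefs in links_per_title(soup):
    print(text, hrefs)
```

If every title is known to have exactly one link, taking the first element of the list (or using `find` inside the loop) would be enough.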

Try this approach; it finds all of the values:

from bs4 import BeautifulSoup

text = """<div>
    <p class="title">
       <a href="/news/123456">title_1</a> 
    </p>
</div>

<div>
    <p class="title">
       <a href="/news/789000">title_2</a> 
    </p>
</div>
"""

soup = BeautifulSoup(text, 'html.parser')
for i in soup.find_all('p', attrs={'class': 'title'}):
    link = None
    if i.find('a'):
        link = i.find('a').get('href')
    print('Title:', i.get_text(strip=True), 'Link:', link)
# Output as:
# Title: title_1 Link: /news/123456
# Title: title_2 Link: /news/789000


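Without checking, I think the answer may be to change

p_containers = s.find('p', {'class': 'title'})

to

p_containers = s.find_all('p', {'class': 'title'})

No, I was wrong; the answer is below! Oops, the for loop was missing an indent: it needs to be nested. If my answer was useful, could you mark it as accepted?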

from bs4 import BeautifulSoup

text = """<div>
    <p class="title">
       <a href="/news/123456">title_1</a> 
    </p>
</div>

<div>
    <p class="title">
       <a href="/news/789000">title_2</a> 
    </p>
</div>
"""

soup = BeautifulSoup(text, 'html.parser')
for i in soup.find_all('p', attrs={'class': 'title'}):
    link = None
    if i.find('a'):
        link = i.find('a').get('href')
    print('Title:', i.get_text(strip=True), 'Link:', link)
# Output as:
# Title: title_1 Link: /news/123456
# Title: title_2 Link: /news/789000
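The asker's original code is not shown, so the following is only a guess at the kind of mistake that produces the symptom described in the question (every title paired with the first link). It contrasts looking up the link once with `find` on the whole document against looking it up inside each `<p>`. The function names `buggy` and `fixed` are hypothetical.

```python
from bs4 import BeautifulSoup

html = """<div><p class="title"><a href="/news/123456">title_1</a></p></div>
<div><p class="title"><a href="/news/789000">title_2</a></p></div>"""

soup = BeautifulSoup(html, "html.parser")


def buggy(soup):
    # find() returns only the FIRST match in the whole document,
    # so every title ends up paired with the same link.
    first_link = soup.find("a", href=True)["href"]
    return [(p.get_text(strip=True), first_link)
            for p in soup.find_all("p", {"class": "title"})]


def fixed(soup):
    # Search for the <a> inside each <p>, so the link follows its title.
    pairs = []
    for p in soup.find_all("p", {"class": "title"}):
        a = p.find("a", href=True)
        pairs.append((p.get_text(strip=True), a["href"] if a else None))
    return pairs


print(buggy(soup))  # [('title_1', '/news/123456'), ('title_2', '/news/123456')]
print(fixed(soup))  # [('title_1', '/news/123456'), ('title_2', '/news/789000')]
```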