Python: How to extract specific dl and dt list elements using BeautifulSoup

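I'm trying to extract the date, link, and title (in Japanese) of each news release from the website requested in the code below.

Here is the code I have tried so far: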

import requests
from bs4 import BeautifulSoup

r=requests.get("https://www.rinnai.co.jp/releases/index.html")
c=r.content
soup=BeautifulSoup(c,"html.parser")

all=soup.find_all("dl",)

My expected result is:

2019年01月09日
/releases/2019/0109/index_2.html
「深型スライドオープンタイプ」食器洗い乾燥機2019年3月1日発売 食器も調理器具もまとめて入る大容量

2019年01月09日
/releases/2019/0109/index_1.html
シンプルキッチンに似合う洗練されたドロップインコンロ 2月1日新発売 耐久性に優れたステンレストッププレート仕様のグリルレスコンロ

My actual result is:

[<dl>
<dt>2019年01月09日</dt>
<dd>
<a href="/releases/2019/0109/index_2.html">



「深型スライドオープンタイプ」食器洗い乾燥機2019年3月1日発売 食器も調理器具もまとめて入る大容量



</a></dd>
</dl>, <dl>
<dt>2019年01月09日</dt>
<dd>
<a href="/releases/2019/0109/index_1.html">



シンプルキッチンに似合う洗練されたドロップインコンロ 2月1日新発売 耐久性に優れたステンレストッププレート仕様のグリルレスコンロ



</a></dd>
</dl>, <dl>

[2019年01月09日, 2019年01月09日,

You can find the titles in the div with the id index_news:

from bs4 import BeautifulSoup as soup
import requests

d = soup(requests.get('https://www.rinnai.co.jp/releases/index.html').text, 'html.parser')
# Keep only the dl elements inside the div with id="index_news" and build [date, link, title] for each
results = [[i.dt.get_text(strip=True), i.a['href'], i.a.get_text(strip=True)] for i in d.find('div', {'id':'index_news'}).find_all('dl')]

Output (first two news articles):

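There is no need to overcomplicate this; you are already halfway there. You only need to iterate over all and pull the required data from each dl. You can then either print it or save it to a list.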

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.rinnai.co.jp/releases/index.html")
c = r.content
soup = BeautifulSoup(c, "html.parser")
# Only look at the dl elements inside the div with id="index_news"
all = soup.find('div', id='index_news').find_all("dl")
# Uncomment the line below if saving to a list
# all_data = []
for dl in all:
    date = dl.find('dt').text.strip()
    link = dl.find('a')['href'].strip()
    title = dl.find('a').text.strip()
    print(f'{date}\n{link}\n{title}\n')
    # Instead of printing, you can save each item to a list;
    # uncomment the line below if saving to a list
    # all_data.append([date, link, title])

Output:

2019年01月09日
/releases/2019/0109/index_2.html
「深型スライドオープンタイプ」食器洗い乾燥機2019年3月1日発売 食器も調理器具もまとめて入る大容量

2019年01月09日
/releases/2019/0109/index_1.html
シンプルキッチンに似合う洗練されたドロップインコンロ 2月1日新発売 耐久性に優れたステンレストッププレート仕様のグリルレスコンロ

...
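The links in this output are relative to the site root. If absolute URLs are wanted, one possible approach is to resolve each link against the page URL with urljoin from the standard library. This is only a sketch: the helper name absolutize is hypothetical, and the sample row is copied from the output above in the [date, link, title] shape used by all_data.

```python
from urllib.parse import urljoin

# Page the rows were scraped from (the same URL as in the code above).
BASE_URL = "https://www.rinnai.co.jp/releases/index.html"

# One sample row in the [date, link, title] shape; copied from the output above.
rows = [
    [
        "2019年01月09日",
        "/releases/2019/0109/index_2.html",
        "「深型スライドオープンタイプ」食器洗い乾燥機2019年3月1日発売 食器も調理器具もまとめて入る大容量",
    ],
]


def absolutize(rows, base_url=BASE_URL):
    """Return new rows whose link is resolved against base_url (hypothetical helper)."""
    return [[date, urljoin(base_url, link), title] for date, link, title in rows]


for date, link, title in absolutize(rows):
    print(f"{date}\n{link}\n{title}\n")
```

urljoin also handles links that are relative to the current directory, so the input rows do not need to start with a slash.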

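This is a great answer, but would you be willing to expand the code so that beginners can follow it? It is very useful. The last item in the output is ['ニュースリリース', '2019/index.html', '2019年'], which is not a news release. Is there a way to limit the output so that it contains only the dl items for news releases?

@cyoung1989 I updated the answer to include this.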
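To make the restriction discussed in these comments concrete, here is a minimal offline sketch written as an explicit loop. The HTML string is a hand-written stand-in modeled on the snippets in the question, not the real page markup; in particular, placing the archive dl outside the div is an assumption about the page layout. The name extract_releases is hypothetical. The sketch uses BeautifulSoup's select with a CSS selector as an alternative to chaining find and find_all.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page (an assumption, not the real site markup).
SAMPLE_HTML = """
<div id="index_news">
  <dl>
    <dt>2019年01月09日</dt>
    <dd>
      <a href="/releases/2019/0109/index_2.html">

        「深型スライドオープンタイプ」食器洗い乾燥機2019年3月1日発売 食器も調理器具もまとめて入る大容量

      </a></dd>
  </dl>
</div>
<dl>
  <dt>ニュースリリース</dt>
  <dd><a href="2019/index.html">2019年</a></dd>
</dl>
"""


def extract_releases(html):
    """Return [date, link, title] for each dl nested in the div with id="index_news"."""
    soup = BeautifulSoup(html, "html.parser")
    releases = []
    # The selector matches only dl elements that are descendants of div#index_news,
    # so the archive dl outside that div is never visited.
    for dl in soup.select("div#index_news dl"):
        dt = dl.find("dt")
        a = dl.find("a", href=True)
        if dt is None or a is None:
            continue  # skip entries that lack a date or a link
        releases.append([dt.get_text(strip=True), a["href"], a.get_text(strip=True)])
    return releases


for date, link, title in extract_releases(SAMPLE_HTML):
    print(f"{date}\n{link}\n{title}\n")
```

Passing strip=True to get_text removes the blank lines that surround the title inside the a tag, which is the same job the strip() calls do in the answer above.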