Python BeautifulSoup: decompose() removes more than the targeted elements


python web web-scraping


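I have a set of pages downloaded from The Guardian website. I need to reduce these files to the article text only, removing all ads and other surrounding text. I am able to get the main text, but when I try to remove the bottom element (div, attrs={"class": "submeta"}), the entire text is deleted, even though the text is not part of that element.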

# Decompose
for remove1 in soup.select("figure", attrs={"class": "element-atom"}):
    remove1.decompose()
for remove2 in soup.select("aside", attrs={"data-component": "rich-link"}):
    remove2.decompose()
for remove3 in soup.select("div", attrs={"class": "submeta"}):
    remove3.decompose()

# Text extraction
textHeadline = soup.find("h1", attrs={"class": "content__headline"})
textUnderline = soup.find("div", attrs={"class": "tonal__standfirst"})
textBody = soup.find("div", attrs={"class": "content__article-body from-content-api js-article__body"})

# Final text
reductionResult = str(textHeadline) + str(textUnderline) + str(textBody)

Thanks for your help.

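Use .find_all instead of .select to select the elements to decompose. .select works only with CSS selectors; the attrs argument does not appear to be applied as a filter there, which would explain why soup.select("div", ...) matched every div and removed the whole text: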

for remove1 in soup.find_all("figure", attrs={"class": "element-atom"}):
    remove1.decompose()
for remove2 in soup.find_all("aside", attrs={"data-component": "rich-link"}):
    remove2.decompose()
for remove3 in soup.find_all("div", attrs={"class": "submeta"}):
    remove3.decompose()

textHeadline = soup.find("h1", attrs={"class": "content__headline"})
textUnderline = soup.find("div", attrs={"class": "tonal__standfirst"})
textBody = soup.find("div", attrs={"class": "content__article-body from-content-api js-article__body"})

# Final text
reductionResult = str(textHeadline) + str(textUnderline) + str(textBody)
print(reductionResult)

Prints:

<h1 class="content__headline" itemprop="headline">
'Clear discrimination': South Sudanese react to exclusion from migration program
</h1><div class="tonal__standfirst u-cf">

...and so on.
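
If you would rather keep .select, the attribute filters can be written into the CSS selector itself. The following is only a minimal sketch under stated assumptions: SAMPLE_HTML is a made-up, heavily simplified stand-in for an article page, and strip_unwanted is a hypothetical helper name, neither of which comes from the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for an article page (not real Guardian markup).
SAMPLE_HTML = """
<div class="content">
  <h1 class="content__headline">Headline</h1>
  <div class="content__article-body">
    <p>Body text.</p>
    <figure class="element-atom">embedded widget</figure>
    <aside data-component="rich-link">related link</aside>
  </div>
  <div class="submeta">footer links</div>
</div>
"""


def strip_unwanted(html):
    """Parse html and remove the unwanted elements using CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    # With .select, the class and attribute conditions live in the selector,
    # so only the matching figure, aside and div elements are returned.
    unwanted = 'figure.element-atom, aside[data-component="rich-link"], div.submeta'
    for tag in soup.select(unwanted):
        tag.decompose()
    return soup


cleaned = strip_unwanted(SAMPLE_HTML)
headline = cleaned.find("h1", attrs={"class": "content__headline"})
body = cleaned.find("div", attrs={"class": "content__article-body"})
print(headline.get_text(strip=True))
print(body.get_text(strip=True))
```

Here only the three targeted elements are decomposed, while the outer div and the article body stay in the tree.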
