Python: parsing the HTML of a child element [BeautifulSoup]


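I have only been learning Python for two weeks.

I am scraping an XML file and looping over its items. One element of each item (item -> description) contains HTML. How can I get the text from the p tags?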

url="https://www.milenio.com/rss"
source=requests.get(url)
soup=BeautifulSoup(source.content, features="xml")

items=soup.findAll('item')

for item in items:
  html_text=item.description
  # This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>

So if I write a for loop that tries to get all the p tags, it does not work:

for p in html_text.find_all('p'):
  print(p)

AttributeError: 'NoneType' object has no attribute 'find_all'

Thank you very much.

It should look like this:

for item in items:
    html_text=item.description #??

    #!! dont use html_text.find_all !!
    for p in item.find_all('p'):
        print(p)

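The problem is how bs4 handles CData (it is well documented, but not well solved).

You need to import CData from bs4, which lets you extract the CData section as a string. Then pass that string to a new bs4 object built with the html.parser parser, which gives you findAll for iterating over its contents.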

from bs4 import BeautifulSoup, CData
import requests

url="https://www.milenio.com/rss"
source=requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')

items=soup.findAll('item')

for item in items:
  html_text = item.description
  findCdata = html_text.find(text=lambda tag: isinstance(tag, CData))
  newSoup = BeautifulSoup(findCdata, 'html.parser')
  paragraphs = newSoup.findAll('p')
  for p in paragraphs:
    print(p.get_text())
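
The second parsing step above can be tried offline on a literal string. This is a minimal sketch, assuming the description has already been extracted as a plain string; the helper name paragraph_texts and the sample input are illustrative and not part of the original answer. It returns an empty list when given None, since attribute access such as item.description yields None when the tag is missing.

```python
from bs4 import BeautifulSoup


def paragraph_texts(html_fragment):
    """Return the text of every <p> in an HTML string (hypothetical helper)."""
    if html_fragment is None:
        # item.description is None when an item has no <description> tag
        return []
    fragment = BeautifulSoup(html_fragment, "html.parser")
    return [p.get_text(strip=True) for p in fragment.find_all("p")]


# Sample input modelled on the question's comment, not real feed data
sample = "<p>Paragraph 1</p> <p>Paragraph 2</p>"
for text in paragraph_texts(sample):
    print(text)
```

Returning a list of strings rather than printing inside the helper keeps the parsing step easy to check separately from the network request.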

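Edit: the OP also needed to extract the link text and found that, inside the item loop, it could only be extracted with link=item.link.nextSibling, because the link content ends up outside the tag, like this: http://www...
In the XML tree view, this particular XML document shows a dropdown for the link element, which may be the reason.
To get the content of other tags that do not show a dropdown in the XML tree view and have no nested CData, write the tag name in lowercase and return the text as usual: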
item.pubdate.get_text() # Gets the contents of the <pubDate> tag
item.author.get_text() # Gets the contents of the <author> tag
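
The link behaviour described in the edit can be reproduced on a small inline document. This is a minimal sketch with a made-up item and an example.com URL. A likely explanation, offered here as an assumption rather than something the answer states, is that html.parser treats <link> as a void HTML element, closes it immediately, and leaves the URL as the text node that follows it.

```python
from bs4 import BeautifulSoup

# Made-up RSS-like item for illustration, not taken from the real feed
rss = (
    "<item>"
    "<link>http://example.com/story</link>"
    "<pubDate>Mon, 01 Jan 2024</pubDate>"
    "</item>"
)

item = BeautifulSoup(rss, "html.parser").find("item")

link_inside = item.link.get_text()        # text inside the <link> tag itself
link_after = str(item.link.next_sibling)  # text node right after the tag
pub_date = item.pubdate.get_text()        # html.parser lowercases tag names

print(repr(link_inside))
print(link_after)
print(pub_date)
```

next_sibling is the current spelling of the nextSibling attribute used in the answer.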

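Based on this SO link.
Thank you very much, it works great, but now a problem has come up. With "html.parser", the link text ends up outside the tag … I am trying to use item.text, but it does not work. Is there a way to get this link? Thank you very much!
This code worked; it gets the link after the tag: link=item.link.nextSibling
Nice. Oddly, in your particular case, link is the only instance where this happens. For example, the pubDate tag can be scraped with item.pubdate.get_text() and its content stays inside the tag. This may have to do with your link element getting a dropdown in the XML tree view. I will edit the answer to include more about this for future reference.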