Python: scraping text after a blockquote with bs4


I have something like this in HTML:

<p align="left"><strong><tt>
        some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
        some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>

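I can get the links, the comment, and the text from the first part of this markup, but I can't get anything after </p></blockquote></blockquote>. Those two blockquote closing tags are not visible in the page source; I only see them in soup when I debug the Python code. In soup I have all of the HTML, but in rounds the code ends with "text after comment".

Is there any way to get "link i want" and "text i want"?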

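If you look at the HTML, you will see that there is a </p> before <a href="link i want">. That means your variable rounds does not contain the link you want. Search for the next <a> that is a sibling of the <p align="left"> element instead: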

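from bs4 import BeautifulSoup

# Same markup as in the question, including the stray closing tags
txt = '''<p align="left"><strong><tt>
        some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
        some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>
'''

soup = BeautifulSoup(txt, 'html.parser')
# "~" matches an <a> that comes after <p align="left"> and shares its parent
matched_link = soup.select_one('p[align="left"] ~ a')
print(matched_link)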

Prints:

<a href="link i want"><tt>text i want</tt></a>
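
The same point can be checked without a CSS selector. The sketch below is illustrative only: the markup is a simplified version of the question's HTML, and it assumes rounds was obtained with soup.find('p', align='left'), which the question does not actually show.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the question's markup (not the real page)
txt = (
    '<p align="left"><tt>some text:</tt>'
    '<a href="some link"><tt>some other text</tt></a>'
    '<tt>text after comment</tt></p>'
    '<tt>, </tt><a href="link i want"><tt>text i want</tt></a>'
)

soup = BeautifulSoup(txt, 'html.parser')

# Assumption: this is how `rounds` was built in the question
rounds = soup.find('p', align='left')

# Only links inside the <p> are reachable from `rounds`
links_inside = [a['href'] for a in rounds.find_all('a')]

# The wanted link lives outside the <p>, as a later sibling
wanted = rounds.find_next_sibling('a')

print(links_inside)
print(wanted['href'], '|', wanted.get_text())
```

Searching from rounds with find_all only descends into the paragraph, which is why the link after </p> never shows up there; find_next_sibling looks sideways instead.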

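It looks like the data you are looking for is added to the DOM dynamically. You should consider scraping with a headless browser, using a tool like Selenium.

But the code in rounds ends at "text after comment". That is because your <p align="left"> tag ends there.

@GaganTK you are right, I missed that. Thank you.

You are certainly right. But your solution doesn't work for me; matched_link is empty. Could you explain exactly what 'p[align="left"] ~ a' means?

@dylo p[align="left"] ~ a is a CSS selector that selects the next <a> tag that is preceded by a <p align="left"> element. You can try print(soup.find_all('a')) and see whether the <a> tag you want is actually there.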
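
To make the selector and the suggested find_all check concrete, here is a small sketch. The HTML and the href values are made up for illustration and are not from the question's page; it only shows how "~" behaves in Beautiful Soup's select compared with find_all and with the stricter "+" combinator.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link inside the <p>, two after it
html = '''
<div>
<p align="left"><a href="inside">inside the paragraph</a></p>
<tt>, </tt>
<a href="first-after">first link after p</a>
<a href="second-after">second link after p</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Diagnostic: every <a> the parser actually sees, wherever it is
all_links = [a['href'] for a in soup.find_all('a')]

# "~" : <a> elements that share a parent with the <p> and come after it
siblings_after = [a['href'] for a in soup.select('p[align="left"] ~ a')]

# "+" : only an <a> that immediately follows the <p>; a <tt> is in between
adjacent = soup.select('p[align="left"] + a')

# select_one returns the first match, or None when nothing matches
first = soup.select_one('p[align="left"] ~ a')

print(all_links)
print(siblings_after)
print(adjacent)
print(first['href'])
```

If find_all('a') lists the wanted link but the "~" selector returns None, the link is probably not a sibling of the <p> in the parsed tree. That can happen when a different parser repairs the malformed markup differently.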