Python: Scraping the correct date hidden after a script tag using BeautifulSoup

python html web-scraping

I want to scrape a date from a web page. The date text (after the script tag) is injected by JavaScript:

<div class="row">
    <span class="LName"><a target="_blank" href="http://google.com">[me too]</a></span>
    <script language="Javascript" type="text/javascript">formatDate('2020,5,23,09,00,00',1)</script>6/23/2020&nbsp;10:00&nbsp;Tuesday
</div>

I tried:

soup.select('div.row > script')[0].get_text()

"formatDate('2020,5,23,09,00,00',1)"

"\n[me too] formatDate('2020,5,23,09,00,00',1)\n"

and:

"formatDate('2020,5,23,09,00,00',1)"

"\n[me too] formatDate('2020,5,23,09,00,00',1)\n"

When I inspect the tag in Chrome, I can see the date text after the script tag.

When I run:

soup.select('div.row')

it returns the tag without the date text.

The date text is injected by JavaScript, and I need to use only BeautifulSoup, not Selenium.

The date 6/23/2020 is a sibling of the <script> tag. You can use .find_next_sibling(text=True) to get this text.

For example:

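from bs4 import BeautifulSoup

txt = '''<div class="row">
    <span class="LName"><a target="_blank" href="http://google.com">[me too]</a></span>
    <script language="Javascript" type="text/javascript">formatDate('2020,5,23,09,00,00',1)</script>6/23/2020&nbsp;10:00&nbsp;Tuesday
</div>'''

soup = BeautifulSoup(txt, 'html.parser')

# The date is a text node directly after the <script> tag, so take the
# script's next text sibling and trim the surrounding whitespace.
d = soup.select_one('div.row > script').find_next_sibling(text=True).strip()
print(d)             # full text: date, time and weekday
print(d.split()[0])  # first whitespace-separated token: the date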
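The comments on this answer raise the case where the text is missing from the downloaded HTML because JavaScript adds it in the browser. A minimal sketch of a defensive variant that returns None in that case instead of raising; the helper name `text_after_script` and the two sample strings are hypothetical, and it uses `.next_sibling` (the immediately following node) rather than `.find_next_sibling(text=True)`:

```python
from bs4 import BeautifulSoup


def text_after_script(html):
    """Return the text node right after div.row > script, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    script = soup.select_one('div.row > script')
    if script is None:
        return None
    sibling = script.next_sibling
    # next_sibling may be None, another tag, or a whitespace-only string.
    if isinstance(sibling, str) and sibling.strip():
        return sibling.strip()
    return None


# Hypothetical inputs: one with the date present in the markup, one without.
static_html = ('<div class="row"><script>formatDate(\'2020,5,23,09,00,00\',1)'
               '</script>6/23/2020&nbsp;10:00&nbsp;Tuesday</div>')
injected_html = ('<div class="row"><script>formatDate(\'2020,5,23,09,00,00\',1)'
                 '</script>\n</div>')

print(text_after_script(static_html))
print(text_after_script(injected_html))
```

If the second call's result (None) is what you get for the real page, the date is not in the HTML that BeautifulSoup receives, and no selector will find it.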

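The date you're looking for isn't inside the script tag though, is it? You can see the closing </script> before it.

Yes, you're right, I edited the title.

Does this answer your question?

The text is injected by JavaScript, and I need to use only BeautifulSoup, not Selenium.

I will create a new question then.

@KhaledKoubaa Maybe the text you're seeing is injected by JavaScript, so BeautifulSoup doesn't see it. Try print(soup) and see whether it's there.

Yes, I think so. When I run soup.select('div.row') it returns the tag without the text, but when I use Inspect in Chrome I see the text.

@KhaledKoubaa Yes, it's injected by JavaScript. You can try selenium or other methods (the re module, the json module, etc.).

Can you help me do it with re (or something else) and BeautifulSoup, without Selenium? Please, I don't like using Selenium.

@KhaledKoubaa To avoid cluttering the comments section, I suggest opening a new question here on StackOverflow. You can describe the problem there and include the URL in question and the expected output.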

The example in the answer prints:

6/23/2020 10:00 Tuesday
6/23/2020
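The comment thread mentions the re module as a way to avoid Selenium when the rendered text is not in the downloaded HTML. A minimal sketch of that idea, reading the date from the arguments of the `formatDate(...)` call itself. The helper name `parse_format_date` is hypothetical, and two assumptions are inferred only from the single sample in the question (`'2020,5,23,09,00,00',1` rendered as `6/23/2020 10:00`): the month is zero-based, and the second argument is an offset in hours. The real `formatDate` function may behave differently.

```python
import re
from datetime import datetime, timedelta

# Matches formatDate('Y,M,D,h,m,s', offset) and captures both arguments.
FORMAT_DATE_RE = re.compile(r"formatDate\('([\d,]+)'\s*,\s*(-?\d+)\)")


def parse_format_date(js_source):
    """Return a datetime built from a formatDate(...) call, or None if not found."""
    match = FORMAT_DATE_RE.search(js_source)
    if match is None:
        return None
    year, month, day, hour, minute, second = (
        int(part) for part in match.group(1).split(',')
    )
    offset_hours = int(match.group(2))
    # Assumption: month is zero-based and the offset is in hours.
    base = datetime(year, month + 1, day, hour, minute, second)
    return base + timedelta(hours=offset_hours)


dt = parse_format_date("formatDate('2020,5,23,09,00,00',1)")
print(f"{dt.month}/{dt.day}/{dt.year}")
```

With BeautifulSoup, the JavaScript source passed to the helper would come from `soup.select_one('div.row > script').string`.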