Python Beautifulsoup: find special tag text

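I'm struggling to find the text of a date and convert it into a system date that I can use as a variable elsewhere. I'm looking for the date that follows "title" in the span tag.

I tried several approaches but never came up with a simple solution. In the end I used

modif_time = soup.find(text=re.compile('title'))

Here is the HTML that contains the information: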


20.7 KB
application/vnd.openxmlformats-officedocument.wordprocessingml.document
r28ee854af54c
12 minutes 48 seconds ago
xn06611 (Jeff Mendenhall)

What you want is:

from datetime import datetime
dt = datetime.strptime(soup.find("span", title=True, class_='tool')["title"], "%a, %d %b %Y %H:%M:%S")

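soup.find(...)["title"] gets the value of the span tag's title attribute. title=True restricts the results to tags that have a title attribute, and class_='tool' further restricts them to tags whose class attribute has the value 'tool' (the underscore in class_ avoids a clash with the Python reserved word class).

You can convert it to a datetime object with datetime.strptime: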
from datetime import datetime

...

span = soup.find('span')
title = span['title']
print(datetime.strptime(title, '%a, %d %b %Y %H:%M:%S'))
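
As a minimal sketch of the lookup described above: the HTML below is a hypothetical stand-in for the page in the question (the original markup is not shown here), assuming a span with class "tool" whose title holds the date string. The name modif_time is borrowed from the question.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup: an assumed shape for the span in the question.
html = """
<div>
  <span class="size">20.7 KB</span>
  <span class="tool" title="Fri, 19 Dec 2014 09:38:49">12 minutes 48 seconds ago</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# title=True keeps only tags that have a title attribute;
# class_="tool" keeps only tags whose class is "tool".
span = soup.find("span", title=True, class_="tool")

# Parse the attribute value into a datetime object.
modif_time = datetime.strptime(span["title"], "%a, %d %b %Y %H:%M:%S")
print(modif_time)
```

If no span matches, find() returns None and the subscript fails, so real code would probably check the result before reading span["title"].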

Another approach uses a small helper to test each title:
import time
import requests
from bs4 import BeautifulSoup

html = requests.get(url).content   # url: the page you're interested in

soup = BeautifulSoup(html, "html.parser")

def is_date(x):
    try:
        # Try to parse the string as a date; raises ValueError on mismatch
        time.strptime(x, "%a, %d %b %Y %H:%M:%S")
        return True
    except ValueError:
        return False

print(is_date("Fri, 19 Dec 2014 09:38:49"))  # it prints True

res = [s for s in soup.find_all('span', title=True) if is_date(s['title'])]
print(res)
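
The same filter can be tried without a network request. This sketch assumes a hypothetical HTML snippet (not the page from the question) and passes the helper directly as the title filter, since Beautiful Soup accepts a function as an attribute filter and calls it with the attribute's value. The None guard is there because the function may also be called for tags that have no title at all.

```python
import time

from bs4 import BeautifulSoup

DATE_FORMAT = "%a, %d %b %Y %H:%M:%S"


def is_date(value):
    """Return True if value is a string that matches DATE_FORMAT."""
    if value is None:  # the tag has no title attribute
        return False
    try:
        time.strptime(value, DATE_FORMAT)
        return True
    except ValueError:
        return False


# Hypothetical markup, assumed for illustration only.
html = """
<span title="Download">20.7 KB</span>
<span class="tool" title="Fri, 19 Dec 2014 09:38:49">12 minutes 48 seconds ago</span>
<span>xn06611</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the spans whose title parses as a date.
res = soup.find_all("span", title=is_date)
print([s.get_text() for s in res])
```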

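With is_date, the code gets all span elements and keeps only those whose title is a date. For the page in the question it prints a list containing the one matching span, whose text is "12 minutes 48 seconds ago".

Alternatively, pass email.utils.parsedate as the title filter:
import email.utils as EU
soup.find_all('span', title=EU.parsedate)
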
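soup.find_all('span') finds all span tags in the HTML. You can narrow the result by specifying keyword arguments: soup.find_all('span', title=EU.parsedate) finds all span tags whose title attribute makes EU.parsedate return a truthy value, for example:
In [112]: EU.parsedate('Fri, 19 Dec 2014 09:38:49')
Out[112]: (2014, 12, 19, 9, 38, 49, 0, 1, -1)

The complete example: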
import bs4 as bs
import datetime as DT
import email.utils as EU

# 'data' is a local file containing the HTML
with open('data') as f:
    soup = bs.BeautifulSoup(f, 'html.parser')
spans = soup.find_all('span', title=EU.parsedate)

for span in spans:
    print(span.attrs['title'])
    # Fri, 19 Dec 2014 09:38:49

    timetuple = EU.parsedate(span.attrs['title'])
    date = DT.datetime(*timetuple[:6])
    print(date)
    # 2014-12-19 09:38:49

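When EU.parsedate cannot parse the title, it returns None (a falsy value), so soup.find_all('span', title=EU.parsedate) finds only those span tags whose title attribute looks like a date. You can then use datetime.datetime(*timetuple[:6]) to convert the time tuple returned by EU.parsedate into a datetime.datetime object.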
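
The conversion step can be checked on its own with only the standard library. This is a small sketch; title_to_datetime is a hypothetical helper name, not something from the answer above.

```python
import datetime as DT
import email.utils as EU


def title_to_datetime(title):
    """Return a datetime for a date-like title, or None if it does not parse."""
    timetuple = EU.parsedate(title)
    if timetuple is None:  # parsedate could not read the string as a date
        return None
    # The first six fields are year, month, day, hour, minute, second.
    return DT.datetime(*timetuple[:6])


print(title_to_datetime("Fri, 19 Dec 2014 09:38:49"))
print(title_to_datetime("Download"))
```

Returning None for unparseable input mirrors how EU.parsedate itself behaves, so the helper can also serve as a truthiness filter.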

Your code searches for text between tags, not for a span element with a title attribute. You can at least tell find_all() to look for the title attribute with soup.find_all('span', title=True), or do the same for a single result with soup.find('span', title=True).
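
The difference is easy to see on a small example. This sketch assumes a hypothetical one-line HTML snippet, and it uses the string argument, which is the current name for the text argument used in the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, assumed for illustration only.
html = (
    '<span class="tool" title="Fri, 19 Dec 2014 09:38:49">'
    "12 minutes 48 seconds ago</span>"
)
soup = BeautifulSoup(html, "html.parser")

# This searches text nodes only. Here "title" appears only as an
# attribute name, never in the text, so nothing matches.
by_text = soup.find(string=re.compile("title"))
print(by_text)

# This matches on the attribute itself.
by_attr = soup.find("span", title=True)
print(by_attr["title"])
```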