Python BeautifulSoup: how to get the data for the latest selector
After sending an HTTP request from Python, the response (data) is an HTML page that contains many ABCD blocks. Here is a snippet:
<tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/18/2018 21:45</td>
<td>12/18/2018 21:46</td>
<td>10</td>
<td>10</td>
<td>100.0</td>
<td><span class="label success">Success</span></td>
<td>SMS</td>
<td>
<a data-id="134717" class="btn" title="Go">View</a>
</td>
</tr>
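Something like this should do what you need. Apparently there is no way to tell which td holds the dates, so you simply iterate over every td in the matched tr and try to parse a datetime from it. If parsing succeeds, append the result to the list of dates for that particular id. Once you have all the dates for each id, you just pick the latest one among them.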
from dateutil import parser as du_parser
from collections import defaultdict
from bs4 import BeautifulSoup as BS
data = "<tr><td class=\"success\"></td><td class=\"truncate\">ABCD</td><td>12/18/2018 21:45</td><td>12/18/2018 21:46</td><td>10</td><td>10</td><td>100.0</td><td><span class=\"label success\">Success</span></td><td>SMS</td><td><a data-id=\"134717\" class=\"btn\" title=\"Go\">View</a></td></tr>"
b1 = BS(data, "html.parser")
# rows (tr) whose label cell reads exactly "ABCD"
tr_that_contain_our_td = [x.parent for x in b1.find_all("td", string="ABCD")]
ids_dict = defaultdict(list)
# iterate over matched tr's to get their dates
for tr in tr_that_contain_our_td:
    extracted_id = tr.find("a")['data-id']
    for td in tr.find_all("td"):
        # use the cell text, so nested tags such as <span> are never passed to the parser
        text = td.get_text(strip=True)
        # skip cells without "/": dateutil would read a bare number such as "10" as a day of the current month
        if "/" not in text:
            continue
        try:
            ids_dict[extracted_id].append(du_parser.parse(text))
        except (ValueError, OverflowError):
            pass  # not a date, nothing to do here
# keep only the latest date for each id
ids_dict = {k: max(v) for k, v in ids_dict.items()}
print(ids_dict)
Assuming the HTML follows the same pattern, and given:
html = ''' <tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/18/2018 21:45</td>
<td>12/18/2018 21:46</td>
<td>10</td>
<td>10</td>
<td>100.0</td>
<td><span class="label success">Success</span></td>
<td>SMS</td>
<td>
<a data-id="134717" class="btn" title="Go">View</a>
</td>
</tr>
<tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/20/2018 21:45</td>
<td>12/20/2018 21:46</td>
<td>99</td>
<td>99</td>
<td>999.0</td>
<td><span class="label success999">Success</span></td>
<td>SMS99</td>
<td>
<a data-id="9913471799" class="btn" title="Go">View</a>
</td>
</tr>
<tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/22/2018 21:45</td>
<td>12/22/2018 21:46</td>
<td>99</td>
<td>99</td>
<td>999.0</td>
<td><span class="label success999">Success</span></td>
<td>SMS99</td>
<td>
<a data-id="found the latest date" class="btn" title="Go">View</a>
</td>
</tr>
<tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/21/2018 21:45</td>
<td>12/21/2018 21:46</td>
<td>99</td>
<td>99</td>
<td>999.0</td>
<td><span class="label success999">Success</span></td>
<td>SMS99</td>
<td>
<a data-id="9913471799" class="btn" title="Go">View</a>
</td>
</tr>'''
If the data-id numbers are increasing, you can use max() to select the tag with the highest data-id value.
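I think you have to iterate through the rows, otherwise how would you know whether you have the latest one? I would probably iterate and build a dictionary with the date as the key and the data-id as the value. Then, once you have all the date: id pairs, take the most recent date (the key).
This worked, thank you Andrei, a very elegant and correct answer.
I tried it but did not get a result; id1 comes out as None.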
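The dictionary idea from the first comment can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `latest_id_by_date`, the shortened `sample_html`, and the data-id values `134999` and `134800` are hypothetical and not taken from the real page, and the date format is assumed to be `%m/%d/%Y %H:%M` as in the snippet above.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Shortened, hypothetical sample that follows the same row pattern as above.
sample_html = """
<table>
<tr><td class="truncate">ABCD</td><td>12/18/2018 21:45</td><td>12/18/2018 21:46</td>
    <td><a data-id="134717" class="btn">View</a></td></tr>
<tr><td class="truncate">ABCD</td><td>12/22/2018 21:45</td><td>12/22/2018 21:46</td>
    <td><a data-id="134999" class="btn">View</a></td></tr>
<tr><td class="truncate">ABCD</td><td>12/20/2018 21:45</td><td>12/20/2018 21:46</td>
    <td><a data-id="134800" class="btn">View</a></td></tr>
</table>
"""


def latest_id_by_date(html, label="ABCD", fmt="%m/%d/%Y %H:%M"):
    """Return the data-id of the row holding the most recent date, or None."""
    soup = BeautifulSoup(html, "html.parser")
    ids_by_date = {}  # datetime -> data-id
    for cell in soup.find_all("td", string=label):
        row = cell.find_parent("tr")
        link = row.find("a", attrs={"data-id": True})
        if link is None:
            continue  # row without a data-id link, nothing to record
        for td in row.find_all("td"):
            try:
                stamp = datetime.strptime(td.get_text(strip=True), fmt)
            except ValueError:
                continue  # this cell does not hold a date in the assumed format
            ids_by_date[stamp] = link["data-id"]
    if not ids_by_date:
        return None
    # datetime keys compare chronologically, so max() picks the latest date
    return ids_by_date[max(ids_by_date)]


print(latest_id_by_date(sample_html))
```

Keying the dictionary on `datetime` objects rather than on the raw strings means the comparison is chronological, which also holds across a change of year. If two rows shared exactly the same timestamp, the later row would overwrite the earlier one in this sketch.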
import bs4
import re
import datetime
dates_list = []
soup = bs4.BeautifulSoup(html, 'html.parser')
for i in soup.select("td.truncate"):
    # the row text contains the dates; take the first mm/dd/yyyy match
    match = re.search(r'\d{2}/\d{2}/\d{4}', i.parent.text)
    if match is None:
        continue  # no date in this row
    dates_list.append(datetime.datetime.strptime(match.group(), '%m/%d/%Y').date())
# compare date objects rather than strings, so the ordering is chronological
most_recent = max(dates_list).strftime('%m/%d/%Y')
for row in soup.find_all('tr'):
    if most_recent in row.text:
        id1 = row.find("a").get('data-id')
        print(id1)
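# compare the ids as numbers (as strings, "99" would sort above "134717"); non-numeric ids are skipped
numeric_ids = [a['data-id'] for a in soup.select("a[data-id]") if a['data-id'].isdigit()]
recentDataID = max(numeric_ids, key=int)
print(recentDataID)
# if you want to select the parent `tr`; the value is quoted because an unquoted CSS attribute value cannot start with a digit
mostRecentRow = soup.select_one('a[data-id="%s"]' % recentDataID).find_parent("tr")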