Python: how to split text scraped from an HTML site
I'm writing a small script that prints a message every time my UPS tracking is updated. So far the script looks like this:
import sys

import requests
from bs4 import BeautifulSoup as soup

# Assumed setup (not shown in the original post): s is a requests session.
s = requests.Session()

# url holds the tracking number; it is left out here so that nobody can tamper with my tracking.
tracking_full_site = 'https://wwwapps.ups.com/WebTracking/track?track=yes&trackNums=' + url
headers = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
                   ' (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')
}
resp = s.get(tracking_full_site, headers=headers, timeout=12)
resp.raise_for_status()
bs4 = soup(resp.text, 'lxml')

old_list = []
for item in bs4.findAll('tr', {'valign': 'top'}):
    # Collapse all whitespace in the row into single spaces.
    where_is_it = " ".join(item.text.split())
    old_list.append(where_is_it)
print(old_list)
sys.exit()
However, the output I get is:
United States 28.08.2018 6:16 Package departed international carrier facility
Edgewood, NY, United States 27.08.2018 20:00 Package transferred to post office
United States 27.08.2018 18:42 Package processed by international carrier
EDGEWOOD, NY, United States 24.08.2018 15:51 Package processed by UPS Mail Innovations origin facility
24.08.2018 12:55 Package received for processing by UPS Mail Innovations
United States 22.08.2018 8:19 Shipment information received by UPS Mail Innovations
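That is exactly what " ".join(item.text.split()) is expected to produce.
My question is: how can I split it so that I can print only the country, or only the date, the time, or the description?
Edit:
Here is the full HTML, for everyone who wanted to see it: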
<table summary="" border="0" cellpadding="0" cellspacing="0" class="dataTable">
<tbody>
<tr>
<th scope="col">Location</th>
<th scope="col">Date</th>
<th scope="col">Local Time</th>
<th scope="col" class="full">Activity (<a class="btnlnkR helpIconR" href="javascript:helpModLvl('https://www.ups.com/content/se/en/tracking/tracking/description.html')">What's this?</a>)</th>
</tr>
<tr valign="top">
<td class="nowrap">
United States
</td>
<td class="nowrap">
28.08.2018
</td>
<td class="nowrap">
6:16
</td>
<td>Package departed international carrier facility</td>
</tr>
<tr valign="top" class="odd">
<td class="nowrap">
Edgewood,
NY,
United States
</td>
<td class="nowrap">
27.08.2018
</td>
<td class="nowrap">
20:00
</td>
<td>Package transferred to post office</td>
</tr>
<tr valign="top">
<td class="nowrap">
United States
</td>
<td class="nowrap">
27.08.2018
</td>
<td class="nowrap">
18:42
</td>
<td>Package processed by international carrier</td>
</tr>
<tr valign="top" class="odd">
<td class="nowrap">
EDGEWOOD,
NY,
United States
</td>
<td class="nowrap">
24.08.2018
</td>
<td class="nowrap">
15:51
</td>
<td>Package processed by UPS Mail Innovations origin facility</td>
</tr>
<tr valign="top">
<td class="nowrap">
</td>
<td class="nowrap">
24.08.2018
</td>
<td class="nowrap">
12:55
</td>
<td>Package received for processing by UPS Mail Innovations</td>
</tr>
<tr valign="top" class="odd">
<td class="nowrap">
United States
</td>
<td class="nowrap">
22.08.2018
</td>
<td class="nowrap">
8:19
</td>
<td>Shipment information received by UPS Mail Innovations</td>
</tr>
</tbody>
</table>
As you can see in the output, not every row has a country. Please keep that in mind.
Edit, in response to one of the answers:
['Sweden', '29.08.2018', '11:08', 'Package arrived at international carrier']
['United States', '28.08.2018', '6:16', 'Package departed international carrier facility']
['Edgewood,\t\t\t\t\t\t\t\n\n\t\t\t\t \n\t\t\t\t \t\n\t\t\t\t \tNY,\t\t\t\t \n\n\t\t\t\t \n\t\t\t\t \t\n\t\t\t\t \tUnited States', '27.08.2018', '20:00', 'Package transferred to post office']
['United States', '27.08.2018', '18:42', 'Package processed by international carrier']
['EDGEWOOD,\t\t\t\t\t\t\t\n\n\t\t\t\t \n\t\t\t\t \t\n\t\t\t\t \tNY,\t\t\t\t \n\n\t\t\t\t \n\t\t\t\t \t\n\t\t\t\t \tUnited States', '24.08.2018', '15:51', 'Package processed by UPS Mail Innovations origin facility']
['', '24.08.2018', '12:55', 'Package received for processing by UPS Mail Innovations']
['United States', '22.08.2018', '8:19', 'Shipment information received by UPS Mail Innovations']
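After you get the GET response, put it in a variable (respString) and then parse it. The idea is to read through the HTML and work out where the information sits. If you are targeting this part of the HTML: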
<tr valign="top" class="odd">
<td class="nowrap">
United States
</td>
<td class="nowrap">
22.08.2018
</td>
<td class="nowrap">
8:19
</td>
<td>Shipment information received by UPS Mail Innovations</td>
</tr>
This gets the "United States" part out of the HTML:
var startIndex = respString.indexOf('<td class="nowrap">');
var tempRespString = respString.substring(startIndex);
var tempStartIndex = tempRespString.indexOf('>');
var tempEndIndex = tempRespString.indexOf('</');
var country = tempRespString.substring(tempStartIndex + 1, tempEndIndex);
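Since the question is about Python, the same index-based idea can be sketched with str.find and slicing. This is only a rough equivalent of the snippet above, not the answerer's code: the helper name first_cell_text and the sample string are made up for illustration, and the whitespace is collapsed at the end so the surrounding newlines and tabs do not end up in the result.

```python
def first_cell_text(resp_string, marker='<td class="nowrap">'):
    """Return the text of the first cell opened by `marker`, or None if absent."""
    start = resp_string.find(marker)          # Python counterpart of indexOf
    if start == -1:
        return None
    content_start = start + len(marker)       # first character after the opening tag
    end = resp_string.find('</', content_start)
    if end == -1:
        return None
    # Collapse the newlines and tabs that surround the cell text.
    return " ".join(resp_string[content_start:end].split())


# Hypothetical stand-in for the response body (resp.text in the question).
sample = '''<tr valign="top" class="odd">
<td class="nowrap">
United States
</td>
<td class="nowrap">
22.08.2018
</td>
</tr>'''

country = first_cell_text(sample)
print(country)  # United States
```

String searching like this breaks as soon as the markup changes slightly, so it is probably best kept for quick checks; an HTML parser is more robust for the whole table.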
Can you elaborate on what kind of output you need?
@sauravverma I edited the question at the end!
Isn't that JavaScript? I'm using Python :'(
Yes, sorry about that. This is working code of mine that does almost exactly what you want. Just apply the same logic in Python; the point is how indexOf is used.
Oh, okay! I'll see whether I can make use of it. Thanks a lot :)
Some of the countries come with \t\t\t\t\t\t\t\n\t\t\t\t in them.
Please run the code; strip() will take care of the \n and \t characters.
The "Country" value for the Edgewood row still contains a long run of \t characters.
It seems to work now! But how do I print just the country and so on in that case? (Outside the loop, of course.)
# Here "soup" is the parsed BeautifulSoup document (named bs4 in the question).
array = []
for item in soup.findAll('tr', {'valign': 'top'}):
    # One list of cell texts per row; split()/join collapses the \t and \n runs
    # inside values such as "Edgewood, NY, United States".
    array.append([" ".join(f.text.split()) for f in item.findAll("td")])

output = []
for e in array:
    output.append({"Country": e[0], "Date": e[1], "Time": e[2], "Description": e[3]})

If you want to print only the country, just do this:

for element in output:
    print(element["Country"])
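To illustrate how the per-row lists can be used once they are collected, here is a small self-contained sketch. It assumes rows shaped like the lists shown in the question's edit (the tab runs in the Edgewood row are shortened here), the helper name to_record and the "When" key are made up for illustration, and it treats an empty location as None because not every row has a country.

```python
from datetime import datetime

# Rows as collected from the <td> cells; values copied from the question's
# output, with the whitespace run in the second row shortened.
rows = [
    ['United States', '28.08.2018', '6:16',
     'Package departed international carrier facility'],
    ['Edgewood,\t\t\t\n\n\t\tNY,\t\t\n\n\t\tUnited States', '27.08.2018', '20:00',
     'Package transferred to post office'],
    ['', '24.08.2018', '12:55',
     'Package received for processing by UPS Mail Innovations'],
]


def to_record(cells):
    """Turn one four-cell row into a dict with normalized whitespace."""
    location, date, time, description = (" ".join(c.split()) for c in cells)
    return {
        "Country": location or None,  # some rows have no location at all
        "When": datetime.strptime(f"{date} {time}", "%d.%m.%Y %H:%M"),
        "Description": description,
    }


records = [to_record(r) for r in rows]

# Any field can now be printed on its own, outside the parsing loop.
for rec in records:
    print(rec["Country"] or "(no location)", rec["When"].isoformat(),
          rec["Description"], sep=" | ")
```

Parsing the date and time into a datetime is optional, but it makes it easy to sort the events or to compare the newest one with the previous run when checking for updates.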