Python 3.x: How do I remove old items from a list that a website changes periodically?

python-3.x

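I have code that prints every product from Adidas US. I would like it to detect when a new product is added to the list and print only that new product. Right now it can only print the entire product list. How can I do this?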

from bs4 import BeautifulSoup
import urllib.request
import re
import urllib.parse
import time

headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
                "Accept-Language" : "en-US,en;q=0.8"}

url = 'http://www.adidas.com/on/demandware.static/-/Sites-adidas-US-Library/en_US/v/sitemap/product/adidas-US-en-us-product.xml'

values = {'s':'search',
'submit':'search'}


data = urllib.parse.urlencode(values)
data = data.encode('utf-8')

req = urllib.request.Request(url, data, headers=headers)
resp = urllib.request.urlopen(req)
respData = resp.read()

rawdata = re.findall(r'<loc>(.*?)</loc>', str(respData))

for Product_list in rawdata:
    print(Product_list)

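If you can re-collect the data at regular intervals, just check whether the URLs observed at time B include any product URLs that were not observed at time A. Here is a short example.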

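Note: I replaced urllib with requests. Also, you imported BeautifulSoup but did not use it; I used it here instead of re. Neither replacement is strictly necessary; they are just my personal preference.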

from bs4 import BeautifulSoup
import requests

headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36", "Accept-Language" : "en-US,en;q=0.8"}
url = 'http://www.adidas.com/on/demandware.static/-/Sites-adidas-US-Library/en_US/v/sitemap/product/adidas-US-en-us-product.xml'
values = {'s':'search', 'submit':'search'}

# replace urllib with requests
r = requests.post(url, values, headers=headers)
# name the parser explicitly so bs4 does not guess one and emit a warning
soup = BeautifulSoup(r.text, 'html.parser')

# replace re with soup
products = [str(p.text) for p in soup.find_all('loc')]

# sample output
print(products[0:5])
['http://www.adidas.com/us/ultraboost-shoes/BA8842.html',
 'http://www.adidas.com/us/ultraboost-shoes/BA8843.html',
 'http://www.adidas.com/us/crazypower-trainer-shoes/BA8929.html',
 'http://www.adidas.com/us/alphabounce-aramis-shoes/B54366.html',
 'http://www.adidas.com/us/harden-vol.-1-shoes/B39494.html']
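
Because the same extraction has to run at time A and again at time B, it may help to wrap the parsing step in a function. The sketch below is one possible way to do that using only the standard library (xml.etree.ElementTree in place of BeautifulSoup, which is a substitution on my part). The function name extract_product_urls and the sample XML are illustrative assumptions and do not come from the original code.

```python
import xml.etree.ElementTree as ET


def extract_product_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Sitemap tags are usually namespaced, e.g. '{http://...}loc',
        # so compare only the local part of the tag name.
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name == 'loc' and element.text:
            urls.append(element.text.strip())
    return urls


# Made-up sample standing in for the downloaded sitemap text.
sample_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.adidas.com/us/ultraboost-shoes/BA8842.html</loc></url>
  <url><loc>http://www.adidas.com/us/ultraboost-shoes/BA8843.html</loc></url>
</urlset>"""

print(extract_product_urls(sample_xml))
```

Calling the function once per fetch gives one plain list of URLs per point in time, which is the shape the comparison step needs.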

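Now suppose you pull the data again using the same process and get the following response. The first link is copied from products, and the second link is new: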

new_products = ['http://www.adidas.com/us/ultraboost-shoes/BA8842.html',
                'http://www.adidas.com/us/ultraboost-shoes/foo.html']

There are many ways to check whether the elements of one list exist in another list. I like the isin() method that pandas provides:

import pandas as pd
new_products = pd.Series(new_products)

# get only new products not in old products
mask = ~new_products.isin(products)
new_products[mask].values
['http://www.adidas.com/us/ultraboost-shoes/foo.html']
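
pandas is not required for this comparison. As a rough sketch, assuming the product URLs are plain strings, the same check can be done with a built-in set, and the set can be updated on each run so that only unseen URLs are reported. The names find_new_products and seen are hypothetical, and the URL lists are the small examples from above rather than real fetched data.

```python
def find_new_products(current, seen):
    """Return URLs in `current` that are not in `seen`, keeping their order.

    `seen` is a set that is updated in place, so a later call only
    reports URLs that appeared since this one.
    """
    new_items = []
    for url in current:
        if url not in seen:
            new_items.append(url)
            seen.add(url)
    return new_items


products = ['http://www.adidas.com/us/ultraboost-shoes/BA8842.html',
            'http://www.adidas.com/us/ultraboost-shoes/BA8843.html']
new_products = ['http://www.adidas.com/us/ultraboost-shoes/BA8842.html',
                'http://www.adidas.com/us/ultraboost-shoes/foo.html']

seen = set(products)                           # URLs observed at time A
print(find_new_products(new_products, seen))   # time B: only foo.html is new
print(find_new_products(new_products, seen))   # same data again: nothing new
```

To run this periodically, one option is to repeat the fetch inside a loop, pass each fresh list to find_new_products with the same seen set, and pause between iterations with time.sleep.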