Python: determine whether a website is a web shop

python python-3.x selenium web-scraping


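I am trying to determine, for each website in a list, whether it is a web shop.

It seems that most web shops have either an `a` tag whose `href` contains the word "cart", or an `li` tag assigned to a class whose name contains the word "cart".

I believe I need a regular expression, and then need to tell BeautifulSoup's `find` method to search the HTML for that regular expression in `a` or `li` tags. How would I do that?

So far, the code below searches the HTML for `a` tags whose `href` contains "cart".

Code: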
```python
import re

from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

# Match "cart" anywhere in the attribute value, ignoring case.
cart = re.compile('cart', re.IGNORECASE)

options = webdriver.ChromeOptions()
options.add_argument('--headless')   # run Chrome without a visible window
options.add_argument('log-level=3')  # reduce Chrome's console logging

# The context manager quits the browser when the block ends.
with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Treat the site as a shop if any <a> has "cart" in its href.
        if soup.find('a', href=cart):
            shops.append(url)

print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)
```

Output:
SHOPS FOUND:
https://www.nike.com/
https://www.amazon.com/

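You can use the contains operator (`*=`) with CSS attribute selectors to require that the `class` attribute or the `href` attribute contains a substring. Combine the two conditions (class and href) with the OR syntax, that is, a comma between the selectors. TODO: consider adding a wait condition to make sure all `li` and `a` elements are present before reading the page source.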
```python
from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

options = webdriver.ChromeOptions()
options.add_argument('--headless')   # run Chrome without a visible window
options.add_argument('log-level=3')  # reduce Chrome's console logging

# The context manager quits the browser when the block ends.
with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # <a> with "cart" in href, OR <li> with "cart" in its class attribute.
        items = soup.select('a[href*=cart], li[class*=cart]')
        if items:
            shops.append(url)

print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)
```


Thank you! It seems to work with other websites as well. One issue is that it does not return GameStop as a shop, which is not what I expected after inspecting GameStop's HTML. I wonder why that is.

Hi, for some reason I did not see this reply. I currently cannot access that URL. Could you share the HTML of that page? Then I will take a look.

The selector is correct, but I am being denied access to that server. If you run without headless, can you see the page content using selenium + chrome?
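
One way to tell a selector problem apart from an access problem is to run the same check against saved HTML, without Selenium. This is a minimal sketch under assumptions: the function name `looks_like_shop` and the sample HTML strings are made up for illustration and are not taken from GameStop or any other real site. It also adds the `i` flag to the attribute selectors so that values match case-insensitively, similar to `re.IGNORECASE` in the question; the selector in the answer does not use that flag.

```python
from bs4 import BeautifulSoup

# Same selector as in the answer, plus the "i" flag for case-insensitive values.
CART_SELECTOR = 'a[href*="cart" i], li[class*="cart" i]'


def looks_like_shop(html):
    """Hypothetical helper: True if the HTML has an <a> or <li> mentioning "cart"."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.select_one(CART_SELECTOR) is not None


# Made-up sample pages, for illustration only.
shop_html = '<ul><li class="nav-Cart"><a href="/basket">Basket</a></li></ul>'
link_html = '<a href="/checkout/cart">Cart</a>'
plain_html = '<a href="/about">About</a><li class="nav-item">Home</li>'

print(looks_like_shop(shop_html))
print(looks_like_shop(link_html))
print(looks_like_shop(plain_html))
```

To check a real page, `driver.page_source` could be saved to a file and its contents passed to `looks_like_shop`. If the saved HTML matches but the live run does not, the server is probably returning a different page to the automated browser.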