Python: how to use a conditional statement to filter specific items from a web page


python web-scraping


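I've written a scraper in Python and it runs smoothly. Now I would like to discard or accept specific links from the page, for example only the links containing "mobiles", but I can't get that to work even after adding a conditional statement. I'd appreciate any help correcting my mistake.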

import requests
from bs4 import BeautifulSoup
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div',class_='')[0].findAll('a'):
        if "mobiles" not in link:
            print(link.get('href'))
SpecificItem()

On the other hand, if I do the same thing using the lxml library with XPath, it works:

import requests
from lxml import html
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    tree = html.fromstring(Process.text)
    links = tree.xpath('//div[@class=""]//a/@href')
    for link in links:
        if "mobiles" not in link:
            print(link)

SpecificItem()

So at this point I think the code has to be written somewhat differently when using the BeautifulSoup library to get the intended result.

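The root of your problem is that your if condition works a bit differently between BeautifulSoup and lxml. In the lxml version, the XPath ends in /@href, so each link is already the URL string. In the BeautifulSoup version, link is a tag object, so if "mobiles" not in link: does not check whether "mobiles" is in the href field. I didn't look too closely, but I'm guessing it is comparing against the link.text field. Explicitly using the href field does the trick: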
import requests
from bs4 import BeautifulSoup

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div',class_='')[0].findAll('a'):
        # Pull the URL string out of the tag; this is None when the <a> has no href
        href = link.get('href')
        # Test the string itself, not the tag object
        if href and "mobiles" not in href:
            print(href)

SpecificItem()

This prints out a bunch of links, none of which contain "mobiles".
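To make the difference concrete without hitting the network, here is a minimal offline sketch. The markup in SAMPLE_HTML and the helper split_links are hypothetical stand-ins made up for illustration, not the real page's structure, and "html.parser" is used only so the sketch does not depend on lxml. As far as I can tell, the in operator on a BeautifulSoup tag tests membership among the tag's direct children, which would explain why the original condition never filtered anything; the sketch also shows both the "accept" and the "discard" side of the filter.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page's structure is not assumed here.
SAMPLE_HTML = """
<div>
  <a href="/mobiles/pr?sid=tyy">Mobile Phones</a>
  <a href="/laptops/pr?sid=6bo">Laptops</a>
  <a>Anchor without an href</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
anchors = soup.find_all("a")
first = anchors[0]

# `in` on a Tag looks at the tag's direct children, not at its href,
# so a substring of the URL is not found this way.
found_in_tag = "mobiles" in first
# Testing the href string itself performs the intended substring check.
found_in_href = "mobiles" in first.get("href")
print(found_in_tag, found_in_href)


def split_links(tags, keyword):
    """Return (without_keyword, with_keyword) lists of href strings.

    Tags that have no href attribute are skipped.
    """
    without_keyword, with_keyword = [], []
    for tag in tags:
        href = tag.get("href")
        if href is None:
            continue
        if keyword in href:
            with_keyword.append(href)
        else:
            without_keyword.append(href)
    return without_keyword, with_keyword


without_kw, with_kw = split_links(anchors, "mobiles")
print(without_kw)
print(with_kw)
```

Use with_kw to accept only the "mobiles" links, or without_kw to discard them.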
Remove the conditional statement and print everything. What do you see?

Thanks for your answer. If I remove the conditional statement, I can see all the links available on that page. In fact, I can't see any change in the results with or without the if statement.

Thanks supersam654 for your answer. I was thinking along exactly the lines you describe here, but I didn't know how to pull out the "href" before the print step. Thanks again, it solved the problem.