How do I find specific text in a website's HTML code with Python and BeautifulSoup?

python html web-scraping

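I am completely new to HTML and Python. I want to use Python to search a website for auction data, and I want to find every listing whose text contains a weight marker such as "LB", "LBS", "Lbs", and so on. Here is an example of the HTML for a listing I am interested in: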

    <a class="product" href="/Item/91150404">
    <div class="title">
                30.00 LB Lego Mini Figures Lego People Grab Bag
                                        <br>Bids: 7                                    </div> </a>

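I have also tried reading similar questions here and implementing their answers, but I am stuck. Any help would be greatly appreciated! I am using Python 3.7.3 and BeautifulSoup4. Thanks, everyone!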

Instead of:

text = re.compile('LB')

try:

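from bs4 import BeautifulSoup
import datetime as dt
import requests

url = 'https://www.shopgoodwill.com/Listings?st=&sg=&c=388&s=&lp=0&hp=999999&sbn=false&spo=false&snpo=false&socs=false&sd=false&sca=false&caed=4/18/2020&cadb=7&scs=false&sis=false&col=0&p=1&ps=40&desc=false&ss=0&UseBuyerPrefs=true'
r = requests.get(url)
bs = BeautifulSoup(r.text, "html.parser")

# Collect the products.
bs_products = bs.find_all("a", {"class": "product"})

# Collect the listing information for each product.
# Note: the hyphenated names (lazy-load, data-src, product-number,
# data-countdown) are reconstructed; check them against the live page markup.
products = []
for product in bs_products:
    price_str = product.find("div", {"class": "price"}).text.strip()
    # Keep only the digits of the price string.
    price_int = int("".join(filter(lambda i: i.isdigit(), price_str)))
    product = {"img": product.find("img", {"class": "lazy-load"}).get("data-src"),
               "num": int(product.find("div", {"class": "product-number"}).text.split(":")[1]),
               "title": product.find("div", {"class": "title"}).next_element.strip(),
               "time_left": dt.datetime.strptime(product.find("div", {"class": "timer"}).get("data-countdown"), "%m/%d/%Y %I:%M:%S %p"),
               "price": price_int}
    products.append(product)

# Keep only the products whose title contains "LB".
filter_LB = list(filter(lambda product: "LB" in product['title'], products))
print(filter_LB)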

Output:

[{'img': 'https://sgwproductimages.azureedge.net/109/4-16-2020/56981071672752ssdt-thumb.jpg',
  'num': 91150404,
  'title': '30.00 LB Lego Mini Figures Lego People Grab Bag',
  'time_left': datetime.datetime(2020, 4, 21, 19, 20),
  'price': 444500},
 {'img': 'https://sgwproductimages.azureedge.net/5/4-14-2020/814151314749m.er-thumb.jpg',
  'num': 91000111,
  'title': '20 LB Bulk Loose Lego Bricks',
  'time_left': datetime.datetime(2020, 4, 19, 18, 6),
  'price': 4600}]
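
One caveat about the substring test "LB" in product['title'] used above: it is case-sensitive, and it also matches words that merely contain those letters. Since the question asks for several spellings (LB, LBS, Lbs, ...), a regular expression with word boundaries may be a better fit. The following is only a sketch; the sample titles and the pattern are assumptions for illustration, not data from the site.

```python
import re

# Hypothetical sample data shaped like the dictionaries built above.
products = [
    {"title": "30.00 LB Lego Mini Figures Lego People Grab Bag", "price": 4445},
    {"title": "12 lbs Assorted Building Bricks", "price": 2000},
    {"title": "BULB Replacement Kit", "price": 500},
]

# \b marks a word boundary, S? makes the plural optional,
# and re.IGNORECASE accepts "lb", "Lbs", "LBS", and so on.
WEIGHT = re.compile(r"\bLBS?\b", re.IGNORECASE)

# Plain substring test: misses "lbs" and wrongly keeps "BULB".
by_substring = [p["title"] for p in products if "LB" in p["title"]]

# Regex test: keeps both weight spellings and drops "BULB".
by_regex = [p["title"] for p in products if WEIGHT.search(p["title"])]

print(by_substring)
print(by_regex)
```

The first list contains the 30.00 LB listing and the BULB listing; the second contains the 30.00 LB and 12 lbs listings.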

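I suggest using BS4 for what it is meant for, scraping, and then using plain Python to filter your objects. I don't dispute that BS4 can do the filtering itself, but I have always found it best to implement a general solution first and then deal with the specifics where needed.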
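
For completeness, here is a sketch of letting BS4 do the filtering directly, by passing a function to find_all. It runs on an inline copy of the HTML from the question rather than on the live site; the second listing, the has_weight helper, and the WEIGHT pattern are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# The first listing is the sample from the question; the second is hypothetical.
html = """
<a class="product" href="/Item/91150404">
  <div class="title">
    30.00 LB Lego Mini Figures Lego People Grab Bag
    <br>Bids: 7
  </div>
</a>
<a class="product" href="/Item/0">
  <div class="title">
    Light Bulb Lot
    <br>Bids: 1
  </div>
</a>
"""

# Match LB or LBS as a whole word, in any letter case.
WEIGHT = re.compile(r"\bLBS?\b", re.IGNORECASE)

soup = BeautifulSoup(html, "html.parser")


def has_weight(tag):
    # Keep <div class="title"> elements whose text mentions LB/LBS.
    return (tag.name == "div"
            and "title" in tag.get("class", [])
            and WEIGHT.search(tag.get_text()) is not None)


matches = soup.find_all(has_weight)

# The first non-empty string inside each div is the title itself
# (the text after <br> is the bid count).
titles = [next(div.stripped_strings) for div in matches]
print(titles)
```

Only the Lego listing is kept, because "Bulb" does not contain LB as a separate word.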

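If you are not familiar with filter, have a look at the documentation. If you don't know what a lambda is, it is a function written in a single line of code. All filter does is loop over the object and apply the given lambda function to each item. Every item for which the lambda returns True is returned by filter.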

def func(a):
    return a + 2
func(4)  # >> 6

func = lambda a: a + 2
func(4)  # >> 6

Happy coding! :)

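Edit: for the sake of the discussion below. Suppose we want to filter a list of numbers so that only those greater than or equal to 5 remain. We can do this in several ways: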

l = [1, 2, 3, 4, 5, 6, 7]

# Traditional filtering way. Makes sense.
filtered_l = []
for i in l:
    if i >= 5:
        filtered_l.append(i)

# Lambda + Filter way
filtered_l = list(filter(lambda i: i >= 5, l))

# Function + Filter Way
def filtering(i): # Notice this function returns either True or False. 
    return i >= 5
filtered_l = list(filter(filtering, l))
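
A list comprehension is a fourth way to write the same filter, and many Python programmers find it the most readable. A minimal sketch using the same list:

```python
l = [1, 2, 3, 4, 5, 6, 7]

# List comprehension way: the condition goes after the "if".
filtered_l = [i for i in l if i >= 5]

# It produces the same result as the lambda + filter version.
same_as_filter = list(filter(lambda i: i >= 5, l))

print(filtered_l)                    # [5, 6, 7]
print(filtered_l == same_as_filter)  # True
```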

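You may ask why we use list(filter()) rather than simply filter(). That is because filter does not return a list. It returns a lazy iterator that produces its items one at a time, on demand. So we convert the filter object into a list to pull all of its items out. Going the other way, you can turn a list into an iterator with iter() (which gives you extra functionality and control):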

l = [1, 2, 3, 4, 5]
it = iter(l)  # >> a list_iterator object
next(it)  # >> 1
next(it)  # >> 2
next(it)  # >> 3
next(it)  # >> 4
next(it)  # >> 5
next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
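
The same laziness applies to filter itself. The sketch below, with a made-up noisy helper that records every value it is asked about, shows that nothing is evaluated until an item is requested and that a filter object can only be consumed once:

```python
numbers = [1, 2, 3, 4, 5, 6, 7]
calls = []


def noisy(i):
    # Record each value the predicate is called with.
    calls.append(i)
    return i >= 5


lazy = filter(noisy, numbers)
calls_before = list(calls)   # [] because nothing has been evaluated yet

first = next(lazy)           # evaluates 1, 2, 3, 4, 5 and stops at the first match
calls_after_first = list(calls)

rest = list(lazy)            # consumes the remaining items
again = list(lazy)           # the iterator is now exhausted

print(calls_before)          # []
print(first)                 # 5
print(calls_after_first)     # [1, 2, 3, 4, 5]
print(rest)                  # [6, 7]
print(again)                 # []
```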

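You may ask, "Why bother using iter instead of simply using a list?" The answer is that you can overload the __iter__ and __next__ methods in a class, which makes the class itself iterable:

import random

class RandomIterable:
    def __iter__(self):
        return self
    def __next__(self):
        if random.choice(["go", "go", "stop"]) == "stop":
            raise StopIteration  # signals "the end"
        return 1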

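This allows us to iterate over the class itself:

for eggs in RandomIterable():
    print(eggs)

Or, as you did with filter, simply collect the items into a list:

list(RandomIterable())  # >> [1]  (the result varies from run to run)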

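In this case, the list holds one 1 for every draw that came up before the word "stop" was randomly chosen. If the result is [1, 1], two draws were made before "stop" was picked on the third. Of course, this is a silly example, but hopefully you can now see how list, filter, and lambda work together to filter lists (and other iterables) in Python.
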
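
Because the random example gives a different result on every run, here is a deterministic sketch of the same protocol. The Countdown class is hypothetical and exists only to show that any object implementing __iter__ and __next__ can be passed to list and filter.

```python
class Countdown:
    """Iterator that yields start, start - 1, ..., 1 and then stops."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals "the end"
        value = self.current
        self.current -= 1
        return value


all_values = list(Countdown(5))
even_values = list(filter(lambda n: n % 2 == 0, Countdown(6)))

print(all_values)   # [5, 4, 3, 2, 1]
print(even_values)  # [6, 4, 2]
```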
Another solution, using simplified_scrapy:

from simplified_scrapy import SimplifiedDoc,req,utils
# url = 'https://www.shopgoodwill.com/Listings?st=&sg=&c=388&s=&lp=0&hp=999999&sbn=false&spo=false&snpo=false&socs=false&sd=false&sca=false&caed=4/18/2020&cadb=7&scs=false&sis=false&col=0&p=1&ps=40&desc=false&ss=0&UseBuyerPrefs=true'
# html = req.get(url)
# url = 'https://www.shopgoodwill.com/Listings?st=&sg=&c=388&s=&lp=0&hp=999999&sbn=false&spo=false&snpo=false&socs=false&sd=false&sca=false&caed=4/18/2020&cadb=7&scs=false&sis=false&col=0&p=1&ps=40&desc=false&ss=0&UseBuyerPrefs=true'
# html = requests.get(url).text
html = '''
<a class="product" href="/Item/91150404">
    <div class="title">
                30.00 LB Lego Mini Figures Lego People Grab Bag
                                        <br>Bids: 7
    </div>
</a>
'''
doc = SimplifiedDoc(html)
title_all = doc.getElementsByReg('( LB | LBS )',tag="div").text
print(title_all)

Output:

['30.00 LB Lego Mini Figures Lego People Grab Bag Bids: 7']

This results in an empty list, and I am not sure why it isn't working.

Is lambda used in this case because it is a more efficient or elegant way of incrementing than a for loop nested inside an outer for loop? Or is it perhaps because it is an open-ended increment?

lambda is used (in both cases) because filter takes its arguments as filter(function, iterable), where function is a function that returns True or False, and iterable is any Python object (a list, a tuple, and so on) that can be iterated over with the standard for i in iterable loop.