Python: clean up scraping results to return the anchor text, not the HTML

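I'm trying to get the prices of hockey sticks from a given URL. Eventually I'd also like to grab the name + URL, but I don't think that's necessary to solve this problem. Here's what I've got: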

import requests
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

url = 'https://www.prohockeylife.com/collections/senior-hockey-sticks'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

stick_names = soup.find_all(class_='product-title')
stick_prices = soup.find_all(class_='regular-product')

print(stick_prices)
The code above successfully returns the hockey stick prices, but they look like this:

[<p class="regular-product">
<span>$319.99</span>
</p>, <p class="regular-product">
<span>$339.99</span>
</p>, <p class="regular-product">
<span>$319.99</span>

I have tried to strip out the HTML, but with little success. Any pointers are appreciated.

Not sure, but I think the following may be what you're looking for:

Instead of print(stick_prices), use the following:

for name, price in zip(stick_names, stick_prices):
    print(name["href"], name.text, price.text)
The output starts like this:

    /collections/senior-hockey-sticks/products/ccm-ribcor-trigger-3d-sr-hockey-stick 

        CCM RIBCOR TRIGGER 3D SR HOCKEY STICK     

$319.99

/collections/senior-hockey-sticks/products/bauer-vapor-1x-lite-sr-hockey-stick 

        BAUER VAPOR 1X LITE SR HOCKEY STICK


$339.99

And so on.

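You need the .text attribute, which you can also extract inside a list comprehension. Calling list(zip(...)) at the end then gives you a list of (name, price) tuples.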

import requests
from bs4 import BeautifulSoup

url = 'https://www.prohockeylife.com/collections/senior-hockey-sticks'
headers = {'user-agent': 'Mozilla/5.0'}
# Pass the headers so the custom user agent is actually sent
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
# .text drops the tags; .strip() removes the surrounding whitespace
stick_names = [item.text.strip() for item in soup.find_all(class_='product-title')]
stick_prices = [item.text.strip() for item in soup.find_all(class_='regular-product')]
# Pair each name with the price at the same position
print(list(zip(stick_names, stick_prices)))
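
If you want to try the idea without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML markup, the product names, and the hrefs are made up for illustration (modeled loosely on the output shown in the question), so the real page structure may well differ. It uses get_text(strip=True), which does roughly the same job as .text.strip() here, and collects the URL, name, and price together:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the class names come from the
# question, everything else (hrefs, product names) is an assumption.
SAMPLE_HTML = """
<div class="product">
  <a class="product-title" href="/products/stick-a">
      STICK A
  </a>
  <p class="regular-product">
  <span>$319.99</span>
  </p>
</div>
<div class="product">
  <a class="product-title" href="/products/stick-b">
      STICK B
  </a>
  <p class="regular-product">
  <span>$339.99</span>
  </p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for name, price in zip(soup.find_all(class_="product-title"),
                       soup.find_all(class_="regular-product")):
    rows.append({
        "url": name.get("href"),               # None if the tag has no href
        "name": name.get_text(strip=True),     # text only, whitespace trimmed
        "price": price.get_text(strip=True),   # text of the nested <span>
    })

for row in rows:
    print(row)
```

Using name.get("href") instead of name["href"] avoids a KeyError if the element with class product-title turns out not to carry an href.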

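OK, let me try to repeat back how this works. The zip function essentially merges two variables into one list (? - maybe a "tuple"). Then, with the for loop, you iterate over each item, printing the href content (from stick_names), just the text from stick_names, and just the text from stick_prices. Close?

@Stn - Pretty much spot on (although zip doesn't really merge the variables; it just takes one element from each at a time). zip is actually very useful and worth reading up on.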
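
To make the zip behavior concrete, here is a small sketch with made-up lists (the names and prices are placeholders, not data from the site). One thing to watch for: zip stops at the shorter input, so if some product had no element with class regular-product, names and prices would silently fall out of step. Whether that happens on this particular page is an assumption I have not checked.

```python
from itertools import zip_longest

# Placeholder data; the third stick deliberately has no price.
names = ["STICK A", "STICK B", "STICK C"]
prices = ["$319.99", "$339.99"]

# zip pairs elements by position and stops at the shorter list.
pairs = list(zip(names, prices))
print(pairs)   # [('STICK A', '$319.99'), ('STICK B', '$339.99')]

# zip_longest keeps going and fills the gaps, which makes a mismatch visible.
padded = list(zip_longest(names, prices, fillvalue=None))
print(padded)  # last item is ('STICK C', None)

# A cheap sanity check before trusting the pairing.
if len(names) != len(prices):
    print("warning: name/price counts differ")
```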