Python: scraping a tag from Amazon


I am trying to scrape a tag from Amazon.

I want to scrape all of the product titles and prices. The scraped data looks like this:

Title    Price
 A        169.99
 B        79.55
 C        39.96
 D        19.90       
 E        34.99        
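However, I would also like to scrape the "Sponsored" tag (see the yellow mark in the screenshot below; the blue part is masked out of respect for the brand).

Desired output: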

Title    Price       Sponsored_Tag
 A        169.99      Yes
 B        79.55       Yes
 C        39.96       No
 D        19.90       No
 E        34.99       No 
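What have I tried?

I am using Python and Scrapy. You can see the "test" item, with which I tried to capture the Sponsored tag in several ways. All of them failed. It would be great if the fix could be made as a small change to the code below (because I also use this code for other processes).

Thanks a lot.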

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
#import re

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    start_urls = [
            "https://www.amazon.com/s?k=shaver+for+men&i=beauty&ref=nb_sb_noss_2"]

    custom_settings = {
            'FEED_URI' : 'Asin_Titles.json',
            'FEED_FORMAT' : 'json'
            }
    def parse(self, response):
        for product in response.css('.s-result-item'): 
            item = AmazonItem()

            #item['test'] = product.css('.s-info-icon').get()
            #item['test'] = product.css('.s-min-height-extra-large').get()
            item['test'] = product.css('.a-spacing-micro').get()

            yield item


class AmazonItem(scrapy.Item):
    test = scrapy.Field()


configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(AmazonProductSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
Update: this is what we have in "product".

It looks like I am not capturing the "Sponsored" tag there either.

"items": "<div data-asin=\"B01859QHJU\" data-index=\"0\" class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 s-result-item sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n    \n\n\n\n\n\n\n\n\n<div class=\"s-expand-height s-include-content-margin s-border-bottom\">\n<div class=\"a-section a-spacing-medium\">\n\n\n<div class=\"sg-row\">\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        <div class=\"a-section a-spacing-micro s-min-height-extra-large\">\n            \n                \n\n\n<span aria-label=\"Amazon's Choice\">\n    \n\n\n\n\n<a class=\"a-link-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU/ref=ice_ac_b_dpb\">\n    \n        \n            \n                \n\n\n\n\n<span data-component-type=\"s-status-badge-component\" data-component-props='{\"badgeType\":\"amazons-choice\",\"asin\":\"B01859QHJU\"}' class=\"rush-component\">\n  <div class=\"a-row a-badge-region\"><span id=\"B01859QHJU\" class=\"a-badge\" aria-labelledby=\"B01859QHJU-label B01859QHJU-supplementary\" data-a-badge-supplementary-position=\"right\" tabindex=\"0\" data-a-badge-type=\"status\"><span id=\"B01859QHJU-label\" class=\"a-badge-label\" data-a-badge-color=\"sx-gulfstream\" aria-hidden=\"true\"><span class=\"a-badge-label-inner a-text-ellipsis\">\n    \n      <span class=\"a-badge-text\" data-a-badge-color=\"sx-cloud\">Amazon's </span>\n    \n      <span class=\"a-badge-text\" data-a-badge-color=\"ac-orange\">Choice</span>\n    \n  </span></span><span id=\"B01859QHJU-supplementary\" class=\"a-badge-supplementary-text a-text-ellipsis\" aria-hidden=\"true\">for electric razor</span></span></div>\n</span>\n\n            \n        \n        \n    \n</a>\n\n</span>\n\n            \n        </div>\n    </div></div>\n</div>\n\n<div class=\"sg-row\">\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 
sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n        <div class=\"a-section a-spacing-none\">\n            \n\n\n\n\n\n<span data-component-type=\"s-product-image\" class=\"rush-component\">\n    \n    <a class=\"a-link-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU\">\n        <div class=\"a-section aok-relative s-image-square-aspect\">\n            \n                \n                    <img src=\"https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL320_.jpg\" class=\"s-image\" alt=\"Philips Norelco Electric Shaver 2100, S1560/81\" srcset=\"https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL320_.jpg 1x, https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL480_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL640_QL65_.jpg 2x, https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL800_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL960_QL65_.jpg 3x\" data-image-index=\"0\" data-image-load=\"\" data-image-latency=\"s-product-image\" data-image-source-density=\"1\" onload=\"window.uet &amp;&amp; uet('cf')\">\n                \n                \n            \n        </div>\n    </a>\n</span>\n\n        </div>\n        \n  </div></div>\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n        <div class=\"a-section a-spacing-none a-spacing-top-small\">\n            \n\n\n\n\n<h2 class=\"a-size-mini a-spacing-none a-color-base s-line-clamp-4\">\n    \n    \n        \n\n\n\n\n<a class=\"a-link-normal a-text-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU\">\n    \n        \n            \n                <span class=\"a-size-base-plus a-color-base a-text-normal\">Philips Norelco Electric Shaver 2100, S1560/81</span>\n            \n        \n        \n    \n</a>\n\n    \n</h2>\n\n        
</div>\n        \n            <div class=\"a-section a-spacing-none a-spacing-top-micro\">\n                <div class=\"a-row a-size-small\">\n\n\n<span aria-label=\"4.1 out of 5 stars\">\n    \n\n\n\n\n\n\n    \n        <span class=\"a-declarative\" data-action=\"a-popover\" data-a-popover='{\"max-width\":\"700\",\"closeButton\":false,\"position\":\"triggerBottom\",\"url\":\"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&amp;asin=B01859QHJU&amp;ref=acr_search__popover&amp;contextId=search\"}'>\n            \n            <a href=\"javascript:void(0)\" class=\"a-popover-trigger a-declarative\"><i class=\"a-icon a-icon-star-small a-star-small-4 aok-align-bottom\"><span class=\"a-icon-alt\">4.1 out of 5 stars</span></i><i class=\"a-icon a-icon-popover\"></i></a>\n        </span>\n    \n    \n\n\n</span>\n\n\n\n<span aria-label=\"3,260\">\n    \n\n\n\n\n<a class=\"a-link-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU#customerReviews\">\n    \n        \n            \n                <span class=\"a-size-base\">3,260</span>\n            \n        \n        \n    \n</a>\n\n</span>\n</div>\n            </div>\n        \n  </div></div>\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n        \n            <div class=\"a-section a-spacing-none a-spacing-top-small\">\n                <div class=\"a-row a-size-base a-color-base\"><div class=\"a-row\">\n\n\n\n\n<a class=\"a-size-base a-link-normal s-no-hover a-text-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU\">\n    \n        \n            \n                <span class=\"a-price\" data-a-size=\"l\" data-a-color=\"base\"><span class=\"a-offscreen\">$39.96</span><span aria-hidden=\"true\"><span class=\"a-price-symbol\">$</span><span class=\"a-price-whole\">39<span class=\"a-price-decimal\">.</span></span><span 
class=\"a-price-fraction\">96</span></span></span>\n            \n        \n        \n    \n</a>\n</div></div>\n            </div>\n        \n        \n            <div class=\"a-section a-spacing-none a-spacing-top-micro\">\n                <div class=\"a-row a-size-base a-color-secondary s-align-children-center\"><div class=\"a-row s-align-children-center\">\n\n\n\n\n<span class=\"aok-inline-block s-image-logo-view\">\n  <span class=\"aok-relative s-icon-text-medium s-prime\">\n    <i class=\"a-icon a-icon-prime a-icon-medium\" role=\"img\" aria-label=\"Amazon Prime\"></i>\n  </span>\n  <span>\n    \n  </span>\n</span>\n\n\n\n<span aria-label=\"Get it as soon as Tomorrow, Jul 11\">\n    <span>Get it as soon as </span><span class=\"a-text-bold\">Tomorrow, Jul 11</span>\n</span>\n</div><div class=\"a-row\">\n\n\n<span aria-label=\"FREE Shipping by Amazon\">\n    <span>FREE Shipping by Amazon</span>\n</span>\n</div></div>\n            </div>\n        \n        \n        \n        \n        \n  </div></div>\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n  </div></div>\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n        \n  </div></div>\n</div>\n</div>\n</div>\n\n</div></div>",

You can use the CSS selector :contains("Sponsored") to test whether a result is an ad:

import requests
from bs4 import BeautifulSoup
from textwrap import shorten

url = 'https://www.amazon.com/s?k=shaver+for+men&i=beauty&ref=nb_sb_noss_2'
headers = {'User-Agent': 'Mozilla/5.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

print('{: ^55}{: ^12}{: ^13}'.format('Title', 'Price', 'Sponsored_Tag'))
for div in soup.select('div[data-asin]'):
    title_tag = div.select_one('span.a-text-normal')
    if title_tag is None:
        # Some matched divs may lack a title; skip them instead of raising AttributeError.
        continue
    price_tag = div.select_one('.a-offscreen')
    price = price_tag.text if price_tag else '-'
    # A result counts as an ad if any <span> inside it contains the text "Sponsored".
    # Newer soupsieve releases prefer the spelling :-soup-contains("Sponsored").
    sponsored = 'Yes' if div.select_one('span:contains("Sponsored")') else 'No'
    print('{: <55}{: ^12}{: ^13}'.format(shorten(title_tag.text, 55), price, sponsored))

@Roverflow I don't know Scrapy, but you could try testing whether product.css('span:contains("Sponsored")') exists... As I said, I don't use Scrapy, so it may not work.

The contains CSS selector works fine in Scrapy, so your solution should work!

@Roverflow I suggest debug-printing the tags that product contains and checking whether the Sponsored string is among them.

@Roverflow It seems you are using an older version of BeautifulSoup. I am on version beautifulsoup4==4.7.1

@Roverflow That is a question that doesn't fit in this comment section. It is better to open a new question.