Python: How do I collapse/unpack scraped data into a table-like format before exporting to a file or database?


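I am currently using Scrapy to scrape Amazon pages. I would like Scrapy to return output that is easy to convert into a table (e.g. a DataFrame, MySQL, etc.). For example, here is what my spider outputs to a JSON file (7 columns, 2 rows per page):

When I convert it to a DataFrame, it looks like this (I still need to clean it up):

My question is essentially how to collapse the initial output so that it looks like a table, or can easily be converted into one. It would be great if this could somehow be added to the parse function below. I initially tried using a for loop to take the first value of each list. Any ideas? Thank you for taking the time to read this.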

import scrapy
from ..items import AmazonscrapeItem

class AmazonSpiderSpider(scrapy.Spider):
    page_number = 2
    name = 'amazon_scraper'
    start_urls = [
        'https://www.amazon.co.uk/s?i=stripbooks&bbn=266239&rh=n%3A266239%2Cp_72%3A184315031%2Cp_36%3A389028011&dc&page=1&fst=as%3Aoff&qid=1598942460&rnid=389022011&ref=sr_pg_1'
    ]

    def parse(self, response, **kwargs):
        items = AmazonscrapeItem()

        # if multiple classes --> .css("::text").extract()
        product_name = response.css('.a-color-base.a-text-normal::text').extract()
        product_author = response.css('.a-color-secondary .a-size-base.a-link-normal').css('::text').extract()
        product_nbr_reviews = response.css('.a-size-small .a-link-normal .a-size-base').css('::text').extract()
        product_type = response.css('.a-spacing-top-small .a-link-normal.a-text-bold').css('::text').extract()
        product_price = response.css('.a-spacing-top-small .a-price-whole').css('::text').extract()
        product_more_choice = response.css('.a-spacing-top-mini .a-color-secondary .a-link-normal').css('::text').extract()
        # this only selects the element that has the image --> need stuff inside src (source attr)
        product_imagelink = response.css('.s-image::attr(src)').extract() # want attr of src

        items['product_name'] = product_name
        items['product_author'] = product_author
        items['product_nbr_reviews'] = product_nbr_reviews
        items['product_type'] = product_type
        items['product_price'] = product_price
        items['product_more_choice'] = product_more_choice
        items['product_imagelink'] = product_imagelink


        # CAN IT BE UNPACKED HERE SOMEHOW??


        yield items

        next_page = 'https://www.amazon.co.uk/s?i=stripbooks&bbn=266239&rh=n%3A266239%2Cp_72%3A184315031%2Cp_36%3A389028011&dc&page='+ str(AmazonSpiderSpider.page_number)+'&fst=as%3Aoff&qid=1598942460&rnid=389022011&ref=sr_pg_'+ str(AmazonSpiderSpider.page_number)
        if AmazonSpiderSpider.page_number <3:
            AmazonSpiderSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)

Try using .get() instead of extract(). extract() returns a list of every match, whereas get() returns only the first match as a string. pandas may be getting confused by the list-valued output.
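
The question asks how to turn one item holding a list per column into table-like rows. A minimal sketch of that step in plain Python follows, assuming the lists are in the same order on the page. The helper name `columns_to_rows` and the sample values are hypothetical and not part of the original spider. Note that padding with `None` only keeps the row count consistent: if a product is missing a field in the middle of the page, later values will shift up, so selecting each product's container first and extracting fields relative to it is usually the more robust approach.

```python
from itertools import zip_longest


def columns_to_rows(columns):
    """Turn {'col': [v1, v2], ...} into [{'col': v1, ...}, {'col': v2, ...}]."""
    names = list(columns)
    # zip_longest pads shorter columns with None so no values are dropped.
    return [
        dict(zip(names, values))
        for values in zip_longest(*columns.values(), fillvalue=None)
    ]


# Hypothetical sample shaped like the single item the spider yields per page.
scraped = {
    "product_name": ["Book A", "Book B"],
    "product_author": ["Author A", "Author B"],
    "product_price": ["5"],  # assumed: one price missing on the page
}

rows = columns_to_rows(scraped)
for row in rows:
    print(row)
```

Inside `parse`, each row could then be yielded on its own (Scrapy callbacks may yield plain dicts), so that the JSON export contains one record per product rather than one record per page.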