Python: exporting Scrapy results to CSV

I am trying to scrape "healthunlocked.com", and I don't know why the extracted data does not show up in the CSV file.

import csv
import scrapy

class HealthSpider(scrapy.Spider):
    name = 'health'
    #allowed_domains = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    start_urls = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    itemlist = []

    def parse(self, response):
        all_div_posts = response.xpath('//div[@class="results-posts"]')
        for posts in all_div_posts:
            items = {}
            items['title'] = posts.xpath('//h3[@class="results-post__title"]/text()').extract()
            items['post'] = posts.xpath('//div[@class="results-post__body hidden-xs"]/text()').extract()
            self.itemlist.append(items)
        with open("outputfile.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ['title', 'post'])
            writer.writeheader()
            for data in self.itemlist:
                writer.writerow(data)

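EDIT: I ran your code and it gives me a file with results.

Scrapy has built-in functionality for saving results to CSV, so you don't have to write it yourself.

You only need to yield the items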
def parse(self, response):
    
    all_div_posts = response.xpath('//div[@class="results-posts"]')
    
    for posts in all_div_posts:
        items = {} 
        items['title']= posts.xpath('//h3[@class="results-post__title"]/text()').extract()
        items['post']= posts.xpath('//div[@class="results-post__body hidden-xs"]/text()').extract()

        yield items

and run the spider with the option -o outputfile.csv
scrapy runspider your_spider.py -o outputfile.csv


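EDIT:
I made some changes and now both versions give the same result. I checked this by comparing the two CSV files with the program diff.

Because I organize the items differently, I can pass the list straight to writer.writerows(self.itemlist) without a for loop (and without zip()).

I also use .get() (the newer name for extract_first()) instead of extract(), so that I get a single title and a single post and can put them together as a pair. I can then use strip() to remove the surrounding whitespace.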
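
To see why the layout of the items matters, here is a minimal standard-library sketch with no Scrapy involved. The sample titles and posts are made up, and write_csv is a hypothetical helper. It only illustrates that one dict holding two whole lists becomes a single CSV row, while one dict per post (paired with zip() and cleaned with strip()) gives one row per post.

```python
import csv
import io

# Made-up sample data standing in for what extract() returns: two parallel lists.
titles = ["  First title ", "Second title  "]
posts = [" First body ", "  Second body"]


def write_csv(rows):
    """Write dict rows to an in-memory CSV and return the resulting text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, ["title", "post"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One dict whose values are whole lists: the csv module calls str() on each list,
# so all titles end up in a single cell of a single data row.
one_row_csv = write_csv([{"title": titles, "post": posts}])

# One dict per post: zip() pairs each title with its post, strip() trims whitespace.
per_post_rows = [
    {"title": title.strip(), "post": post.strip()}
    for title, post in zip(titles, posts)
]
per_post_csv = write_csv(per_post_rows)

print(one_row_csv)
print(per_post_csv)
```

With .get() on a selector scoped to one post, the pairing happens naturally and zip() is not needed, which is the approach taken in the two versions that follow.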

Version 1
import scrapy
import csv

class HealthSpider(scrapy.Spider):
    name = 'health'
    #allowed_domains = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    start_urls = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    
    itemlist = []

    def parse(self, response):
        
        all_div_posts = response.xpath('//div[@class="results-post"]')
        print('len(all_div_posts):', len(all_div_posts))
        
        for one_post in all_div_posts:
            #print('>>>>>>>>>>>>')
            one_item = {
                'title': one_post.xpath('.//h3[@class="results-post__title"]/text()').get().strip(),
                'post': one_post.xpath('.//div[@class="results-post__body hidden-xs"]/text()').get().strip(),
            }
            self.itemlist.append(one_item)

            #yield one_item
          
                   
        with open("outputfile.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ['title','post'])
            writer.writeheader()           
            writer.writerows(self.itemlist)

Version 2
import scrapy

class HealthSpider(scrapy.Spider):
    name = 'health'
    #allowed_domains = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    start_urls = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    
    #itemlist = []

    def parse(self, response):
        
        all_div_posts = response.xpath('//div[@class="results-post"]')
        print('len(all_div_posts):', len(all_div_posts))
        
        for one_post in all_div_posts:
            #print('>>>>>>>>>>>>')
            one_item = {
                'title': one_post.xpath('.//h3[@class="results-post__title"]/text()').get().strip(),
                'post': one_post.xpath('.//div[@class="results-post__body hidden-xs"]/text()').get().strip(),
            }
            #self.itemlist.append(one_item)

            yield one_item
          
                   
        #with open("outputfile.csv", "w", newline="") as f:
        #    writer = csv.DictWriter(f, ['title','post'])
        #    writer.writeheader()           
        #    writer.writerows(self.itemlist)

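Try the following to get exactly the results you see on that web page. The content is dynamic, so you need to request the JSON content to get the required results. I used a custom method to write the data to a CSV file. With the approach below, the CSV file is opened only once and is closed after all the data has been written.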
import csv
import json
import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ['https://solaris.healthunlocked.com/posts/positivewellbeing/popular']

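    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Open the CSV once for the whole crawl; utf-8-sig helps Excel detect the encoding
        self.outfile = open("output.csv","w",newline="",encoding="utf-8-sig")
        self.writer = csv.writer(self.outfile)
        self.writer.writerow(['title','post content'])

    def closed(self,reason):
        # Scrapy calls closed() when the spider finishes
        self.outfile.close()

    def parse(self,response):
        # response.text replaces the deprecated response.body_as_unicode()
        for posts in json.loads(response.text):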
            title = ' '.join(posts['title'].split())
            post = ' '.join(posts['bodySnippet'].split())
            self.writer.writerow([title,post])
            yield {'title':title,'post':post}
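
The parsing step can be tried offline. The sketch below rests on assumptions: the JSON payload is invented and shaped only after the title and bodySnippet keys used above, and normalize is a hypothetical helper. It shows how ' '.join(text.split()) collapses runs of whitespace before each row is written.

```python
import csv
import io
import json

# Invented payload; only the 'title' and 'bodySnippet' keys come from the answer above.
payload = json.dumps([
    {"title": "Feeling\n better  today", "bodySnippet": "A short\t\tsnippet "},
    {"title": " Second post", "bodySnippet": "Another   snippet"},
])


def normalize(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["title", "post content"])
for post in json.loads(payload):
    writer.writerow([normalize(post["title"]), normalize(post["bodySnippet"])])

csv_text = buffer.getvalue()
print(csv_text)
```

Collapsing whitespace this way keeps every post on a single physical line of the CSV, which makes the file easier to inspect in a text editor.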

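
You can run scrapy with -o outputfile.csv and it will save the results in a CSV file. You don't have to write code for that; you only have to yield every row of data. But if you really want to do it yourself, your problem may be "w" (write mode): it deletes the previous content every time the file is opened again, and parse may be executed many times, so the previous content may be deleted many times. You can always use print() to see the values in your variables. Maybe it never gets any data, so there is nothing to save.

Thank you very much. I want to save the data of each post as one row. This command does not work for me.

@TaherehMaghsoudi but this code saves every post in a separate row. Maybe you are opening the CSV in the wrong way.

Sorry, I'm a bit confused. What do you mean by opening the CSV in the wrong way? I run the following command: scrapy crawl health -o output.csv

I compared the results of your code and mine. Strangely, my version (with -o output.csv) gives the data in separate rows, but your version (with csv.DictWriter) saves everything in one row: it converts the list with all the posts into a single string.

I removed the code related to CSV export (csv.DictWriter) from the program and then tried to export the data with the command you mentioned, but all the data is still saved in the same row.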
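
The write-mode concern raised in the first comment can be checked without Scrapy. This is a minimal sketch using a temporary file; save_batch is a hypothetical stand-in for a callback that reopens the file with "w" on every call, and the batch contents are made up. Note that the spider in the question accumulates items in self.itemlist before writing, so this shows the general hazard rather than a confirmed cause of that particular problem.

```python
import csv
import os
import tempfile


def save_batch(path, rows):
    """Reopen the file in "w" mode, which truncates whatever was there before."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "post"])
        writer.writerows(rows)


def read_rows(path):
    """Read the whole CSV back as a list of rows."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


first_batch = [["first title", "first post"]]
second_batch = [["second title", "second post"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "outputfile.csv")

    # Two separate calls, as if the callback ran once per response.
    save_batch(path, first_batch)
    save_batch(path, second_batch)
    overwritten = read_rows(path)

    # Opening the file once and writing every batch keeps all rows.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "post"])
        for batch in (first_batch, second_batch):
            writer.writerows(batch)
    kept = read_rows(path)

print(overwritten)
print(kept)
```

Letting Scrapy handle the export with -o avoids this question entirely, since the spider only yields items and never opens the file itself.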