How do I process scraped output for NLP in Python?


I am trying to extract text data from company websites using Python Scrapy.

The code below scrapes the text without errors, but the output still needs further cleanup before it can be used for NLP.

Spider code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest
from ..items import ScrapingtestItem


class TestscraperSpider(scrapy.Spider):
    name = 'Testscraper'
    search_pages = ['https://www.impossiblefoods.com/', 'http://www.ycombinator.com/']

    custom_settings = {"DOWNLOAD_DELAY": 1}  # seconds between requests

    # Note: rules only take effect on a CrawlSpider subclass; a plain
    # scrapy.Spider ignores them. A Rule callback must also be a parse
    # method that accepts a response, not start_requests.
    rules = (
        Rule(LinkExtractor(allow=search_pages), follow=True),
        Rule(LinkExtractor(allow=search_pages, unique=True), callback='parse'),
    )

    def start_requests(self):
        # Render each page through Splash so JavaScript-generated text is captured
        for url in self.search_pages:
            yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        item = ScrapingtestItem()
        # .extract() returns an empty list when nothing matches, so the
        # bare try/except blocks around each call were unnecessary
        item['extracted_text_ptag'] = response.xpath('//p/text()').extract()
        item['extracted_text_atag'] = response.xpath('//a/text()').extract()
        item['extracted_text_pretag'] = response.xpath('//pre/text()').extract()
        item['extracted_text_strongtag'] = response.xpath('//strong/text()').extract()
        yield item
Item code:

# -*- coding: utf-8 -*-

import scrapy


class ScrapingtestItem(scrapy.Item):
    extracted_text_ptag = scrapy.Field()
    extracted_text_atag = scrapy.Field()
    extracted_text_pretag = scrapy.Field()
    extracted_text_strongtag = scrapy.Field()
Output JSON file:

[
{"extracted_text_ptag": ["It\u2019s here. A delicious burger made entirely from plants for people who love meat. No more compromises. Ready for an introduction?", "We're committed to creating a better planet and better meat, from plants. Meet heme, the magic molecule that makes it all possible.", "Turn On Sound ", "The world loves meat. But relying on cows to make meat is land-hungry, water-thirsty, and pollution-heavy. That\u2019s why we set out to do the impossible: make delicious meats that are good for people and the planet.", "It all starts with the Impossible Burger. But our world-renowned team of scientists are hard at work inventing more ways to make the foods we love most.", "We spent the past five years researching what makes meat unique: the sizzle, the smell, the juicy first bite. Then we set out to find precisely the right ingredients from the plant kingdom to recreate the experience meat lovers crave. You\u2019ve never tasted plants like this.", "Every time you choose a quarter-pound Impossible Burger instead of a burger made from a cow, you can make a huge difference without compromising.", "Welcome to the era of plant-based meat", "The patty sizzles like beef in the pan, which gets my appetite going.", "You\u2019re trying to do in meat, what Tesla did in electric cars."], "extracted_text_atag": ["Our Burger", "Locations", "About Us", "FAQs", "Press", "Blog", "Home", "Our Burger", "Locations", "About Us", "FAQs", "Press", "Blog", "Meet Heme", "About Us", "Learn More", "Read More ", "Read More ", "Read More ", "View More Articles", "twitter", "facebook", "youtube", "Our Burger", "About Us", "FAQs", "Careers", "Press", "Locations", "Facebook", "Twitter", "Instagram", "YouTube", "+1 855 877 6365", "Privacy Policy", "Terms of Service"], "extracted_text_pretag": [], "extracted_text_strongtag": []},
{"extracted_text_ptag": ["Twice a year we invest a small amount of money (", ") in a large number of startups.", "The startups move to Silicon Valley for 3 months, during which we work intensively with them to get the company into the best possible shape and refine their pitch to investors. Each cycle culminates in Demo Day, when the startups present their companies to a carefully selected, invite-only audience.", "But YC doesn't end on Demo Day. We and the YC alumni network continue to help founders for the life of their company, and beyond.", " ", " ", "\"Y Combinator is the best program for creating top-end entrepreneurs that has ever existed.\"", "\"I\u00a0doubt that Stripe would have worked\u00a0without YC. It's that simple. Acquiring early customers, figuring out who to hire, closing deals with banks, raising money --\u00a0YC's partners were closely involved and crucially helpful.\"", "\"I've been fortunate to engage with the YC community at past events over the last few years, and always walk away impressed with the passion and caliber of talent that YC brings together.\""], "extracted_text_atag": ["About", "Companies", "People", "YC Continuity", "Startup School", "Blog", "Resources", "Apply", "$120k", "Learn More", "Application FAQs", "Female Founder Stories", "More Quotes", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", "About", "Contact", "Press", "Legal", "Security", "Apply"], "extracted_text_pretag": [], "extracted_text_strongtag": []}
]
From this list of dicts, I want to remove blank entries such as " " or "\n" from the lists, and then aggregate the remaining strings into a single passage per company for NLP. Ultimately I want to build a dict whose keys are the company names and whose values are the aggregated sentences, like:

{"company_name1": *aggregated_sentence1*, "company_name2": *aggregated_sentence2*}
How should I process the scraped output? Any answers or suggestions would be appreciated.


Thanks in advance.

You could start by joining all the extracted p elements with ' '.join(item['extracted_text_ptag']). Regarding the handling of the Unicode characters, you may want to take a look at the [unidecode](https://pypi.org/project/Unidecode/) package. It can convert fancy quotes and long dashes into their ASCII equivalents.
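Here is a minimal sketch of that post-processing step, assuming the items were exported to a file named items.json (a hypothetical path; use whatever your feed export actually writes) and that the company names are supplied by hand, since the scraped items carry no name field:

# -*- coding: utf-8 -*-
import json

from unidecode import unidecode

# Hypothetical: company names paired by position with the scraped items
COMPANY_NAMES = ['Impossible Foods', 'Y Combinator']


def aggregate(fragments):
    # Normalize fancy quotes/dashes to ASCII, strip whitespace, and drop
    # entries that were empty or whitespace-only (e.g. " ", "\n"),
    # then join everything into one passage
    cleaned = (unidecode(fragment).strip() for fragment in fragments)
    return ' '.join(text for text in cleaned if text)


with open('items.json', encoding='utf-8') as f:
    items = json.load(f)

result = {
    name: aggregate(item['extracted_text_ptag'])
    for name, item in zip(COMPANY_NAMES, items)
}
print(result)

Note that Scrapy does not guarantee the order in which items are written, so pairing names with items by position is fragile. A more robust design is to store response.url (or the company name derived from it) as an extra field in parse() and key the dict on that instead.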