Detecting the language of text from crawled websites with Python


python scrapy web-crawler


I have written several spiders for different websites that output the article text and the URL. For example:

import scrapy
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup

stop_words = set(stopwords.words("german"))


class FruehstueckSpider(scrapy.Spider):
    name = "fruestueckerinnen"

    def start_requests(self):
        urls = [
            'https://www.diefruehstueckerinnen.at/stadt/wien/',
        ]
        urls += [urls[0] + 'page/' + str(i) + '/' for i in range(1,17)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hrefs = response.css('div.text > a')
        yield from response.follow_all(hrefs, callback = self.parse_attr)

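    def parse_attr(self, response):
        # Extract the article container and strip the HTML markup.
        html = response.css('.content-inner.single-content').get()
        raw_text = BeautifulSoup(html, "html.parser").find(class_="content-inner single-content").text
        # Replace everything except letters, underscores and hyphens with spaces.
        # A raw string avoids the invalid-escape warning for '\-'.
        cleaned = re.sub(pattern=r'[^a-zA-Z_\-ÖöÜüÄäßèé]', repl=' ', string=raw_text)
        # Drop stop words, tokens that start with a digit, and single-character tokens.
        tokens = [i for i in word_tokenize(cleaned)
                  if i not in stop_words and not re.match('[0-9]', i) and len(i) > 1]

        yield {
            'text': ' '.join(tokens),
            'url': response.request.url,
        }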

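I would like to detect the language in which the whole text is written. Does it make sense to write it as another attribute alongside 'text' and 'url'? I know that langdetect has a function called detect (its input is a string), but how do I use it in this case?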

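The lang attribute is the HTML attribute that is supposed to define the language of the page. I suggest using it as your reference for each site, because it is the most direct way to identify the language. The attribute exists to help speech software choose the correct language for pronunciation.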
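As a rough illustration of reading that attribute, here is a minimal sketch that uses only the standard library. The names `LangAttributeParser` and `declared_language` and the sample markup are made up for this example. Inside a Scrapy callback, `response.xpath('/html/@lang').get()` should give the same value without a separate parser.

```python
from html.parser import HTMLParser


class LangAttributeParser(HTMLParser):
    """Records the lang attribute of the first <html> tag it sees."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names are lower-cased.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_language(html):
    """Return the language declared on <html>, or None if it is not set."""
    parser = LangAttributeParser()
    parser.feed(html)
    return parser.lang


# Hypothetical sample page, not taken from the crawled site.
sample = '<html lang="de-AT"><body><p>Guten Morgen</p></body></html>'
print(declared_language(sample))
```

The function returns None when the page does not declare a language, so the caller still needs a fallback for those pages.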

...

You can add another field to the yielded item, like this:

from langdetect import detect  # add this to your imports


# change the parse_attr method like this
def parse_attr(self, response):
    # Same cleaning steps as before, kept in a variable so the text can be reused.
    html = response.css('.content-inner.single-content').get()
    raw_text = BeautifulSoup(html, "html.parser").find(class_="content-inner single-content").text
    cleaned = re.sub(pattern=r'[^a-zA-Z_\-ÖöÜüÄäßèé]', repl=' ', string=raw_text)
    text = ' '.join([i for i in word_tokenize(cleaned)
                     if i not in stop_words and not re.match('[0-9]', i) and len(i) > 1])
    # detect() takes a string and returns a language code such as 'de'.
    language = detect(text)

    yield {
        'text': text,
        'language': language,
        'url': response.request.url,
    }
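Repeating this change in every spider's callback gets tedious once there are several spiders. One possible alternative is a Scrapy item pipeline that adds the field to every item in one place. The sketch below is an assumption about how that could look, not part of the original answer. The class name `LanguagePipeline`, the `detect_fn` parameter and the module path in the settings comment are hypothetical.

```python
class LanguagePipeline:
    """Adds a 'language' field to every item that carries a 'text' field."""

    def __init__(self, detect_fn=None):
        if detect_fn is None:
            # Imported lazily so the class can be exercised without langdetect.
            from langdetect import detect as detect_fn
        self.detect_fn = detect_fn

    def process_item(self, item, spider):
        text = item.get('text') or ''
        try:
            # Skip detection for empty text; langdetect cannot classify it.
            item['language'] = self.detect_fn(text) if text.strip() else None
        except Exception:
            # langdetect raises LangDetectException when it finds no usable features.
            item['language'] = None
        return item


# Enabled in settings.py (hypothetical project/module name):
# ITEM_PIPELINES = {"myproject.pipelines.LanguagePipeline": 300}
```

This works as written for spiders that yield plain dicts, like the one in the question. With `scrapy.Item` subclasses, a `language` field would have to be declared first. langdetect can also give different results between runs on short or ambiguous text unless `DetectorFactory.seed` is set.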

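So if I do this for every spider, how do I output it? Take the spider I posted as an example. I was thinking I could use the detect function and actually iterate over every text output of every spider, but that seems more complicated.

requesthandler = request.css, htmltag = str(str(requesthandler.html).replace("")). You should assign the returned bs4 object to a new object; then you can retrieve the html tag and strip it directly. However, replace requesthandler = request.css with something more appropriate.

Thanks! The problem is that the language declared at the top of the page source sometimes does not match the language of the text. On one page, for example, the language is set to English, but the text is German.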
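Since the declared and the detected language can disagree, one option is to store both and flag the mismatch instead of trusting either source blindly. The following is a minimal sketch under that assumption. The helper names `primary_subtag` and `resolve_language` and the output keys are invented for illustration.

```python
def primary_subtag(tag):
    """Reduce a language tag such as 'de-AT' or 'en_US' to 'de' or 'en'."""
    if not tag:
        return None
    return tag.replace('_', '-').split('-')[0].strip().lower() or None


def resolve_language(declared, detected):
    """Prefer the detected language and fall back to the declared one."""
    declared = primary_subtag(declared)
    detected = primary_subtag(detected)
    return {
        'language': detected or declared,
        'declared_language': declared,
        # True only when both values are present and differ.
        'mismatch': bool(declared and detected and declared != detected),
    }


# Page declares English, but the detector reports German.
print(resolve_language('en-US', 'de'))
```

Comparing only the primary subtag treats 'de-AT' and 'de' as the same language. That also collapses codes such as 'zh-cn' and 'zh-tw', which may or may not be what you want.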