Python: How to detect and separate the different languages in a multilingual text?
Tags: python, python-3.x, nlp, spacy

I have an input text written in two languages (English and French). I would like to split it into one sub-text per language: first detect the languages present in the text (there may be more than two), then cut out the part written in each language.

Input:
You will then discover galleries in which 25-million bottles rest in the cellars, waiting for the
perfect moment to be tasted. From the bottle to the salmanazar, from the youngest wines to the oldest
vintages. - Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent dans les
caves, attendant le parfait moment pour être dégustées.
Expected output:
This text contains two languages : "fr" and "en"
text_in_english= "You will then discover galleries in which 25-million bottles rest in the cellars,
waiting for the perfect moment to be tasted. From the bottle to the salmanazar, from the youngest
wines to the oldest vintages."
text_in_french= "- Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent
dans les caves, attendant le parfait moment pour être dégustées."
How can we do this?

I suggest using nltk (uncomment the punkt download if you need it) together with langid. The steps are:
1. Split the text into sentences
2. Predict the language of each sentence
3. Add the predicted sentences to a dictionary to group them by language, in order.
Python 3:
from langid import classify
from nltk import tokenize
import nltk
from collections import defaultdict
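# Uncomment on the first run to fetch the sentence tokenizer data
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt').
#nltk.download('punkt')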
mytext = """
You will then discover galleries in which 25-million bottles rest in the cellars, waiting for the
perfect moment to be tasted. From the bottle to the salmanazar, from the youngest wines to the oldest
vintages. - Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent dans les
caves, attendant le parfait moment pour être dégustées.
"""
mytext = mytext.replace('\n', ' ').replace('\r', '')
sentences = tokenize.sent_tokenize(mytext)
languages = defaultdict(list)
for sentence in sentences:
    # classify() returns a (language_code, score) tuple; keep only the code
    languages[str(classify(sentence)[0])].append(sentence)
for k, v in languages.items():
    print(k, v)
#en [' You will then discover galleries in which 25-million bottles rest in the cellars, waiting for the perfect moment to be tasted.', 'From the bottle to the salmanazar, from the youngest wines to the oldest vintages.']
#fr ['- Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent dans les caves, attendant le parfait moment pour être dégustées.']
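To get from the per-language lists to the single strings shown in the expected output, the sentences of each language can be joined back together. The following is only a sketch: `split_by_language` and `toy_detect` are hypothetical names, and `toy_detect` is a crude stopword counter standing in for a real detector so that the snippet runs without third-party packages.

```python
from collections import defaultdict

# Assumption: a tiny list of French function words, for illustration only.
FRENCH_HINTS = {"le", "les", "des", "dans", "pour", "vous", "être", "alors"}

def toy_detect(sentence):
    """Toy stand-in for a real language detector (not production quality)."""
    words = {w.strip('.,;:-"').lower() for w in sentence.split()}
    return "fr" if len(words & FRENCH_HINTS) >= 2 else "en"

def split_by_language(sentences, detect=toy_detect):
    """Group sentences by detected language, then join each group in order."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[detect(sentence)].append(sentence.strip())
    return {lang: " ".join(parts) for lang, parts in groups.items()}

sentences = [
    "You will then discover galleries in which 25-million bottles rest in "
    "the cellars, waiting for the perfect moment to be tasted.",
    "From the bottle to the salmanazar, from the youngest wines to the "
    "oldest vintages.",
    "- Vous trouverez alors des galeries parmi lesquelles 25 000 000 "
    "bouteilles reposent dans les caves, attendant le parfait moment pour "
    "être dégustées.",
]

texts = split_by_language(sentences)
print("This text contains", len(texts), "languages:", ", ".join(sorted(texts)))
for lang, text in texts.items():
    print(f"text_in_{lang} = {text!r}")
```

With langid installed, passing `detect=lambda s: classify(s)[0]` should give the same grouping as the answer above, since `classify` returns a `(language, score)` pair.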
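The problem is that sometimes the text has the form "English sentence" - "French sentence", and sometimes there are other characters such as {-;,;.%,} between the two languages.

No problem: just add those characters to the "mytext.replace()" line, or try cleaning the text in a more advanced way with nltk.

I don't think your question has anything to do with spacy. If I were you, I would search Google for language detection, split the text into sentences (you can use spacy for that if you insist), and then detect the language of each sentence.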
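For the separator problem raised in the comments, one alternative to chaining `replace()` calls is to treat any run of separator characters that follows sentence-ending punctuation as a boundary. This is a rough sketch using only the standard `re` module; `split_segments` is a hypothetical helper, the separator set is taken from the comment above, and abbreviations such as "e.g." would be split incorrectly.

```python
import re

# A boundary is a run of whitespace and/or separator characters (- ; , %)
# that directly follows a sentence-ending '.', '!' or '?'.
_BOUNDARY = re.compile(r'(?<=[.!?])[\s\-;,%]+')

def split_segments(text):
    """Split text into sentence-like segments, dropping separators between them."""
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]

sample = (
    "From the youngest wines to the oldest vintages. - Vous trouverez alors "
    "des galeries. ; Waiting for the perfect moment!"
)
for segment in split_segments(sample):
    print(segment)
```

Each resulting segment can then be passed to the language classifier in place of the output of `sent_tokenize`. Hyphens inside a sentence, as in "25-million", are left alone because they do not follow sentence-ending punctuation.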