Python: How to detect and separate the text of each language in a multilingual text?

python python-3.x nlp

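I have an input text that mixes English and French. I want to split it into one sub-text per language: first detect the languages present in the text (there may be two or more), then extract the portion written in each language.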

Input:

You will then discover galleries in which 25-million bottles rest in the cellars, waiting for the 
perfect moment to be tasted. From the bottle to the salmanazar, from the youngest wines to the oldest   
vintages. - Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent dans les   
caves, attendant le parfait moment pour être dégustées.

Expected output:

This text contains two languages : "fr" and "en"

text_in_english= "You will then discover galleries in which 25-million bottles rest in the cellars,  
waiting for the perfect moment to be tasted. From the bottle to the salmanazar, from the youngest  
wines to the oldest vintages."

text_in_french= "- Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent 
dans les caves, attendant le parfait moment pour être dégustées."

How can we do this?

I suggest using nltk (uncomment the punkt download line if you need it) and langid. The steps are:

1. Split the text into sentences  
2. Predict the language of each sentence  
3. Add the predicted sentences to a dictionary to group them by language, in order.

Python 3:

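from langid import classify
from nltk import tokenize
import nltk
from collections import defaultdict
# Uncomment on first run to fetch the Punkt sentence tokenizer data
# (newer NLTK releases may ask for 'punkt_tab' instead).
#nltk.download('punkt')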

mytext = """
You will then discover galleries in which 25-million bottles rest in the cellars, waiting for the 
perfect moment to be tasted. From the bottle to the salmanazar, from the youngest wines to the oldest 
vintages. - Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent dans les 
caves, attendant le parfait moment pour être dégustées.
"""
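# Flatten line breaks so the sentence tokenizer sees one continuous text.
mytext = mytext.replace('\n', ' ').replace('\r', '')
sentences = tokenize.sent_tokenize(mytext)
# Maps a language code to the list of its sentences, in order of appearance.
languages = defaultdict(list)

for sentence in sentences:
    # classify() returns a (language_code, score) tuple; keep only the code.
    languages[str(classify(sentence)[0])].append(sentence)

# Print each detected language code followed by its sentences.
for k,v in languages.items():
    print(k,v)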
#en [' You will then discover galleries in which 25-million bottles rest in the cellars, waiting for the  perfect moment to be tasted.', 'From the bottle to the salmanazar, from the youngest wines to the oldest  vintages.']   
#fr ['- Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent dans les  caves, attendant le parfait moment pour être dégustées.']
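
To get closer to the output format asked for in the question, the grouped sentences can be joined back into one string per language. The following is only a minimal sketch: `summarize_by_language` is a hypothetical helper name, not part of nltk or langid, and the example dictionary is a shortened stand-in for the `languages` dictionary built above.

```python
def summarize_by_language(groups):
    """Join each language's sentences and build a one-line summary.

    `groups` maps a language code to a list of sentences, like the
    `languages` dictionary produced by the code above.
    """
    # Strip stray whitespace from each sentence before joining.
    texts = {lang: " ".join(s.strip() for s in sents)
             for lang, sents in groups.items()}
    codes = " and ".join(f'"{lang}"' for lang in texts)
    header = f"This text contains {len(texts)} language(s): {codes}"
    return header, texts


# Shortened example input (assumed data, shaped like `languages`).
example = {
    "en": [" You will then discover galleries.", "From the bottle to the salmanazar."],
    "fr": ["- Vous trouverez alors des galeries."],
}

header, texts = summarize_by_language(example)
print(header)
for lang, text in texts.items():
    print(f"text_in_{lang} = {text!r}")
```

In the real script you would pass `languages` instead of `example`. The dictionary keeps the order in which each language was first seen, so the summary lists the languages in order of appearance.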

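The problem is that sometimes we have "English sentence" - "French sentence", and sometimes one of the characters {- ; , . %} sits between the two languages.

No problem: just add them to the "mytext.replace()" line, or try cleaning the text in a more advanced way with nltk.

I don't think your problem is related to spaCy. If I were you, I would search for language detection, then split the text into sentences (you can use spaCy for that if you insist), and then detect the language of each sentence.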
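
For the separator case raised above, one possible approach is to cut the text wherever a lone separator character stands between two spaces, and to strip any separator left at the start of a sentence before classifying it. This is a sketch under assumptions: the separator set is taken from the characters listed in the comment, and both function names are hypothetical helpers built only on the standard `re` module.

```python
import re

# Assumed separator set, based on the characters mentioned in the comment.
SEPARATORS = "-;,.%"


def split_on_separators(text):
    """Split text where a single separator stands alone between whitespace.

    Hyphens inside words ("25-million") and ordinary sentence-ending
    periods are left untouched because they are not surrounded by spaces.
    """
    pattern = r"\s+[" + re.escape(SEPARATORS) + r"]\s+"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def strip_leading_separators(sentence):
    """Remove separator characters and spaces from the start of a sentence."""
    return sentence.lstrip(SEPARATORS + " ")


print(split_on_separators("Hello world - Bonjour le monde"))
print(strip_leading_separators("- Vous trouverez alors des galeries."))
```

Each piece returned by `split_on_separators` could then be passed through the sentence tokenizer and the language classifier as in the answer above.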