Python: How can I split a multilingual text into separate texts, one per language?

Tags: python, python-3.x, nlp, spacy

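I have an input text written in English and French. I want to split it into sub-texts, one per language: detect which languages the text contains (there may be more than two), then cut out the part of the text written in each language.

Input: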

You will then discover galleries in which 25-million bottles rest in the cellars, waiting for the 
perfect moment to be tasted. From the bottle to the salmanazar, from the youngest wines to the oldest   
vintages. - Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent dans les   
caves, attendant le parfait moment pour être dégustées.
Expected output:

This text contains two languages : "fr" and "en"

text_in_english= "You will then discover galleries in which 25-million bottles rest in the cellars,  
waiting for the perfect moment to be tasted. From the bottle to the salmanazar, from the youngest  
wines to the oldest vintages."

text_in_french= "- Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent 
dans les caves, attendant le parfait moment pour être dégustées."
How can we do this?

I suggest using nltk (uncomment the punkt download line if you need it) and langid. The steps are:

1. Split the text into sentences  
2. Predict the language of each sentence  
3. Add the predicted sentences to a dictionary to group them by language, in order.
Python 3:

from langid import classify
from nltk import tokenize
import nltk
from collections import defaultdict
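# Uncomment on the first run to fetch the sentence tokenizer model
# (newer NLTK releases may ask for 'punkt_tab' instead).
#nltk.download('punkt')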

mytext = """
You will then discover galleries in which 25-million bottles rest in the cellars, waiting for the 
perfect moment to be tasted. From the bottle to the salmanazar, from the youngest wines to the oldest 
vintages. - Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent dans les 
caves, attendant le parfait moment pour être dégustées.
"""
mytext = mytext.replace('\n', ' ').replace('\r', '')
sentences = tokenize.sent_tokenize(mytext)
languages = defaultdict(list)

for sentence in sentences:
    # classify() returns a (language_code, score) tuple; keep only the code
    languages[str(classify(sentence)[0])].append(sentence)

for k,v in languages.items():
    print(k,v)
#en [' You will then discover galleries in which 25-million bottles rest in the cellars, waiting for the  perfect moment to be tasted.', 'From the bottle to the salmanazar, from the youngest wines to the oldest  vintages.']   
#fr ['- Vous trouverez alors des galeries parmi lesquelles 25 000 000 bouteilles reposent dans les  caves, attendant le parfait moment pour être dégustées.']
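
To get from the grouped sentences to the per-language strings shown in the expected output, each list can be joined back into one text. The following is only a minimal sketch: the helper name texts_by_language and the sample dictionary are hypothetical, and the sample stands in for the languages dict built above so that the snippet runs without nltk or langid.

```python
def texts_by_language(languages):
    """Join each language's sentences into one whitespace-normalised string."""
    return {
        lang: " ".join(" ".join(sentence.split()) for sentence in sentences)
        for lang, sentences in languages.items()
    }


# Hypothetical sample shaped like the `languages` dict built above.
sample = {
    "en": [" You will then discover galleries.", "From the bottle to the  salmanazar."],
    "fr": ["- Vous trouverez alors des galeries."],
}

texts = texts_by_language(sample)
codes = ", ".join('"%s"' % lang for lang in texts)
print("This text contains %d languages: %s" % (len(texts), codes))
for lang, text in texts.items():
    print("text_in_%s = %r" % (lang, text))
```

Splitting each sentence on whitespace and rejoining it also removes the double spaces left behind when the newlines were replaced with spaces.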

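The problem is that sometimes we have "English sentence" - "French sentence", and sometimes one of the characters {-;,;.%,} sits between the two languages.

No problem: just add those characters to the "mytext.replace()" line, or clean the text in a more advanced way with nltk.

I don't think your question is related to spaCy. If I were you, I would search for language detection, split the text into sentences (you can use spaCy for that if you insist), and then detect the language of each sentence.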
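
For the separator case raised in the comments, one thing to keep in mind is that chaining replace() calls removes a character everywhere, including the hyphen inside "25-million". A possible alternative, sketched below under the assumption that the separator only needs to go when it starts a sentence, is to strip it from the front of each sentence before classification. The names SEPARATORS and strip_leading_separators are hypothetical, and the character set is taken from the comment above.

```python
import re

# Characters listed in the comment as possible separators between languages.
SEPARATORS = "-;,.%"

# Match whitespace and separator characters only at the start of a sentence.
_LEADING = re.compile(r"^[\s" + re.escape(SEPARATORS) + r"]+")


def strip_leading_separators(sentence):
    """Remove separator characters from the start of a sentence only."""
    return _LEADING.sub("", sentence)


raw_sentences = [
    "From the bottle to the salmanazar.",
    "- Vous trouverez alors des galeries.",
]
cleaned = [strip_leading_separators(s) for s in raw_sentences]
print(cleaned)
```

Each cleaned sentence can then be passed to the classifier in place of the raw one, which leaves hyphens and commas inside a sentence untouched.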