Localization: Korean tokenizer


localization solr nlp


What is the best tokenizer for processing Korean?

I tried one in Solr 4.0. It does tokenize, but the accuracy is very low.
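POSTAG/Sejong is a Korean morphological analyzer that can tokenize and POS-tag Korean data without much effort. The software reports 90.7% on the corpus it was trained and tested on.
POS tagging achieved 81% on the Korean data of a corpus I have been working on.
However, there is a catch: you have to use Windows to run the software. I have a script that works around this limitation; here it is: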

#!/bin/bash -x
###############################################################################
## Sejong-Shell is a script to call the POSTAG/Sejong tagger on a Unix machine,
## because POSTAG/Sejong is only usable in a Korean Microsoft Windows environment.
## The original POSTAG/Sejong can be downloaded from
## http://isoft.postech.ac.kr/Course/CS730b/2005/index.html
##
## Sejong-Shell depends on the Windows emulator WINE.
## The WINE program can be downloaded from
## http://www.winehq.org/download/
##
## The shell script accepts the input files from one directory and
## outputs the tagged files into another while retaining the filename.
###############################################################################

# <source-file_dir> is the directory holding the text files that need tagging
cd <source-file_dir>

# Iterate with a glob so that filenames containing spaces stay intact
for file in *
do
    echo "$file"
    # <POSTAG-Sejong_dir> is the directory where the POS tagger is saved
    sudo cp <source-file_dir>/"$file" <POSTAG-Sejong_dir>/input.txt
    wine start /Unix "$HOME/postagsejong/sjTaggerInteg.exe"
    # The sleep keeps the file from the current loop from overlapping with
    # the next one. Increase it if a file is large and POSTAG/Sejong needs
    # more than 30 seconds to tag it.
    sleep 30
    # <target-file_dir> is where you want the output files to be stored
    sudo cp <POSTAG-Sejong_dir>/output.txt <target-file_dir>/"$file"
done

# Instead of the sleep command to prevent the overlap:
#   $sleep 30
# Alternatively, you can manually continue a loop with the following
# command that continues a loop after a keystroke input:
#   $read -p "Press any key to continue…"

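(Sejong-Shell source: Liling Tan. 2011. Building the foundation text for Nanyang Technological University - Multilingual Corpus (NTU-MC). Final year project. Singapore: Nanyang Technological University. p. 44)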
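The script relies on a fixed `sleep 30` to keep one file's output from overlapping with the next. A possible alternative, sketched here in Python, is to poll until `output.txt` exists and has stopped growing. The function name `wait_for_output` and its parameters are hypothetical, and the sketch assumes the caller deletes the old `output.txt` before each tagger run and that the tagger has finished once the file size stays unchanged between two polls; neither assumption comes from the original answer.

```python
import os
import time


def wait_for_output(path, timeout=30.0, poll=0.5):
    """Return True once `path` is non-empty and its size is stable across two polls.

    Returns False if that does not happen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file not there yet (or not readable)
        if size > 0 and size == last_size:
            return True
        last_size = size
        time.sleep(poll)
    return False
```

With such a helper, the copy to the target directory would run only after `wait_for_output` returns True, and a False result flags a file that needs a longer timeout.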
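You are looking for a free/OSS tokenizer, right? I'm afraid that, as far as I know, the only tokenizer for CJKV languages that works more or less properly is a commercial product.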
#!/usr/bin/python3
# -*- coding: utf-8 -*-
'''
pre-sejong clean
'''
import glob
import os

from nltk.tokenize import RegexpTokenizer

cwd = './gizaclean_ko'  # input directory with UTF-8 text files
wrd = './presejong_ko'  # output directory for EUC-KR text files

# Pattern kept as in the original. Note that the final character class
# also matches '"', 'w' and '*', not only sentence-ending punctuation.
kr_sent_tokenizer = RegexpTokenizer(u'[^！？.?!]*[！？."www.*"]')

for infile in glob.glob(os.path.join(cwd, '*.txt')):
    # if infile == './extract_ko/singapore-sling.txt': continue
    # if infile == './extract_ko/ion-orchard.txt': continue
    print(infile)
    (PATH, FILENAME) = os.path.split(infile)
    # Read UTF-8, write EUC-KR; unencodable characters raise an error
    with open(infile, encoding='utf-8') as reader, \
            open(os.path.join(wrd, FILENAME), 'w', encoding='euc-kr') as writer:
        for line in reader:
            for sent in kr_sent_tokenizer.tokenize(line.strip()):
                # Substitute characters before the text is encoded as EUC-KR
                newsent = sent.replace(u'\xa0', ' ')
                newsent2 = newsent.replace(u'\xe7', 'c')
                newsent3 = newsent2.replace(u'\xe9', 'e')
                newsent4 = newsent3.replace(u'\u2013', '-')
                newsent5 = newsent4.replace(u'\xa9', '(c)')
                newsent6 = newsent5.strip()
                print(newsent6)
                writer.write(newsent6 + '\n')
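The chain of `replace` calls in the script above can also be written as a single translation table, which makes the substitution step easy to test in isolation. The sketch below is one way to do that; the names `_EUC_KR_FIXES`, `clean_for_euc_kr` and `unencodable_chars` are hypothetical, and the second helper is an addition for spotting characters that would still fail when the text is encoded as EUC-KR.

```python
# The same five substitutions as in the script, as one translation table.
_EUC_KR_FIXES = str.maketrans({
    "\xa0": " ",     # no-break space -> plain space
    "\xe7": "c",     # c with cedilla -> c
    "\xe9": "e",     # e with acute accent -> e
    "\u2013": "-",   # en dash -> hyphen
    "\xa9": "(c)",   # copyright sign -> (c)
})


def clean_for_euc_kr(sentence):
    """Apply the substitutions and strip surrounding whitespace."""
    return sentence.translate(_EUC_KR_FIXES).strip()


def unencodable_chars(sentence, encoding="euc-kr"):
    """Return the characters of `sentence` that `encoding` cannot represent."""
    bad = []
    for ch in sentence:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad
```

Running `unencodable_chars` over a cleaned sentence before writing it shows which further substitutions a corpus would need, instead of letting the encode step fail partway through a file.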