Localization: Korean tokenizer


What is the best tokenizer for processing Korean?

I tried one in Solr 4.0. It does tokenize, but the accuracy is very low.
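When judging a tokenizer's accuracy inside Solr, it helps to inspect the token stream directly rather than guessing from search results. Solr 4 exposes a field-analysis handler over HTTP for exactly this. A minimal Python 3 sketch that builds such a request URL (the host, core name `collection1`, and field type `text_cjk` are placeholder assumptions, not from the original post):

```python
# Build a request URL for Solr's field-analysis handler, which returns
# the tokens each analysis stage produces for a sample text.
from urllib.parse import urlencode

def analysis_url(base, fieldtype, text):
    """URL asking Solr to show how `fieldtype` analyzes `text`."""
    params = urlencode({
        'analysis.fieldtype': fieldtype,
        'analysis.fieldvalue': text.encode('utf-8'),
        'wt': 'json',
    })
    return '%s/analysis/field?%s' % (base, params)

print(analysis_url('http://localhost:8983/solr/collection1',
                   'text_cjk', u'\uc548\ub155\ud558\uc138\uc694'))
```

Fetching that URL (with the Solr core running) returns, per analyzer stage, the tokens produced, which makes low tokenization accuracy easy to see.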

It is a Korean morphological analyzer that can tokenize and POS-tag Korean data without much effort. The software reports 90.7% accuracy on the corpus it was trained and tested on (see ).

The POS tagging achieved 81% accuracy on the Korean data I have been working on.

However, there is a catch: you have to use Windows to run the software. But I have a script that gets around this limitation. Here is the script:

#!/bin/bash -x
###############################################################################
## Sejong-Shell is a script to call the POSTAG/SEJONG tagger on a Unix machine,
## because POSTAG/Sejong is only usable in a Korean Microsoft Windows environment.
## the original POSTAG/Sejong can be downloaded from
## http://isoft.postech.ac.kr/Course/CS730b/2005/index.html
##
## Sejong-Shell is dependent on WINdows Emulator.
## The WINE program can be downloaded from
## http://www.winehq.org/download/
##
## The shell scripts accepts the input files from one directory and
## outputs the tagged files into another while retaining the filename
###############################################################################

cd <source-file_dir>
# <source-file_dir> is the directory that holds the text files that need tagging
for file in `dir -d *`
do
    echo $file
    sudo cp <source-file_dir>/"$file" <POSTAG-Sejong_dir>/input.txt
    # <POSTAG-Sejong_dir> refers to the directory where the pos-tagger is saved
    wine start /Unix "$HOME/postagsejong/sjTaggerInteg.exe"
    sleep 30
    # This is necessary so that the file from the current loop won't be
    # overlapping with the next, do increase the time for sleep if the file
    # is large and needs more than 30 sec for POSTAG/Sejong to tag.
    sudo cp <POSTAG-Sejong_dir>/output.txt <target-file_dir>/"$file"
    # <target-file_dir> is where you want the output files to be stored
done

# Instead of the sleep command to prevent the overlap:
#   $sleep 30
# Alternatively, you can manually continue a loop with the following 
# command that continues a loop after a keystroke input:
#   $read -p "Press any key to continue…"
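A fixed `sleep 30` either wastes time or is too short for large files. A third option is to poll until the tagger has actually rewritten the output file. A Python sketch of that idea (the timeout and poll interval are assumptions, not part of the original shell script):

```python
import os
import time

def wait_for_output(path, old_mtime, timeout=300, poll=1.0):
    """Poll until `path` is (re)written, i.e. its mtime differs from
    `old_mtime`, or until `timeout` seconds elapse. Returns True when
    the file changed, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if os.path.getmtime(path) != old_mtime:
                return True
        except OSError:
            pass  # output file not created yet
        time.sleep(poll)
    return False
```

Recording `output.txt`'s mtime before launching the tagger and calling `wait_for_output` afterwards replaces the guesswork of a fixed sleep.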

(< Sejong-Shell source: 李玲覃. 2011. Building the basic text of the multilingual corpus for Nanyang Technological University (NTU-MC). Final Year Project. Singapore: Nanyang Technological University. p. 44)

You are looking for a free/OSS tokenizer, right? I am afraid that, as far as I know, the only tokenizers for CJKV languages that work more or less correctly are commercial products.
The following script pre-cleans the text files before tagging: it splits each line into sentences and replaces characters that EUC-KR / POSTAG-Sejong cannot handle.

#!/usr/bin/python
# -*- coding: utf-8 -*-

'''
pre-sejong clean
'''

import codecs
import nltk
import os, sys, re, glob
from nltk.tokenize import RegexpTokenizer

reload(sys)
sys.setdefaultencoding('utf-8')

cwd = './gizaclean_ko' #os.getcwd()
wrd = './presejong_ko'

# Split on sentence-final punctuation (., ! and ?).
kr_sent_tokenizer = nltk.RegexpTokenizer(u'[^!?.]*[!?.]')

for infile in glob.glob(os.path.join(cwd, '*.txt')):
#   if infile == './extract_ko/singapore-sling.txt': continue
#   if infile == './extract_ko/ion-orchard.txt': continue
    print infile
    (PATH, FILENAME) = os.path.split(infile)
    reader = open(infile)
    writer = open(os.path.join(wrd, FILENAME), 'w')
    for line in reader:
        para = []
        para.append(kr_sent_tokenizer.tokenize(unicode(line, 'utf-8').strip()))
        for sent in para[0]:
            # Replace characters that cannot be encoded in EUC-KR.
            newsent = sent.replace(u'\xa0', ' ')        # non-breaking space
            newsent2 = newsent.replace(u'\xe7', 'c')    # c-cedilla
            newsent3 = newsent2.replace(u'\xe9', 'e')   # e-acute
            newsent4 = newsent3.replace(u'\u2013', '-')   # en dash
            newsent5 = newsent4.replace(u'\xa9', '(c)')   # copyright sign
            newsent6 = newsent5.encode('euc-kr').strip()
            print newsent6
            writer.write(newsent6 + '\n')
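The `kr_sent_tokenizer` above is just a regular expression over sentence-final punctuation. The same behaviour can be reproduced with only the standard library; a Python 3 sketch (not part of the original script):

```python
import re

# Same idea as the script's kr_sent_tokenizer: match a run of
# non-terminal characters followed by one sentence-final mark.
SENT_RE = re.compile(r'[^!?.]*[!?.]')

def split_sentences(text):
    """Split text into sentences on ., ! and ?, stripping whitespace."""
    return [s.strip() for s in SENT_RE.findall(text)]

print(split_sentences('Hello. How are you? Fine!'))
# ['Hello.', 'How are you?', 'Fine!']
```

Note that a purely punctuation-based splitter mishandles abbreviations and numbers ("3.5"), which is one reason dedicated tokenizers score higher on real corpora.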