Python: How do I output NLTK chunks to a file?

python regex file-io nlp

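I have a Python script that uses the NLTK library to parse, tokenize, tag, and chunk some arbitrary text from the web.

I need to format the output of chunked1, chunked2, and chunked3 and write it to a file. They are of type class 'nltk.tree.Tree'.

More specifically, I only need to write the lines that match the regular expressions chunkGram1, chunkGram2, and chunkGram3.

How can I do that?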

#! /usr/bin/python2.7

import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]


def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)

        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)

        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)

        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:

            # for i,line in enumerate(chunked1):
                # if "JJ" in line:
                    # outfile.write(line)
                # elif "NNP" in line:
                    # outfile.write(line)



processLanguage()

First, watch this video:

Now, for the proper answer:

import re
import io

from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser


xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."


chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunkParser1 = RegexpParser(chunkGram1)

chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent))) 
            for sent in sent_tokenize(xstring)]

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(str(chunk)+'\n\n')

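[out] (three variants of the write call, each run under both interpreters; the tracebacks show that the first uses str(chunk) and the second unicode(chunk), while the code of the third, which runs under both, is not included here):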

alvas@ubi:~$ python test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(str(chunk)+'\n\n')
TypeError: must be unicode, not str
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(unicode(chunk)+'\n\n')
NameError: name 'unicode' is not defined

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile 
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile 
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
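
The code above writes each whole sentence tree. To write only the parts that matched the grammar, as the question asks, one option is to walk each tree and keep just the subtrees labelled Chunk. The sketch below assumes NLTK 3's Tree methods subtrees(filter=...), label() and leaves(), and that the trees passed in are chunked, POS-tagged sentences like the chunked list built above. The names chunk_lines and write_chunks are hypothetical helpers, not part of NLTK.

```python
import io


def chunk_lines(trees, label=u"Chunk"):
    """Yield one 'word/TAG word/TAG' line per subtree carrying the given label."""
    for tree in trees:
        # Tree.subtrees(filter=...) walks the tree and yields the matching subtrees.
        for subtree in tree.subtrees(filter=lambda t: t.label() == label):
            # Leaves of a chunked, POS-tagged sentence are (word, tag) pairs.
            yield u" ".join(
                u"{}/{}".format(word, tag) for word, tag in subtree.leaves()
            )


def write_chunks(trees, path, label=u"Chunk"):
    """Write the matching chunks to a UTF-8 file, one per line; return the count."""
    count = 0
    # io.open behaves the same on Python 2.7 and Python 3 and expects text,
    # which is why every literal here is a unicode literal.
    with io.open(path, "w", encoding="utf8") as fout:
        for line in chunk_lines(trees, label):
            fout.write(line + u"\n")
            count += 1
    return count
```

With the chunked list from the answer, write_chunks(chunked, 'outfile') would produce lines such as electronic/JJ library/NN instead of the full bracketed trees. Because the lines are built from unicode literals, the str versus unicode mismatch seen in the tracebacks should not arise on either interpreter.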


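Your code has several problems, though the main culprit is that the for loop does not modify the contents of xstring. Assigning to the loop variable line only rebinds that name and leaves the list untouched.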

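I will address all the problems in your code here:

You cannot write a path like this with single backslashes, because \t will be interpreted as a tab character and \f as a form feed. You have to double them. I know this was only an example, but this confusion comes up often: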

with open('path\\to\\file.txt', 'r') as infile:
    xstring = infile.readlines()
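
To make the escaping issue concrete, here is a small sketch that compares the broken literal with three safer spellings. The path itself is only the placeholder from the question; nothing is opened on disk.

```python
import os

# '\t' and '\f' are escape sequences, so this literal holds a tab and a
# form feed where the backslashes were meant to be.
broken = 'path\to\file.txt'
print(repr(broken))

# Doubling the backslashes, or using a raw string, keeps them literal.
escaped = 'path\\to\\file.txt'
raw = r'path\to\file.txt'
print(escaped == raw)

# os.path.join picks the separator of the current platform.
portable = os.path.join('path', 'to', 'file.txt')
print(portable)
```

Raw strings are usually the least error-prone choice for Windows paths written by hand, while os.path.join avoids hard-coding a separator at all.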

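The infile.close line is wrong. It does not call the close method; in fact it does nothing. Besides, your file has already been closed by the with statement. If you see this line in any answer anywhere, downvote that answer right away and leave a comment saying that file.close is wrong and should be file.close().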
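
The difference between referencing and calling close can be checked directly. This is a minimal sketch that uses a temporary scratch file (an assumption made for the example, not a file from the question):

```python
import os
import tempfile

# Create a small scratch file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

infile = open(path)
infile.close                       # only looks the method up; nothing is called
open_after_reference = not infile.closed
infile.close()                     # the parentheses perform the call
closed_after_call = infile.closed

# A with statement closes the file when the block exits, so no close() is needed.
with open(path) as infile:
    lines = infile.readlines()
closed_after_with = infile.closed

os.remove(path)
print(open_after_reference, closed_after_call, closed_after_with)
```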

The following should work, but be aware that it replaces every non-ASCII character with a space (' '), which breaks words such as naïve and café:

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
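
To show the damage on the two example words, the sketch below runs remove_non_ascii next to an alternative that strips accents through Unicode decomposition. The strip_accents helper is a hypothetical name and a swapped-in technique, not something from the original answer. It silently drops any character that has no ASCII base form, so it is no general replacement for keeping the text in Unicode.

```python
import unicodedata


def remove_non_ascii(line):
    # Same function as above: every non-ASCII character becomes a space.
    return ''.join([i if ord(i) < 128 else ' ' for i in line])


def strip_accents(line):
    # NFKD splits an accented letter into its base letter plus combining marks;
    # encoding to ASCII with 'ignore' then drops the marks and keeps the base.
    decomposed = unicodedata.normalize('NFKD', line)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


text = u'na\u00efve caf\u00e9'      # "naïve café"
print(repr(remove_non_ascii(text)))
print(repr(strip_accents(text)))
```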

As for the for loop, it should instead assign each cleaned line back into xstring:

for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)

Or my preferred, more Pythonic version:

xstring = [ remove_non_ascii(line) for line in xstring ]
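
The effect of rebinding the loop variable versus assigning through the index can be seen on a two-item list. This is a small sketch with made-up sample strings:

```python
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])


xstring = [u'caf\u00e9', u'plain']

# Rebinding the loop variable changes only the local name `line`.
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)
after_rebinding = list(xstring)    # snapshot: the list still holds the original strings

# Assigning through the index stores the cleaned string in the list itself.
for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)
after_assignment = list(xstring)

print(after_rebinding)
print(after_assignment)
```

The list comprehension builds a new list and rebinds xstring to it, which gives the same contents as the index-assignment loop.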

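These Unicode errors arise mainly because you are processing Unicode text with Python 2.7. Recent Python 3 versions are far ahead in this respect, so if you are only starting on this task I recommend upgrading to Python 3.4+ soon.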

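An error traceback with line numbers would help identify what in your code causes the TypeError.

Your line contains a Tree, not a string. Try iterating over the strings it contains.

@Selcuk Would you mind elaborating?

nltk.RegexpParser().parse() will return an iterator of Trees. That is why you need another for loop to iterate over the contents of line. I cannot test it because I do not have nltk installed at the moment.

A 100% working, immediate solution is to use Python 3.

I will accept your answer because I value the feedback you provided. Perhaps you can help me with one more small thing. Take a look at the edit section of the question.

I would answer your edit, but I think it is a separate question in its own right. It is better to ask another question before the SO moderators come and delete your question for some reason. Hahaha =)

Could you upload your file somewhere and ask another question about data cleaning? If I do not know what the file looks like or what kind of file it is, I cannot help much. Depending on the file and its contents, there are 101 ways to clean the data.

The script has to handle any random text from Wikipedia. Depending on the text is not a good idea, so I am looking for a general solution like the one implemented in the question...

Then you will have to hire a professional to do that. If you only need Wikipedia, see. By the way, Wikipedia is encoded in Unicode and there is no need to convert it to ASCII; using ASCII in NLP or text processing is bad practice. It is strongly discouraged unless you are dealing with legacy text. If not, there is no need to go backwards. Maybe someone else can help you write a general script, but sorry, I cannot do that =( By the way, IBM hired a team of engineers to do this, so the answer will never be enough.

Thanks for your answer; I will take a closer look when I have time.
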
# The loop that the second answer refers to: rebinding line leaves xstring unchanged.
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)