Python: How do I tokenize a list of words with nltk?


I have a text dataset. It consists of many lines, and each line contains two sentences separated by a tab, like this:

this is string 1, first sentence.    this is string 2, first sentence.
this is string 1, second sentence.    this is string 2, second sentence.
I then split the text with the following code:

# file readdata.py
from globalvariable import *  # provides the global list kalimatayat
import os


class readdata:
    def dataAyat(self):
        global kalimatayat
        with open(os.path.join('E:\\dataset', 'dataset.txt'), 'r') as fo:
            for line in fo:
                # split the line on the tab into its two sentences
                datatxt = line.rstrip('\n').split('\t')
                # the sentences contain no more tabs, so this second split
                # only wraps each sentence in a one-element list
                newdatatxt = [x.split('\t') for x in datatxt]
                kalimatayat.append(newdatatxt)
                print(newdatatxt)

readdata().dataAyat()
It runs fine, and the output is:

[['this is string 1, first sentence.'],['this is string 2, first sentence.']]
[['this is string 1, second sentence.'],['this is string 2, second sentence.']]
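(Note: each sentence ends up wrapped in a one-element list because the second split('\t') finds no tab inside a sentence, so str.split returns the whole string as a single list element, as a quick check in the interpreter shows:)

>>> 'this is string 1, first sentence.'.split('\t')
['this is string 1, first sentence.']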
What I want to do is tokenize these lists with nltk's word_tokenize; the output I expect looks like this:

[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']]
[['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]
Does anyone know how to tokenize the lists into the output above?
You want to write a tokenize function in tokenizer.py and call it from mainfile.py to tokenize the list of sentence pairs: iterate over the list and store the tokenized results in a new list:

import nltk

data = [[['this is string 1, first sentence.'], ['this is string 2, first sentence.']],
        [['this is string 1, second sentence.'], ['this is string 2, second sentence.']]]
results = []
for pair in data:
    pair_results = []
    for s in pair:
        # s is a one-element list holding a single sentence string
        pair_results.append(nltk.word_tokenize(s[0]))
    results.append(pair_results)
The result will look like this:

[[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],
  ['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']],
 [['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],
  ['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]]
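As a side note, the same loop can be written as one nested list comprehension, assuming the same data shape (each inner item a one-element list):

# compact alternative to the loops above
results = [[nltk.word_tokenize(s[0]) for s in pair] for pair in data]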

Yes, but that is a single string. I want to tokenize a list of sentences, sir. May I ask one more thing: if readdata and tokenizer are split into two files, readdata.py and tokenizer.py, and I want to combine them in a main file mainfile.py, how should tokenizer.py and mainfile.py be written so they produce the tokenized result above?

Well, I'm not sure I understand your question. First you need the text in a form you can process. In your readData function you seem to store it in the global kalimatayat list, which should really be replaced by a class member. That variable is essentially what I called data in the example in my answer, so data is a kalimatayat.

OK sir, I'll try that. Thank you for your answer!

Sir, I have edited the question. You can check it again.
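A minimal sketch of that two-file layout could look like the following; tokenize_pairs is an illustrative name, and it assumes dataAyat is changed to return its list instead of filling the global kalimatayat:

# file tokenizer.py
import nltk

def tokenize_pairs(data):
    # data is a list of sentence pairs; each inner item is a one-element list
    return [[nltk.word_tokenize(s[0]) for s in pair] for pair in data]

# file mainfile.py
from readdata import readdata
from tokenizer import tokenize_pairs

data = readdata().dataAyat()  # assumes dataAyat now returns kalimatayat
results = tokenize_pairs(data)
print(results)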