
How do I create a 'word stream' and a 'document stream' in Python?


I want to combine a bunch of text files into two arrays: a 'word stream' and a 'document stream'. This is done by counting the total number of word tokens in the corpus and then building arrays where each entry in the word stream is the id of the word associated with that token, and the corresponding entry in the document stream is the id of the document the word came from.

For example, if the corpus is

Doc1: "The cat sat on the mat"
Doc2: "The fox jumped over the dog"
the word stream (WS) and document stream (DS) would look like this:

WS: 1 2 3 4 1 5 1 6 7 8 1 9
DS: 1 1 1 1 1 1 2 2 2 2 2 2 

I'm not quite sure how to go about this, so my question is basically: how do I turn a text file into an array of word tokens?
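For the file part specifically (turning a text file into a list of word tokens), a minimal sketch, assuming plain UTF-8 text files and using re.findall (my choice, not something given in the question), might look like this; the answer below then works from in-memory strings:

import re

def tokenize_file(path):
    # Read the whole file into memory (assumes the files are small)
    # and split it into lowercase word tokens.
    with open(path, encoding='utf-8') as f:
        text = f.read()
    return re.findall(r'\w+', text.lower())

# e.g. tokenize_file('doc1.txt') -> ['the', 'cat', 'sat', 'on', 'the', 'mat']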

Something like this? This is Python 3 code, but I think that only matters for the print statements. There are some notes in the comments about things to add later.

strings = [ 'The cat sat on the mat',           # documents to process
            'The fox jumped over the dog' ]
docstream = []                                  # document indices
wordstream = []                                 # token indices
words = []                                      # tokens themselves

# Return a list of the words in the given string. NOTE: this splits on
# single spaces; in real life you might want to split on runs of spaces,
# newlines, tabs, and so on. See regular expressions in the 're' module
# and 're.split(...)'.
def tokenize(s):
    return s.split(' ')

# Look up a token in the words list. If it is not present (yet), append it
# and return the new position. NOTE: in real life you might want to fold
# case so that 'The' and 'the' are treated the same.
def lookup_token(token):
    for i, w in enumerate(words):
        if w == token:
            print('Found', token, 'at index', i)
            return i
    words.append(token)
    print('Appended', token, 'at index', len(words) - 1)
    return len(words) - 1

# Main starts here
for stringindex, s in enumerate(strings):
    print('Analyzing string:', s)
    for t in tokenize(s):
        print('Analyzing token', t, 'from string', stringindex)
        docstream.append(stringindex)
        wordstream.append(lookup_token(t))

# Done.
print(wordstream)
print(docstream)
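For the two example strings, the final two print calls produce the following (0-based indices, and since there is no case folding, 'The' and 'the' get different ids):

[0, 1, 2, 3, 4, 5, 0, 6, 7, 8, 4, 9]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]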

When docstream prints, it produces an array of 0s and 1s rather than 1s and 2s. Also, to handle the 'The' vs 'the' problem, I just used .lower() to lowercase the strings.

I used the literal array indices, so they come out as 0 and 1.
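For reference, a sketch that folds those refinements together: case folding with .lower(), regex-based splitting, a dict instead of the linear scan, and 1-based ids to match the question's example. The name build_streams is my own:

import re

def build_streams(strings):
    word_ids = {}                     # token -> 1-based id (replaces the linear scan)
    wordstream, docstream = [], []
    for docindex, s in enumerate(strings, start=1):   # 1-based document ids
        for token in re.split(r'\s+', s.lower()):     # fold case, split on whitespace
            if token:                                 # skip empty tokens at the edges
                wordstream.append(word_ids.setdefault(token, len(word_ids) + 1))
                docstream.append(docindex)
    return wordstream, docstream

ws, ds = build_streams(['The cat sat on the mat',
                        'The fox jumped over the dog'])
print(ws)   # [1, 2, 3, 4, 1, 5, 1, 6, 7, 8, 1, 9]
print(ds)   # [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]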