Python: using a regular expression to split strings on a column or row in a matrix

Tags: python, regex, numpy, matrix, group-by

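What is the best way to perform a `re.split` on a column in a matrix?
I am interested in doing this in order to preserve the different annotations or labels associated with the characters of a string stored in a matrix column.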

Here is what I have so far to illustrate my point:

from re import split
from numpy import vstack, zeros

text = 'This is a line of text.  It will need to be split into sentences.'
# binary annotations of characters (1 denotes a character of interest which belongs to a word of interest)
bin_ann = zeros(len(text))
# This is just to label some of the words in the text
text_to_label = ['text', 'sentences']
for l in text_to_label:
    start = text.find(l)
    end = start + len(l)
    bin_ann[start:end] = 1

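# We can zip these together and make a matrix such that each character is now labeled.
# list() makes this a real sequence: on Python 3, zip returns an iterator,
# which recent NumPy versions do not accept in vstack.
z = list(zip(text, bin_ann))
nz = vstack(z)
# This is our labeled text matrix
print(nz)
print(nz[:,0])

s = split(r'(\.\s+)', nz[:,0])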
print(s)

This produces the following output:

[['T' '0.0']
 ['h' '0.0']
 ['i' '0.0']
 ['s' '0.0']
 [' ' '0.0']
 ['i' '0.0']
 ['s' '0.0']
 [' ' '0.0']
 ['a' '0.0']
 [' ' '0.0']
 ['l' '0.0']
 ['i' '0.0']
 ['n' '0.0']
 ['e' '0.0']
 [' ' '0.0']
 ['o' '0.0']
 ['f' '0.0']
 [' ' '0.0']
 ['t' '1.0']
 ['e' '1.0']
 ['x' '1.0']
 ['t' '1.0']
 ['.' '0.0']
 [' ' '0.0']
 [' ' '0.0']
 ['I' '0.0']
 ['t' '0.0']
 [' ' '0.0']
 ['w' '0.0']
 ['i' '0.0']
 ['l' '0.0']
 ['l' '0.0']
 [' ' '0.0']
 ['n' '0.0']
 ['e' '0.0']
 ['e' '0.0']
 ['d' '0.0']
 [' ' '0.0']
 ['t' '0.0']
 ['o' '0.0']
 [' ' '0.0']
 ['b' '0.0']
 ['e' '0.0']
 [' ' '0.0']
 ['s' '0.0']
 ['p' '0.0']
 ['l' '0.0']
 ['i' '0.0']
 ['t' '0.0']
 [' ' '0.0']
 ['i' '0.0']
 ['n' '0.0']
 ['t' '0.0']
 ['o' '0.0']
 [' ' '0.0']
 ['s' '1.0']
 ['e' '1.0']
 ['n' '1.0']
 ['t' '1.0']
 ['e' '1.0']
 ['n' '1.0']
 ['c' '1.0']
 ['e' '1.0']
 ['s' '1.0']
 ['.' '0.0']]
['T' 'h' 'i' 's' ' ' 'i' 's' ' ' 'a' ' ' 'l' 'i' 'n' 'e' ' ' 'o' 'f' ' '
 't' 'e' 'x' 't' '.' ' ' ' ' 'I' 't' ' ' 'w' 'i' 'l' 'l' ' ' 'n' 'e' 'e'
 'd' ' ' 't' 'o' ' ' 'b' 'e' ' ' 's' 'p' 'l' 'i' 't' ' ' 'i' 'n' 't' 'o'
 ' ' 's' 'e' 'n' 't' 'e' 'n' 'c' 'e' 's' '.']
['This', ' ', 'is', ' ', 'a', ' ', 'line', ' ', 'of', ' ', 'text', '.', '', '  ', 'It', ' ', 'will', ' ', 'need', ' ',
 'to', ' ', 'be', ' ', 'split', ' ', 'into', ' ', 'sentences', '.', '']
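
I would like to keep the matrix of annotated characters intact while grouping the characters into tokens. So the desired output might look something like this: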

[ ['This' 0.0] [' ' 0.0] ['a' 0.0] ... ['text' 1.0] ...['sentences' 1.0] ['.' 0.0] ]
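
One possible way to get output of this shape, sketched under assumptions the question does not state: tokenize the original string with `re.finditer` and use each match's start and end offsets to slice the annotation array, so the labels stay aligned with the tokens. The token pattern `\w+|\s+|[^\w\s]`, the rule that a token is labeled 1 if any of its characters is, and the helper name `label_tokens` are all illustrative choices, not part of the original code.

```python
import re
import numpy as np

def label_tokens(text, char_labels, pattern=r'\w+|\s+|[^\w\s]'):
    """Group characters into tokens, carrying one label per token.

    Assumption: a token is labeled 1.0 if any of its characters is labeled.
    """
    tokens = []
    for m in re.finditer(pattern, text):
        start, end = m.span()  # character offsets of this token in text
        tokens.append((m.group(), float(char_labels[start:end].max())))
    return tokens

text = 'This is a line of text.  It will need to be split into sentences.'
bin_ann = np.zeros(len(text))
for word in ['text', 'sentences']:
    start = text.find(word)
    bin_ann[start:start + len(word)] = 1

labelled = label_tokens(text, bin_ann)
print(labelled[:3])  # [('This', 0.0), (' ', 0.0), ('is', 0.0)]
```

Because the pattern matches every character exactly once, joining the tokens gives back the original text, and the offsets of each split remain available through `m.span()`.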

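Just match a word and an optional period, `(\w+)(\s*[.])?`, and you can build the output array. If $2 has length, then you have reached a period.

Wouldn't it be much simpler to use lists rather than arrays?

@hpaulj I'm not sure how much simply using lists would help, although it might be easier to read. It definitely does not solve the problem of preserving lengths and knowing where the splits occur when the text is split.

Which version of Python? Calling `split` gives me a `TypeError: expected string or bytes-like object` in both 3.6 and 2.7 (the 2.7 version says `buffer` rather than `bytes-like object`). Also, your desired output looks more like a matrix of annotated words than a matrix of annotated characters.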
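
The `(\w+)(\s*[.])?` suggestion from the comments could be sketched roughly as follows. This is only an illustration: the nested-list layout of `sentences` and the "any labeled character labels the word" rule are assumptions, not something the commenters specified. Working on the original `text` string rather than on the NumPy column also sidesteps the `TypeError` mentioned above.

```python
import re
import numpy as np

text = 'This is a line of text.  It will need to be split into sentences.'
bin_ann = np.zeros(len(text))
for word in ['text', 'sentences']:
    start = text.find(word)
    bin_ann[start:start + len(word)] = 1

sentences = [[]]  # one list of (word, label) pairs per sentence
for m in re.finditer(r'(\w+)(\s*[.])?', text):
    start, end = m.span(1)  # offsets of the word itself, without the period
    # Assumption: a word is labeled 1.0 if any of its characters is labeled.
    sentences[-1].append((m.group(1), float(bin_ann[start:end].max())))
    if m.group(2):  # group 2 matched, so a period follows this word
        sentences.append([])
if not sentences[-1]:  # drop the empty list opened after the final period
    sentences.pop()

for sentence in sentences:
    print(sentence)
```

Note that this variant discards whitespace and punctuation, so it yields annotated words per sentence rather than a full character-aligned matrix.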