How do I create a two-dimensional array of words per sentence from text in Python?

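I have a text with, let's say, 5 sentences:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing. Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Using Python, how can I convert it into a two-dimensional array in which each sentence is split into separate words?

Taking the first sentence as an example, this is what I need as the first element of the array: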

['lorem', 'ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry']

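I can achieve this with the following commands:

import re

string = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'

string = string.lower()
arrWords = re.split('[^a-z]', string)      # split on anything that is not a lowercase letter
arrWords = list(filter(None, arrWords))    # drop the empty strings left behind by the split
print(arrWords)

But how do I build an array of these elements by looping over the sentences in the text?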

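Although it is generally hard to tell exactly where a sentence ends, in this case every sentence ends with a period, so we can split the paragraph into sentences on the periods. You already have the code for splitting a sentence into words, but here it is anyway:

paragraph = "Lorem Ipsum ... "
sentences = []
while paragraph.find('.') != -1:
    index = paragraph.find('.')               # position of the next period
    sentences.append(paragraph[:index+1])     # keep the sentence, period included
    paragraph = paragraph[index+1:]           # continue with the rest of the text

print(sentences)

Output: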

['Lorem Ipsum is simply dummy text of the printing and typesetting industry.', 
"Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", 
'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.', 
'It was popularised in the 1960s with the release of Letraset sheets containing.', 
'Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.']

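Then we convert each of them into an array of words:

word_matrix = []
for sentence in sentences:
    # strip() removes the whitespace left at the start of each sentence
    word_matrix.append(sentence.strip().split(' '))

print(word_matrix)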

Which outputs:

[['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry.'], 
['Lorem', 'Ipsum', 'has', 'been', 'the', "industry's", 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s,', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book.'], 
['It', 'has', 'survived', 'not', 'only', 'five', 'centuries,', 'but', 'also', 'the', 'leap', 'into', 'electronic', 'typesetting,', 'remaining', 'essentially', 'unchanged.'], 
['It', 'was', 'popularised', 'in', 'the', '1960s', 'with', 'the', 'release', 'of', 'Letraset', 'sheets', 'containing.'], 
['Lorem', 'Ipsum', 'passages,', 'and', 'more', 'recently', 'with', 'desktop', 'publishing', 'software', 'like', 'Aldus', 'PageMaker', 'including', 'versions', 'of', 'Lorem', 'Ipsum.']]
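The output above keeps capitalisation and punctuation, whereas the question asks for lowercase words without punctuation. A minimal sketch of how the two steps might be combined to get that form is shown here; the function name `sentences_to_words` and the shortened sample text are illustrative assumptions, not part of the original answer. Like the question's `re.split('[^a-z]', ...)`, it keeps only runs of letters, so `industry's` would become `industry` and `s`, and numbers such as `1500s` lose their digits.

```python
import re

def sentences_to_words(text):
    """Return a list of sentences, each given as a list of lowercase words."""
    matrix = []
    for sentence in text.split('.'):
        # Keep only runs of lowercase letters; digits and punctuation are dropped.
        words = re.findall(r'[a-z]+', sentence.lower())
        if words:  # skip the empty piece that follows the final period
            matrix.append(words)
    return matrix

# Shortened sample text (an assumption for illustration)
text = ("Lorem Ipsum is simply dummy text of the printing and typesetting industry. "
        "It has survived not only five centuries.")
print(sentences_to_words(text))
```

This still assumes that a period always marks the end of a sentence, which holds for the sample text but not for text containing abbreviations or decimals.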

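The challenge here is determining where a sentence ends. I think you could use a regular expression to cover most cases, but the simple list comprehension shown below will handle the dummy text, since every sentence in it ends with a period.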

    x = "Lorem Ipsum is simply dummy ..."

    words = [sentence.split(" ") for sentence in x.split(". ")]
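One detail worth noting with this approach: splitting on `". "` removes the period from every sentence except the last one, so the final word would keep its trailing `.`. A small sketch of one way to handle that, using a shortened sample text of my own rather than the full paragraph, is to strip the final period before splitting:

```python
x = "Lorem Ipsum is simply dummy text. It has survived five centuries."

# Remove the final period first so that no word keeps a trailing '.'
words = [sentence.split(" ") for sentence in x.rstrip(".").split(". ")]
print(words)
```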

Assuming every sentence ends with '.' (as in the example you gave)

Setup:

para = input("Enter the Para : ")   # input: the paragraph
sentence = []                       # list of sentences
word = []                           # final 2D list of words

List of sentences:

sentence = para.split('.')    # split at '.' (periods)
sentence.pop()                # the last element is '' because the text ends with '.', so drop it

Get the list of words:

for i in range(len(sentence)):               # go through each sentence
    sentence[i] = sentence[i].strip(" ")     # strip surrounding spaces (e.g. the leading space at the start of a sentence)
    word.append(sentence[i].split(' '))      # split into words and append the list to word

Print the result:

print(word)

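Input paragraph:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing. Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Output: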

[['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry'], 
['Lorem', 'Ipsum', 'has', 'been', 'the', "industry's", 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s,', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book'], 
['It', 'has', 'survived', 'not', 'only', 'five', 'centuries,', 'but', 'also', 'the', 'leap', 'into', 'electronic', 'typesetting,', 'remaining', 'essentially', 'unchanged'], 
['It', 'was', 'popularised', 'in', 'the', '1960s', 'with', 'the', 'release', 'of', 'Letraset', 'sheets', 'containing'], 
['Lorem', 'Ipsum', 'passages,', 'and', 'more', 'recently', 'with', 'desktop', 'publishing', 'software', 'like', 'Aldus', 'PageMaker', 'including', 'versions', 'of', 'Lorem', 'Ipsum']]

To split text in which characters other than the period '.' are also used to end sentences, you can use the re.split() function.
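As a rough sketch of the `re.split()` idea, the example below treats `.`, `!` and `?` as sentence terminators. The character class and the sample text are assumptions chosen for illustration; real text with abbreviations or decimal numbers would need a more careful pattern.

```python
import re

text = "Is this dummy text? Yes it is! It has survived five centuries."

# Split on '.', '!' or '?', then drop empty pieces and surrounding whitespace.
sentences = [s.strip() for s in re.split(r'[.!?]', text) if s.strip()]

# split() with no argument splits on any run of whitespace.
word_matrix = [s.split() for s in sentences]
print(word_matrix)
```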
