How do I create a two-dimensional array of the words in each sentence of a text in Python?

Tags: python, arrays, list

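I have a text with, let's say, 5 sentences:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing. Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Using Python, how can I convert this into a two-dimensional array in which each sentence is split into its individual words?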

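If we take the first sentence as an example, this is what I need as the first element of the array:

['lorem', 'ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry']
I can get that with the following commands:

import re

string = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'

string = string.lower()
arrWords = re.split('[^a-z]', string)    # split at every non-letter character
arrWords = list(filter(None, arrWords))  # drop the empty strings left by the split
print(arrWords)

But how do I build an array of these elements by looping over the sentences of the text?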

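While it is generally hard to tell exactly where a sentence ends, in this case every sentence is terminated by a period, so we can use the periods to split the paragraph into sentences. You already have code to split a sentence into words, but here it is anyway:

paragraph = "Lorem Ipsum ... "
sentences = []
while paragraph.find('.') != -1:
    index = paragraph.find('.')               # position of the next period
    sentences.append(paragraph[:index+1])     # keep the sentence including its period
    paragraph = paragraph[index+1:]           # continue with the rest of the text

print(sentences)
Output: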

['Lorem Ipsum is simply dummy text of the printing and typesetting industry.', 
"Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", 
'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.', 
'It was popularised in the 1960s with the release of Letraset sheets containing.', 
'Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.']
Then we convert them all into word arrays:

word_matrix = []
for sentence in sentences:
    # strip surrounding whitespace, then split the sentence at single spaces
    word_matrix.append(sentence.strip().split(' '))

print(word_matrix)
Which outputs:

[['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry.'], 
['Lorem', 'Ipsum', 'has', 'been', 'the', "industry's", 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s,', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book.'], 
['It', 'has', 'survived', 'not', 'only', 'five', 'centuries,', 'but', 'also', 'the', 'leap', 'into', 'electronic', 'typesetting,', 'remaining', 'essentially', 'unchanged.'], 
['It', 'was', 'popularised', 'in', 'the', '1960s', 'with', 'the', 'release', 'of', 'Letraset', 'sheets', 'containing.'], 
['Lorem', 'Ipsum', 'passages,', 'and', 'more', 'recently', 'with', 'desktop', 'publishing', 'software', 'like', 'Aldus', 'PageMaker', 'including', 'versions', 'of', 'Lorem', 'Ipsum.']]
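
The two steps above can also be folded into one helper. The following is a minimal sketch, assuming (as above) that every sentence ends with a period. The function name `sentences_to_words` is made up for illustration, and the lower-casing and the `[^a-z]` split are borrowed from the question so that each row matches the format the asker wanted.

```python
import re


def sentences_to_words(paragraph):
    """Return one list of lower-case words per period-terminated sentence."""
    word_matrix = []
    for sentence in paragraph.split('.'):
        # Same word split as in the question: lower-case, cut at non-letters.
        words = [w for w in re.split('[^a-z]', sentence.lower()) if w]
        if words:  # skip the empty piece after the final period
            word_matrix.append(words)
    return word_matrix


text = "Lorem Ipsum is simply dummy text. It has survived five centuries."
print(sentences_to_words(text))
# [['lorem', 'ipsum', 'is', 'simply', 'dummy', 'text'], ['it', 'has', 'survived', 'five', 'centuries']]
```

Note that the `[^a-z]` pattern also cuts at apostrophes and digits, so "industry's" would come out as 'industry' and 's'.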

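The challenge here is how to determine the end of a sentence. I think you could use a regular expression to cover most cases, but the simple list comprehension shown below will handle the dummy text, since every sentence in it ends with a period.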

    x = "Lorem Ipsum is simply dummy ..."

    words = [sentence.split(" ") for sentence in x.split(". ")]
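
One detail worth checking when trying this on the full text: splitting on ". " removes the periods between sentences, but not the one after the very last sentence. A small sketch, assuming a shortened two-sentence sample string in place of the full paragraph, that strips the final period first:

```python
x = "Lorem Ipsum is simply dummy text. It has survived five centuries."

# Plain version from above: the last word keeps its period.
words = [sentence.split(" ") for sentence in x.split(". ")]
print(words[-1][-1])   # centuries.

# Stripping the trailing period before splitting avoids that.
words = [sentence.split(" ") for sentence in x.rstrip(".").split(". ")]
print(words)
# [['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text'], ['It', 'has', 'survived', 'five', 'centuries']]
```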

Assuming every sentence ends with '.' (as in the example you gave).

Setup:

para = input("Enter the Para : ")   # input: the paragraph
sentence = []       # stores the list of sentences
word = []           # stores the final 2D list
List of sentences:

sentence = para.split('.')  # split at '.' (periods)
sentence.pop()              # the last element is '' because the text ends with '.', so pop it
Get the list of words:

for i in range(len(sentence)):                  # go through each sentence
    sentence[i] = sentence[i].strip(" ")        # strip whitespace (the leading space at the start of a sentence)
    word.append(sentence[i].split(' '))         # split into words and append the list to word
Print the result:

print(word)
Input the following paragraph:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing. Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Output:

[['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry'], 
['Lorem', 'Ipsum', 'has', 'been', 'the', "industry's", 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s,', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book'], 
['It', 'has', 'survived', 'not', 'only', 'five', 'centuries,', 'but', 'also', 'the', 'leap', 'into', 'electronic', 'typesetting,', 'remaining', 'essentially', 'unchanged'], 
['It', 'was', 'popularised', 'in', 'the', '1960s', 'with', 'the', 'release', 'of', 'Letraset', 'sheets', 'containing'], 
['Lorem', 'Ipsum', 'passages,', 'and', 'more', 'recently', 'with', 'desktop', 'publishing', 'software', 'like', 'Aldus', 'PageMaker', 'including', 'versions', 'of', 'Lorem', 'Ipsum']]

For sentences that end with characters other than the period '.', you can use the re.split() function.
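
A sketch of the re.split() idea, assuming sentences may end in '.', '!' or '?' (the character class and the sample text are assumptions made for this example). re.split() takes the place of the para.split('.') step, and the rest of the approach stays the same:

```python
import re

para = "Is this dummy text? Yes, it is. It has survived five centuries!"

# Split at any of the assumed sentence terminators.
sentence = re.split(r'[.!?]', para)

word = []
for s in sentence:
    s = s.strip()
    if s:                      # skip empty pieces, e.g. after the last terminator
        word.append(s.split(' '))

print(word)
# [['Is', 'this', 'dummy', 'text'], ['Yes,', 'it', 'is'], ['It', 'has', 'survived', 'five', 'centuries']]
```

Checking for empty pieces replaces the pop() call, so the code also works when the text does not end with a terminator.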

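Remove the commas, then split on '.', then split again on whitespace (split without an argument).

This leaves an empty list at the end, which you can remove by slicing or with filter(None, ...).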
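
A minimal sketch of the steps described above, assuming a shortened sample string; the variable names are chosen for illustration and are not taken from the original answer:

```python
text = "Lorem Ipsum passages, and more recently. It has survived, unchanged."

# 1. remove commas, 2. split on '.', 3. split each piece on whitespace.
rows = [piece.split() for piece in text.replace(',', '').split('.')]
print(rows[-1])   # [] (the piece after the final period is empty)

# Drop the empty list either by slicing ...
print(rows[:-1])
# ... or with filter(None, ...), which discards every empty list.
print(list(filter(None, rows)))
# [['Lorem', 'Ipsum', 'passages', 'and', 'more', 'recently'], ['It', 'has', 'survived', 'unchanged']]
```

Calling split() without an argument splits on any run of whitespace, which is why the leading space of the second sentence does not produce an empty string.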


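You need to split the text into sentences and then into words. Deciding where a sentence ends can be tricky. Have you looked at the NLTK package for Python?

[i.split(' ') for i in string.split('.')] will give a list of sentences, each containing a list of words. Hope this helps!

Just a small correction: in the example specification given by @roman_js ("If we take the first sentence as an example, this is what I need as the first element of the array: ['lorem', 'ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry']"), there is no period '.' at the end of the list.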