Removing "\n" characters when printing sentences from a text file in Python


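I am trying to print a list of sentences from a text file (one of the Project Gutenberg e-books). When I print the file as a single string, it looks fine: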

file = open('11.txt','r+')
alice = file.read()
print(alice[:500])
The output is:

ALICE'S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'

So she was considering in her own mind (as well as she could, for the
hot d
Now, when I split it into sentences (the assignment specifically says to do this by "splitting on periods", so it is a very simple split), I get:

>>> print(sentences[:5])
["ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3", '0\n\n\n\n\nCHAPTER I', " Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversations?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her", "\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seemed quite natural); but when the Rabbit actually TOOK A WATCH\nOUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,\nAlice started to her feet, for it flashed across her mind that she had\nnever before seen a rabbit with either a waistcoat-pocket, or a watch\nto take out of it, and burning with curiosity, she ran across the field\nafter it, and fortunately was just in time to see it pop down a large\nrabbit-hole under the hedge", '\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again']

Where do the extra "\n" characters come from, and how do I remove them?

You may not want to use regular expressions, but I would:

import re
new_sentences = []
for s in sentences:
    new_sentences.append(re.sub(r'\n{2,}', '\n', s))
This replaces every run of two or more '\n' characters with a single newline, so you keep the line breaks but lose the "extra" ones.

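If you would rather modify the existing list in place than create a new one, you can loop with enumerate and assign back by index (credit to @gavriel and Andrew L.: I didn't think of using enumerate when I first posted this answer).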
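A minimal sketch of that in-place variant, assuming the same pattern as the snippet above. The short `sentences` list here is a made-up stand-in for the result of splitting the book on periods:

```python
import re

# Hypothetical stand-in for the list produced by alice.split('.')
sentences = [
    "EDITION 3",
    "0\n\n\n\n\nCHAPTER I",
    " Down the Rabbit-Hole\n\nAlice was beginning",
]

# Overwrite each element by index instead of building a second list
for i, s in enumerate(sentences):
    sentences[i] = re.sub(r'\n{2,}', '\n', s)

print(sentences)
```

Single newlines inside a sentence are left untouched; only runs of two or more are collapsed.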


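The extra newlines are not really extra: they are supposed to be there, and they are visible in the text in your question. The more '\n' characters there are, the more visible space there is between lines of text (i.e., there is one between the chapter heading and the first paragraph, and many between the edition and the chapter heading).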
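It may also help to see why the newlines only show up as a literal \n once the text is in a list. Printing a string renders each newline as a line break, while printing a list shows the repr() of each element, where a newline is written as the escape sequence \n. A small sketch with a made-up `text` value:

```python
text = "CHAPTER I\n\nAlice was beginning"

# Printing the string itself renders the newlines as real line breaks
print(text)

# Printing a list shows repr() of each element, so newlines appear as \n
print([text])

# repr() on the string alone gives the same escaped view
print(repr(text))
```

The characters are identical in both cases; only the way they are displayed differs.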

This small example shows where the \n characters come from:

alice = """ALICE'S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'

So she was considering in her own mind (as well as she could, for the
hot d"""

print(len(alice.split(".")))
print(len(alice.split("\n")))
It all depends on how you split the text. The example above gives the following output:

3
19
This means that splitting the text on '.' gives 3 substrings, while splitting it on '\n' as the delimiter gives 19 substrings.


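In your case, you have already split the text on '.', so the 3 substrings contain multiple '\n' newline characters. To get rid of them, you can either split those substrings again or remove the newlines from each one.

If you want to replace every run of newlines with a single space, do the following: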

import re
new_sentences = [re.sub(r'\n+', ' ', s) for s in sentences]
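If you would rather avoid regular expressions, one possible alternative is to split each sentence on whitespace and join the pieces back with single spaces. This is only a sketch with a made-up `sentences` list, and it behaves slightly differently from the re.sub version: it also trims leading and trailing whitespace and collapses runs of spaces.

```python
# Hypothetical stand-in for the list produced by alice.split('.')
sentences = [
    "0\n\n\n\n\nCHAPTER I",
    " Down the Rabbit-Hole\n\nAlice was\nbeginning",
]

# str.split() with no argument splits on any run of whitespace,
# including newlines, and drops empty pieces
cleaned = [' '.join(s.split()) for s in sentences]

print(cleaned)
```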

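The text uses newlines as well as periods to separate sentences. You have the problem that simply replacing the newline characters with an empty string leaves words with no space between them. Before you split alice on '.', I would use something along the lines of @elethan's solution to replace every run of multiple newlines in alice with a '.'. Then you can call alice.split('.'), and the sentences separated by multiple newlines will be split appropriately, along with the sentences originally separated by '.'.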


Then your only remaining problem is the decimal point in the version number.
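A rough sketch of that idea, under a few assumptions: the `sample` string is a shortened, made-up excerpt, and the final cleanup step (turning the remaining single newlines into spaces and dropping empty pieces) is an addition for readability, not something the approach itself requires.

```python
import re

# Shortened, made-up excerpt in the same shape as the book's opening
sample = (
    "EDITION 3.0\n\n\n"
    "CHAPTER I. Down the Rabbit-Hole\n\n"
    "Alice was tired\nof sitting. She peeped."
)

# Treat every run of two or more newlines as a sentence boundary
marked = re.sub(r'\n{2,}', '.', sample)

# Split on periods, then turn remaining single newlines into spaces
# and drop pieces that are empty after stripping
sentences = [
    s.replace('\n', ' ').strip()
    for s in marked.split('.')
    if s.strip()
]

print(sentences)
```

The version number still ends up split into "EDITION 3" and "0", which is the remaining problem mentioned above.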

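If you are splitting into sentences, what exactly does that mean? Are you supposed to split on every newline character, or every time you see a period? If it is periods, are you supposed to ignore the period in "CHAPTER I. Down the Rabbit-Hole"?
The \n characters are newlines written as an escape sequence.
@idjaw It is very vague. This is a very basic review assignment, so my assumption is that I only need to show that I know how to use the split function, without getting into the details of what constitutes a sentence. That said, I have looked at some question threads on splitting strings into sentences that discuss more "correct" approaches than the version I posted.
Very nice. I hadn't thought of using a list comprehension!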
file = open('11.txt','r+')
file.read().split('\n')