Python: yield lines joined on trailing backslashes while excluding comment blocks

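I'm currently trying to create a generator function that yields one line of a file at a time, while ignoring comment blocks and joining lines that end in a backslash to the next line. So, for this text: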

# this entire line is a comment - don't include it in the output
<line0>
# this entire line is a comment - don't include it in the output
<line1># comment
<line2>
# this entire line is a comment - don't include it in the output
<line3.1 \
line3.2 \
line3.3>
<line4.1 \
line4.2>
<line5># comment \
# more comment1 \
more comment2>
<line6>
# here's a comment line continued to the next line \
this line is part of the comment from the previous line

it would produce the following output:

<line0>
<line1>
<line2>
<line3.1 
line3.2 
line3.3>
<line4.1 
line4.2>
<line5>
more comment2>
<line6>
this line is part of the comment from the previous line

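You have two operators, # and \. The latter takes precedence over the former, which means you should check for it and handle it first. Here is a simple approach that uses a list as a buffer to build up lines: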

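def my_generator(f):
    buffer = []  # pieces of a logical line continued with a backslash
    for line in f:
        line = line.rstrip('\n')
        # Handle the continuation first: it takes precedence over '#'.
        if line.endswith('\\'):
            buffer.append(line[:-1])  # drop the trailing backslash
            continue
        line = ''.join(buffer) + line
        buffer = []
        # Strip the comment from the fully joined line.
        if '#' in line:
            line = line[:line.index('#')]
        # Skip lines that are empty once the comment is removed.
        if line:
            yield line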
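One edge case the generator above does not cover: if the input ends on a line with a trailing backslash, the buffered text is never yielded. The sketch below is a hypothetical variant (the name `joined_lines` and the helper `finish` are not from the answer) that flushes the buffer at the end. It also strips `\r`, on the assumption that the file might use Windows line endings.

```python
def joined_lines(lines):
    """Hypothetical variant of my_generator that flushes a dangling continuation."""
    buffer = []

    def finish(parts):
        # Join the buffered pieces, then cut everything from the first '#'.
        return ''.join(parts).split('#', 1)[0]

    for line in lines:
        line = line.rstrip('\r\n')  # assumption: tolerate CRLF line endings
        if line.endswith('\\'):
            buffer.append(line[:-1])  # drop the backslash, keep collecting
            continue
        buffer.append(line)
        result = finish(buffer)
        buffer = []
        if result:
            yield result

    if buffer:  # the input ended on a backslash
        result = finish(buffer)
        if result:
            yield result


print(list(joined_lines(["<a \\", "b># note", "<c \\"])))
```

Whether a dangling continuation should be yielded or treated as an error depends on the file format, so this is only one possible choice.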

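The nice thing about wrapping an iterable of lines and relying on duck typing is that you can pass in anything that behaves like a container of strings, not just a text file: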

# Raw string, so that Python does not consume the backslash-newline pairs itself.
text = r"""# this entire line is a comment - don't include it in the output
<line0>
# this entire line is a comment - don't include it in the output
<line1># comment
<line2>
# this entire line is a comment - don't include it in the output
<line3.1 \
line3.2 \
line3.3>
<line4.1 \
line4.2>
<line5># comment \
# more comment1 \
more comment2>
<line6>
# here's a comment line continued to the next line \
this line is part of the comment from the previous line"""

for line in my_generator(text.splitlines()):
    print(line)
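To illustrate the duck-typing point, here is a small sketch that feeds the same generator from a list and from a file-like object. The generator is repeated so the snippet runs on its own. `io.StringIO` stands in for a real file opened with `open()`, and the sample text is made up for the demonstration.

```python
import io


def my_generator(f):
    buffer = []
    for line in f:
        line = line.rstrip('\n')
        if line.endswith('\\'):
            buffer.append(line[:-1])
            continue
        line = ''.join(buffer) + line
        buffer = []
        if '#' in line:
            line = line[:line.index('#')]
        if line:
            yield line


# Raw string so the backslashes reach the generator untouched.
sample = r"""# header comment
<a \
b>
<c># trailing comment
"""

from_list = list(my_generator(sample.splitlines()))
# StringIO iterates line by line (newlines included), just like a file object.
from_file = list(my_generator(io.StringIO(sample)))

print(from_list)
print(from_file)
```

Both calls go through the same code path because the generator only needs something it can iterate over to get strings.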

I suggest using the re.sub approach:

import re


def line_gen(text: str):
    # Join continued lines: drop the whitespace, the backslash and the newline.
    text = re.sub(r"\s+\\\n", '', text)
    # Remove any comment that is followed by a newline.
    text = re.sub(r"#(.*)\n", '\n', text)
    # If the last line is a comment it won't have a final \n,
    # so remove it separately.
    text = re.sub(r"#.*", '', text)

    # Split on line boundaries (not on every space) and skip blank lines.
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield line


with open("/tmp/data.txt") as f:
    text = f.read()

for line in line_gen(text):
    print(line)

Contents of data.txt:

# this entire line is a comment - don't include it in the output
<line0>
# this entire line is a comment - don't include it in the output
<line1># comment
<line2>
# this entire line is a comment - don't include it in the output
<line3.1 \
line3.2 \
line3.3>
<line4.1 \
line4.2>
<line5># comment \
# more comment1 \
more comment2>
<line6>
# here's a comment line continued to the next line \
this line is part of the comment from the previous line

Result:

<line0>
<line1>
<line2>
<line3.1line3.2line3.3>
<line4.1line4.2>
<line5>
<line6>
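The result above joins the continued pieces with no separator. If a single space between the pieces is preferred, one possible variant is to replace each continuation with a space instead of an empty string. This is only a sketch under that assumption. The name `line_gen_spaced` is hypothetical, and the pattern matches only spaces and tabs before the backslash rather than any whitespace.

```python
import re


def line_gen_spaced(text: str):
    # Replace optional spaces/tabs + backslash + newline with one space.
    text = re.sub(r"[ \t]*\\\n", " ", text)
    # Remove comments up to the end of each line ('.' does not match '\n').
    text = re.sub(r"#.*", "", text)
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield line


# Raw string so the backslashes are kept exactly as a file would contain them.
sample = r"""<line3.1 \
line3.2 \
line3.3>
<line5># comment \
# more comment1 \
more comment2>
<line6>
"""

for line in line_gen_spaced(sample):
    print(line)
```

Because the continuation is resolved before comments are stripped, a comment that ends in a backslash still swallows the following line, as in the original approach.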

For some reason, when I run the code, the joined lines don't have any spaces between them, and when I use line = ''.join(linebuff) + line the spaces still only appear in one place.

@JimT Unless you are using different text, the spaces are in the lines themselves, in front of the backslash. If you want a trailing space, you can do ''.join(buffer) + ' ' + line.

I'm using the same text. That code does produce a space after each item in lines 3 and 4, but now every other line except line 5 also starts with a space, and there are two blank lines after line 2.

@JimT I don't know what to tell you. I copied and pasted the code straight into my editor, then copied and pasted the result back. Are you modifying anything in the code, or typing it in by hand?

The code hasn't been modified. I wonder whether there is a difference between reading a file into Python and having the text in the code?

You can use the built-in re.sub method. That reduces the problem to replacing two string patterns, and your code will be simpler, more readable, and perform better.

This solution also works well when the text is in the code, but not when I read the file with file_name = open(path/to/file.txt, 'r') and then Lines = file_name.read().

@JimT I'm starting to suspect there is something wrong with your file. I can reproduce Raydel's result perfectly, and neither of our solutions shows a problem.

@JimT I adjusted the solution so that it gives the correct result when the text comes from a file.

@RaydelMiranda That is exactly the output Mad Physicist and I were getting before, but ideally the output for lines 3 and 4 would keep the spaces between the joined pieces. Fiddling with the …\\\n part of the regular expression gives:
<line0>
<line1>
<line2>
<line3.1line3.2line3.3>
<line4.1line4.2>
<line5>
<line6>
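On the question raised in the comments about text in the code versus text read from a file: in a regular (non-raw) triple-quoted string, Python itself treats a backslash followed by a newline as a continuation and removes both, so the continuation never reaches the generator or the regular expression. A minimal sketch of the difference (the variable names are only illustrative):

```python
# Regular string: Python consumes the backslash and the newline.
pasted = """<line3.1 \
line3.2>"""

# Raw string: the same characters a file would contain.
raw = r"""<line3.1 \
line3.2>"""

print(repr(pasted))  # a single line, space kept, no backslash
print(repr(raw))     # two lines, backslash still present
```

This may explain why the same code behaves differently on pasted text and on file contents. Using a raw string (or reading the sample from a file) makes the two cases comparable.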