Python: Removing specific punctuation from words in a text file


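I need to write a function, get_words_from_file(filename), that returns a list of lowercase words. The function should process only the lines between the start marker line and the end marker line. The words should be in the same order as they appear in the file. Here is a sample text file, baboosh.txt: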

*** START OF TEST CASE ***
......list of sentences here.....
*** END OF TEST CASE ***
This is after the end and should be ignored too.

Here is what I came up with:

import re
from string import punctuation

def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line


def lines_from_file(fname):
    with open(fname, 'rt') as flines:
        for line in stripped_lines(flines):
            yield line


def is_marker_line(line, start='***', end='***'):
    '''
    Marker lines start and end with the given strings, which may not
    overlap. (A line containing just '***' is not a valid marker line.)
    '''
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)

def advance_past_next_marker(lines):
    '''
    Consumes lines from the iterator up to and including the next marker line.
    '''
    for line in lines:
        if is_marker_line(line):
            break


def lines_before_next_marker(lines):

    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        line = line.replace('"', '')  # str.replace returns a new string, so keep the result
        valid_lines.append(line)


    for content_line in valid_lines:
        yield content_line


def lines_between_markers(lines):
    '''
    Yields the lines between the first two marker lines.
    '''
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line


def words(lines):
    text = '\n'.join(lines).lower().split()
    return text


def get_words_from_file(fname):
    return words(lines_between_markers(lines_from_file(fname)))

#This is the test code that must be executed
filename = "baboosh.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)


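In my output I get the correct list of words, but when they are printed they still carry punctuation such as colons, semicolons, and periods. I don't know how to strip these.

How can I do this?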

Use re.split instead of str.split. If you set up a compiled regular expression like this:

splitter = re.compile('[ ;:".]')

then you can split a line with:

word_list = splitter.split(line)

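This returns the words without that punctuation. Consecutive delimiters (for example a period followed by a space) produce empty strings in the result, so filter those out.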
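As a rough sketch of how this could slot into the question's words() function: the version below splits each line with a compiled pattern and drops empty strings. The delimiter set is an assumption (whitespace plus the four punctuation characters above), and the trailing + is an addition that collapses runs of delimiters; widen the character class if the file contains other punctuation. Note that a period inside a token, such as 3.14, would be split as well.

```python
import re

# Assumed delimiter set: whitespace plus ; : " and .
# The trailing + treats a run such as '. ' as a single split point.
SPLITTER = re.compile(r'[\s;:".]+')


def words(lines):
    """Return the lowercase words from an iterable of lines, in order."""
    result = []
    for line in lines:
        # A delimiter at the start or end of a line still yields '',
        # so empty strings are filtered out.
        result.extend(w for w in SPLITTER.split(line.lower()) if w)
    return result


sample = ['He said: "Stop; now."', 'Then nothing.']
print(words(sample))
# ['he', 'said', 'stop', 'now', 'then', 'nothing']
```

Because it keeps the same name and accepts any iterable of lines, this should work as a drop-in replacement inside get_words_from_file without changing the marker-handling generators.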
