Regular expressions and Python - cleaning up a UTF8 text file

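New to Python. Using 2.7.3. This is a course assignment for a required class in a paralegal degree.

I want to read a UTF8 text file containing a draft court statement and tidy it up as follows:

Read the input text file line by line.

For each line: (1) capitalize the first letter of every sentence (including the first character of the line); (2) make sure every comma, period, and semicolon is followed by a space character.

Write the output text file line by line.

This is what I came up with after reading other Stack Overflow posts. It does not work well. Please help. Thanks, everyone.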


import codecs
import sys
import os
import re

reload(sys)
sys.setdefaultencoding('utf8')


with codecs.open('test.txt', 'r', encoding='utf8') as file:
    filedata = file.read().replace(' \r\n', '\r\n')

re.sub(r'(?<=[.,;])(?=[^\s])', r' ', filedata)

rtn = re.split('([.!?] *)', filedata)
filedata = ''.join([i.capitalize() for i in rtn])
filedata = filedata[0].upper() + filedata[1:] 


with codecs.open('output.txt', 'w') as file:
    file.write(filedata)

Expected output:

Instead of arguing, ask her, "What can I do?"  Forgo postponing the problem.  Instead, talk to her.  That single gesture will promote peace.
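
For readers wondering why the snippet in the question falls short, two standard Python behaviours look like the likely cause. This is only a reading of the posted code, not something the asker confirmed: `re.sub` returns a new string instead of changing its argument, and `str.capitalize` lowercases everything after the first character. The sample strings below are made up for illustration.

```python
import re

text = 'talk to Her,then rest.'

# re.sub returns a new string; the argument itself is never modified,
# so a call whose result is discarded has no effect on `text`.
re.sub(r'(?<=[.,;])(?=\S)', ' ', text)
unchanged = text

# Assigning the result keeps the inserted space.
spaced = re.sub(r'(?<=[.,;])(?=\S)', ' ', text)

# str.capitalize uppercases the first character and lowercases the rest,
# which also lowercases names and the pronoun "I".
capitalized = 'ask Her what I can do'.capitalize()

print(unchanged)
print(spaced)
print(capitalized)
```

Uppercasing only the first character of each sentence, as the answers here do, avoids the second problem.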

You can try this:

import re

filedata = 'instead of arguing,ask her, "what can i do?"forgo postponing the problem. instead, talk to her.that single gesture will promote peace.'

print(filedata)

# add a space after . , ; or " when the next character is not whitespace
filedata = re.sub(r'([,\.;"])((?!\s).)', r'\1 \2', filedata)

# clean space in quotation: " What can i do?"
filedata = re.sub(r'"\s([^"]+")', r'"\1', filedata)

# make uppercase first letter of sentence or after dot and quote
filedata = re.sub(r'(^.|\.\s\w|"\s?\w)', lambda m: m.group(1).upper(), filedata)

print(filedata)

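Another option is to use a single pattern and check which capture group took part in the match, so that each case gets a different replacement.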

("[^"]*"|\.|^)\s*(\S)|([,;])(?=\S)
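Explanation

  • ("[^"]*"|\.|^)
    Capture group 1: match from " to the next ", or match a dot, or assert the start of the string
  • \s*(\S)
    Match 0+ whitespace chars and capture a single non-whitespace char in group 2
  • |
    Or
  • ([,;])(?=\S)
    Capture either , or ; in group 3 and assert a non-whitespace char directly to the right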

Output

Instead of arguing, ask her, "what can i do?" Forgo postponing the problem. Instead, talk to her. That single gesture will promote peace.
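
The group behaviour described in the explanation can be inspected with `re.finditer`. The following is a small sketch on a made-up sample string (`sample` and `matches` are names chosen for this illustration); groups that did not take part in a match come back as `None`, and group 1 is an empty string when only `^` matched.

```python
import re

pattern = re.compile(r'("[^"]*"|\.|^)\s*(\S)|([,;])(?=\S)')
sample = 'a,b. c "d"e'

# Collect (group 1, group 2, group 3) for every match in the sample.
matches = [m.groups() for m in pattern.finditer(sample)]

for groups in matches:
    print(groups)
```

Seeing which slot is filled for each match makes it easier to decide what the replacement function should return in each case.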

Note that \S and [^\s] both match a non-whitespace character, so (?=[^\s]) from the question can be written as (?=\S).
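
A quick way to check how these character classes behave is to run them over a short sample. This is only an illustrative sketch; the variable names and the sample string are made up.

```python
import re

sample = 'a b\tc'

# The shorthand and the negated class select the same characters.
non_space_shorthand = re.findall(r'\S', sample)
non_space_negated = re.findall(r'[^\s]', sample)

# Negating the non-whitespace class gives the whitespace characters back.
whitespace_only = re.findall(r'[^\S]', sample)

print(non_space_shorthand)
print(non_space_negated)
print(whitespace_only)
```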

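Since you are new to Python, you should use Python 3.x, preferably the latest release. Python 3 completely changed how Unicode is handled, which is a big improvement over Python 2.x. Python 2.x is effectively obsolete.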

import re

regex = r'("[^"]*"|\.|^)\s*(\S)|([,;])(?=\S)'
s = 'instead of arguing,ask her, "what can i do?"forgo postponing the problem. instead, talk to her.that single gesture will promote peace.'

def repl(m):
    if m.group(3):  # comma or semicolon: append the missing space
        return m.group(3) + " "
    # Group 1 is empty when ^ matched, so no space is added at the start of the string.
    return (m.group(1) + " " if m.group(1) else "") + m.group(2).upper()

print(re.sub(regex, repl, s))

Output:

Instead of arguing, ask her, "what can i do?" Forgo postponing the problem. Instead, talk to her. That single gesture will promote peace.
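
None of the answers shows the file handling the question asks for, so here is a minimal sketch of a line-by-line version, assuming Python 3 and reusing the single pattern from above. The names `clean_line`, `clean_file`, and `_replace` are hypothetical helpers invented for this example. Note the limits of the pattern: it does not capitalize inside quotes or after `?` and `!`, and it drops leading whitespace on a line.

```python
import re

# Single pattern from the answer above.
PATTERN = re.compile(r'("[^"]*"|\.|^)\s*(\S)|([,;])(?=\S)')


def _replace(match):
    # Group 3: a comma or semicolon that is missing its trailing space.
    if match.group(3):
        return match.group(3) + ' '
    # Group 1 is '' when ^ matched, so no space is added at line start.
    prefix = match.group(1)
    separator = ' ' if prefix else ''
    return prefix + separator + match.group(2).upper()


def clean_line(line):
    """Return one line with spacing and sentence capitals fixed."""
    return PATTERN.sub(_replace, line)


def clean_file(src_path, dst_path):
    """Read src_path line by line as UTF-8 and write the cleaned lines."""
    # newline='' keeps the original line endings untouched.
    with open(src_path, encoding='utf-8', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(clean_line(line))


print(clean_line('instead of arguing,ask her.that is all.'))
```

With these helpers, `clean_file('test.txt', 'output.txt')` would process the file names used in the question. Passing `encoding='utf-8'` to `open` replaces the `codecs.open` and `sys.setdefaultencoding` workarounds from the Python 2 code.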