Python: how to replace or delete all Traditional Chinese strings in multiple files across multiple directories

I tried to replace all Chinese strings with "#", but it does not seem to work.

import os,re
path = 'F:\\project\\test'
files = []
# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        files.append(os.path.join(r, file))
for file in files:
    with open(file, 'rb') as infile:
        while True:
            content = infile.readline()
            if re.match(r'(.*[\u4E00-\u9FA5]+)|([\u4E00-\u9FA5]+.*)', content.decode('utf-8')):
                print(content.decode('utf-8'))
                content.decode('utf-8').replace(content.decode('utf-8'),"#")
                print(content.decode('utf-8'))

I found some code that can extract the Chinese or the non-Chinese text from a txt file (but I do not know how to use it).

I can replace English characters like this:

import fileinput,re
filename='F:\\project\\test\\test_script.txt'
with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
    for line in file:
        #pattern = re.compile(r'[^\u4e00-\u9fa5]')
        #chinese = re.sub(pattern, '', str)
        print(line.replace('aaaa', '#'), end='')
        #print(chinese)

import fileinput,re
filename='F:\\project\\test\\test_script.txt'
with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
    for line in file:
        pattern = re.compile(r'[^\u4e00-\u9fa5]')
        chinese = re.sub(pattern, '', str)
        # print(line.replace('aaaa', '#'), end='')
        print(line.replace(chinese, '#'), end='')

But if the txt file contains something like

the console shows UnicodeDecodeError: 'cp950' codec can't decode byte 0xa0 in position 2: illegal multibyte sequence, and the txt file ends up empty.

Python strings are immutable, so replacing content creates a new string with different content; it cannot work in place.

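Related to the above: once you have read a string from a file, it is no longer tied to that file. If you want to modify the file, you need to write the string back at some point (if you do not want to modify it, carry on).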

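If you assume you are only dealing with UTF-8 files, you can pass encoding='utf-8' to open() and drop the b flag from the mode; Python will then do the encoding and decoding by itself.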
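
The three points above (immutability, writing back, text mode with an explicit encoding) can be sketched together as follows. This is only an illustration: the sample file and the names path, content and updated are made up for the demonstration, and it assumes the file really is UTF-8.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("abc 中文 def\n")

# Text mode ("r", no "b") plus an explicit encoding: Python decodes on read,
# regardless of the platform default such as cp950.
with open(path, "r", encoding="utf-8") as f:
    content = f.read()

# str.replace returns a new string; content itself is left untouched,
# so the result has to be kept in a variable...
updated = content.replace("中文", "#")

# ...and written back, otherwise the file on disk never changes.
with open(path, "w", encoding="utf-8") as f:
    f.write(updated)
```

Reading the whole file with read() is fine for small scripts and text files; for very large files a line-by-line loop writing to a second file would be the safer choice.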

content.replace(content, "#") means replacing the entire line with a single #, not just the CJK data.

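The re (regex) module supports search and replace directly, with either a static replacement or a callback function: re.sub (where "sub" stands for "substitute").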
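
As a rough illustration of both forms of re.sub, here is a small sketch. The sample line and the variable names are invented for the example, and the character range is simply the one from the question.

```python
import re

# Same BMP range as in the question; it is only one of several CJK blocks.
CJK = re.compile(r"[\u4e00-\u9fa5]+")

line = "name=張三 city=臺北\n"

# Static replacement: each run of CJK characters becomes a single '#'.
static = CJK.sub("#", line)

# Callback replacement: the function receives the match object and returns
# the replacement text, here one '#' per matched character.
per_char = CJK.sub(lambda m: "#" * len(m.group()), line)

# An empty replacement deletes the runs instead of masking them.
deleted = CJK.sub("", line)

print(static, per_char, deleted, sep="")
```

Only the matched characters are touched; the rest of the line, including the newline, is preserved.
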
I am also not sure why you collect all the files into one big list before doing the replacement; why not do the work inside the os.walk iteration?
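
One possible shape for doing the work inside the os.walk loop is sketched below. The function name replace_cjk_in_tree and its parameters are hypothetical, and the sketch assumes every file under the root is UTF-8 text; a file in another encoding would raise UnicodeDecodeError.

```python
import os
import re

CJK = re.compile(r"[\u4e00-\u9fa5]+")  # range taken from the question


def replace_cjk_in_tree(root, replacement="#"):
    """Rewrite every file under root; return the paths that were changed."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # newline="" keeps the file's original line endings as they are.
            with open(path, "r", encoding="utf-8", newline="") as f:
                text = f.read()
            # subn returns the new text and the number of substitutions made.
            new_text, count = CJK.subn(replacement, text)
            if count:
                # Write back only when something actually changed.
                with open(path, "w", encoding="utf-8", newline="") as f:
                    f.write(new_text)
                changed.append(path)
    return changed
```

It would be called as replace_cjk_in_tree('F:\\project\\test'), or with replacement="" to delete the characters. Since the files are overwritten, trying it on a copy of the directory first is a sensible precaution.
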
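Note that the range you specified is only the main BMP CJK range. Six extensions have been added since then (CJK Unified Ideographs Extension A through F; Extension A sits in the BMP at U+3400–U+4DBF, while B through F are "astral"), and a seventh was being planned at the time of writing, not to mention the older "compatibility" range inside the BMP (U+F900–U+FAFF).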
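I am also not sure why you do not go up to U+9FFF, which is the actual end of the block, even though U+9FF0 and above were not yet assigned at the time of writing.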
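
To make the range issue concrete, here is a sketch of a wider character class built from the block boundaries of Extensions A through F plus the compatibility range. The names CJK_BLOCKS, CJK_WIDE and narrow are invented for the example, and later Unicode versions have added further extensions that this list does not cover.

```python
import re

# Block boundaries for the ranges mentioned above (Extensions A-F only).
CJK_BLOCKS = [
    (0x3400, 0x4DBF),    # Extension A (in the BMP)
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs, full block
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2B820, 0x2CEAF),  # Extension E
    (0x2CEB0, 0x2EBEF),  # Extension F
]

# Build one character class out of all the blocks.
CJK_WIDE = re.compile(
    "[" + "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in CJK_BLOCKS) + "]+"
)

# The pattern from the question, for comparison.
narrow = re.compile(r"[\u4e00-\u9fa5]+")

# One Extension A character and one Extension B character.
sample = "a\u3400b\U00020000c"

masked_narrow = narrow.sub("#", sample)  # misses both characters
masked_wide = CJK_WIDE.sub("#", sample)  # catches both
```

Python 3 strings are sequences of code points, so the astral characters need no surrogate-pair handling here.
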
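Also note that the unified CJK range covers all Han-based scripts: not only Traditional Chinese but also Simplified Chinese, Japanese (kanji), Korean (hanja) and Vietnamese (chữ nôm). Unicode also contains characters that are not unified across these scripts; for example, U+5169 is a Traditional Chinese character.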
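
A quick check of what this means in practice, as a sketch only: the dictionary below pairs U+5169 with its Simplified Chinese and Japanese counterparts (U+4E24 and U+4E21), and the names forms and matches are made up for the example.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fa5]")  # range from the question

# Three variants of the same character in different writing traditions.
forms = {
    "traditional": "\u5169",  # 兩
    "simplified": "\u4e24",   # 两
    "japanese": "\u4e21",     # 両
}

# Every variant falls inside the range, so a code point range alone
# cannot single out Traditional Chinese text.
matches = {name: bool(CJK.fullmatch(ch)) for name, ch in forms.items()}
print(matches)
```

Separating Traditional from Simplified text would need a character-level mapping table rather than a range test.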