Removing text that matches a pattern from HTML using Python

python html regex

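I am trying to write a Python script that removes text matching a specific pattern from HTML files. However, my code does not seem to work. Could you help me find out what is wrong?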

import os, re

cwd = os.getcwd()
print ('Now you are at this directory: \n' + cwd)

# find files that have an extension with HTML
Files = os.listdir(cwd)
print Files

def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            for line in open(file):
                re.sub(r'<strong>.*?<\/strong>', '', line)
                # I feel the above line has some problems
func(file)

Thanks a lot in advance.
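
You don't have to escape the slash: \/ is really just a plain /. For a complete reference, see the introduction to the re module documentation.

Your regular expression should be:

r'<strong>.*?</strong>'

However, parsing HTML with regular expressions is not recommended. Take a look at BeautifulSoup instead.

line = 'some text, <strong>SOME STRONG TEXT</strong> and again <strong>STRONG TEXT</strong>'
re.sub(r'<strong>.*?</strong>', '', line)
# 'some text,  and again '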
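
To illustrate the parser-based route the answer recommends, here is a minimal sketch. It uses the standard library's html.parser rather than BeautifulSoup, purely so the example has no third-party dependency. The names StrongStripper and remove_strong are made up for this illustration, and the sketch only re-emits the constructs it handles explicitly (tags, text, entities, comments, declarations).

```python
from html.parser import HTMLParser


class StrongStripper(HTMLParser):
    """Re-emit HTML while skipping everything inside <strong> elements."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; unchanged
        super().__init__(convert_charrefs=False)
        self.parts = []  # pieces of the output HTML
        self.depth = 0   # > 0 while inside a <strong> element

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.depth += 1
        elif self.depth == 0:
            # Raw text of the tag, including its attributes
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>
        if self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'strong':
            self.depth = max(self.depth - 1, 0)
        elif self.depth == 0:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append('&#%s;' % name)

    def handle_comment(self, data):
        if self.depth == 0:
            self.parts.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.parts.append('<!%s>' % decl)


def remove_strong(html):
    """Return html with every <strong> element and its contents removed."""
    parser = StrongStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


print(remove_strong(
    '<p>some text, <strong class="x">SOME <em>STRONG</em> TEXT</strong> and again</p>'
))
# <p>some text,  and again</p>
```

Unlike the regex, this also copes with attributes on the tag and with nested <strong> elements, which is the usual argument for using a parser here.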

Hope this helps.

import os, re

cwd = os.getcwd()
print('Now you are at this directory: \n' + cwd)

def func():
    # Rewrite every .html file in the current directory in place
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            f = open(file, "r")
            # re.sub returns the new string, so keep the result
            text = re.sub(r'<strong>.*?</strong>', "", f.read())
            f.close()
            # Write the cleaned text back to the same file
            f = open(file, "w")
            f.write(text)
            f.close()

func()
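
For what it's worth, the reason the script in the question appears to do nothing is that re.sub returns a new string and leaves its argument untouched, so the result has to be stored and written back to the file. Below is a slightly tidier sketch of the same idea. The names strip_strong and clean_html_files are hypothetical, UTF-8 is an assumed file encoding, and re.DOTALL is added on the assumption that a <strong> element may span several lines.

```python
import os
import re

# Non-greedy pattern: each <strong>...</strong> pair is removed separately.
# re.DOTALL lets "." match newlines, so multi-line elements are covered too.
STRONG_RE = re.compile(r'<strong>.*?</strong>', re.DOTALL)


def strip_strong(text):
    """Return text with every <strong>...</strong> span removed."""
    return STRONG_RE.sub('', text)


def clean_html_files(directory):
    """Rewrite each .html file in directory without its <strong> spans."""
    for name in os.listdir(directory):
        if not name.endswith('.html'):
            continue
        path = os.path.join(directory, name)
        # Assumption: the files are UTF-8 encoded
        with open(path, encoding='utf-8') as f:
            text = f.read()
        # The substitution result is a new string; write it back explicitly
        with open(path, 'w', encoding='utf-8') as f:
            f.write(strip_strong(text))
```

Calling clean_html_files(os.getcwd()) would process the current directory, as the script above does. Since the files are overwritten in place, trying it on a copy first seems prudent.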

Thank you Thierry, I will definitely look into BeautifulSoup! As for the regex, I tried both patterns but neither of them worked. If you take my original script and print the matched text, the matches are actually correct. I am just not sure which part of my code prevents the matched strings from being replaced.

Thanks, it worked!! In my case I may need to experiment further and see whether Beautiful Soup is more helpful. :)