Replacing several words in a text with Python

python performance unicode


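I use the code below to remove all HTML tags from a file and convert it to plain text. In addition, I have to convert XML/HTML character entities to ASCII characters. Here, I have 21 lines that each pass over the whole text. This means that if I want to convert a huge file, I have to spend a lot of resources to do it.

Is there any way to improve the efficiency and speed of this code while reducing its resource usage?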

# -*- coding: utf-8 -*-
import re

# This file contains HTML.
file = open('input-file.html', 'r')
temp = file.read()

# Replace Some XML/HTML characters to ASCII ones.
temp = temp.replace ('&lsquo;',"""'""")
temp = temp.replace ('&rsquo;',"""'""")
temp = temp.replace ('&ldquo;',"""\"""")
temp = temp.replace ('&rdquo;',"""\"""")
temp = temp.replace ('&sbquo;',""",""")
temp = temp.replace ('&prime;',"""'""")
temp = temp.replace ('&Prime;',"""\"""")
temp = temp.replace ('&laquo;',"""«""")
temp = temp.replace ('&raquo;',"""»""")
temp = temp.replace ('&lsaquo;',"""‹""")
temp = temp.replace ('&rsaquo;',"""›""")
temp = temp.replace ('&amp;',"""&""")
temp = temp.replace ('&ndash;',""" – """)
temp = temp.replace ('&mdash;',""" — """)
temp = temp.replace ('&reg;',"""®""")
temp = temp.replace ('&copy;',"""©""")
temp = temp.replace ('&trade;',"""™""")
temp = temp.replace ('&para;',"""¶""")
temp = temp.replace ('&bull;',"""•""")
temp = temp.replace ('&middot;',"""·""")

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
print(result)

# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()


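You can use string.translate().

Note that in Python 3, str.translate will be much slower than in Python 2, especially when you translate only a few characters. This is because it has to handle Unicode characters, so it uses a dict to perform the translation rather than indexing into a string.

My first instinct is to combine string.translate() with string.maketrans(), which makes only one pass over the text. Each call to str.replace() makes its own pass over the entire string, and that is what you want to avoid.

For example: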

from string import ascii_lowercase, maketrans, translate

from_str = ascii_lowercase
to_str = from_str[-1]+from_str[0:-1]
foo = 'the quick brown fox jumps over the lazy dog.'
bar = translate(foo, maketrans(from_str, to_str))
print bar # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
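
The example above is Python 2. As a rough sketch of the same rotation in Python 3, assuming the built-in str.maketrans and str.translate replace the removed string-module functions, it also shows that the translation table is a dict keyed by code point, which is the behaviour described above:

```python
from string import ascii_lowercase

# Rotate the alphabet by one position: a -> z, b -> a, ..., z -> y.
from_str = ascii_lowercase
to_str = from_str[-1] + from_str[:-1]

# In Python 3, str.maketrans returns a dict mapping code points to code points.
table = str.maketrans(from_str, to_str)

foo = 'the quick brown fox jumps over the lazy dog.'
bar = foo.translate(table)
print(bar)  # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
```

Like the Python 2 version, this only maps single characters to single characters, so it does not cover multi-character entities such as &lsquo; on its own.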

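The problem with using string.translate() or string.maketrans() is that they require me to map each single character to exactly one other character:

print string.maketrans("abc", "123")

However, I need to map an HTML/XML entity such as &lsquo; to the ASCII single quote ('). That means I would have to use the following code:

print string.maketrans("&lsquo;", "'")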

It fails with the following error:

ValueError: maketrans arguments must have same length
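
The length error arises because maketrans pairs characters one to one. One possible workaround, sketched here in Python 3 and not taken from the thread, is to build a single regular expression from a dict of entities and substitute them all in one pass. The names ENTITIES, ENTITY_RE and replace_entities are hypothetical, and the table holds only a few of the question's entries for illustration:

```python
import re

# Hypothetical subset of the entity table from the question; extend as needed.
ENTITIES = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&sbquo;": ",",
    "&amp;": "&",
}

# One alternation over all keys, escaped so they are matched literally.
ENTITY_RE = re.compile("|".join(re.escape(key) for key in ENTITIES))


def replace_entities(text):
    """Replace every known entity in a single pass over the text."""
    return ENTITY_RE.sub(lambda match: ENTITIES[match.group(0)], text)


sample = "&ldquo;Tom &amp; Jerry&rdquo; &lsquo;ok&rsquo;"
print(replace_entities(sample))  # "Tom & Jerry" 'ok'
```

Because the text is scanned once, the output of one replacement is never rescanned: "&amp;lsquo;" becomes "&lsquo;" and stays that way, whereas a chain of str.replace() calls may decode it a second time depending on the order of the calls.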
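
However, if I use HTMLParser, it converts all HTML/XML entities to their characters without the above problem. I also added an encode('utf-8') to resolve the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 246: ordinal not in range(128)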

# -*- coding: utf-8 -*-
import re
from HTMLParser import HTMLParser

# This file contains HTML.
file = open('input-file.txt', 'r')
temp = file.read()

# Replace all XML/HTML characters to ASCII ones.
temp = HTMLParser.unescape.__func__(HTMLParser, temp)

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)

# Encode the text to UTF-8 for preventing some errors.
result = result.encode('utf-8')
print(result)

# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()
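
The code above targets Python 2. A minimal Python 3 sketch of the same steps follows, assuming the standard-library html.unescape function stands in for HTMLParser.unescape and that both files are UTF-8. The names TAG_RE, html_to_text and convert_file are hypothetical:

```python
import html
import re

# Same non-greedy tag pattern as above, compiled once.
TAG_RE = re.compile(r"<.*?>")


def html_to_text(markup):
    """Decode HTML entities, then strip tags, in the same order as the code above."""
    return TAG_RE.sub("", html.unescape(markup))


def convert_file(src_path, dst_path):
    """Read src_path, convert it to plain text, and write it to dst_path."""
    # Opening with an explicit encoding avoids the manual encode('utf-8') step.
    with open(src_path, "r", encoding="utf-8") as src:
        text = html_to_text(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


print(html_to_text("<p>&ldquo;Hi&rdquo; &amp; bye</p>"))
```

Under these assumptions, a call such as convert_file('input-file.txt', 'output-file.txt') would do the whole conversion in two passes over the text: one for the entities and one for the tags.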