Python: converting an RTF file to unicode?


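I am trying to convert the lines of an RTF file into a series of unicode strings and then run a regular expression match against those lines. (I need them to be unicode so that I can write them out to another file.)

However, my regex matching does not work, and I think it is because the lines are not being converted to unicode correctly.

Here is my code: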

import re

usefulLines = []
textData = {}

# the regex pattern for an entry in the db (e.g. SUF 76,22): it's sufficient
# for us to match on three upper-case characters plus a space
entryPattern = r'^([A-Z]{3})[\s].*$'

f = open('textbase_1a.rtf', 'rU')
fileLines = f.readlines()
f.close()

# get the matching line numbers, and store them in usefulLines
for i, line in enumerate(fileLines):
    #line = line.decode('utf-16be') # this causes an error: I don't really know what file encoding the RTF file is in...
    line = line.decode('mac_roman')
    print line
    if re.match(entryPattern, line):
        # now retrieve the following lines, all the way up until we get a blank line
        print "match: " + str(i)
        usefulLines.append(i)

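At the moment this prints every line but never prints a match, even though some lines should match. Also, for some reason, the lines are printed with "/par" at the start. When I try to print them to the output file, they look very strange.

Part of the problem is that I don't know which encoding to specify. How can I find out?

If I use

entryPattern = '^.*$'

then I do get matches.

Can anyone help?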

You are not even decoding the RTF file. RTF is not just a plain text file. For example, a file containing "äöü" contains the following:

{\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}

{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20\'e4\'f6\'fc\par

}

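That is what you see when the file is opened in a text editor. The characters "äöü" are stored as hex escapes of their windows-1252 byte values, the code page declared at the start of the file (äöü = 0xE4 0xF6 0xFC).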
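To make the encoding point concrete, here is a minimal Python 3 sketch that reads the code page from the \ansicpg control word and decodes \'hh escapes with it. The helper names `declared_codepage` and `decode_hex_escapes` are made up for this illustration, and the sketch assumes the code page number maps to a Python codec named `cp<number>`, which holds for 1252 but not for every code page. It only illustrates how the bytes are encoded; it is not an RTF parser.

```python
# Illustration only: decode the \'hh hex escapes of an RTF fragment using the
# code page declared by \ansicpg. This is not an RTF parser.

def declared_codepage(rtf, default="cp1252"):
    """Return a Python codec name built from the \\ansicpgN control word."""
    marker = "\\ansicpg"
    start = rtf.find(marker)
    if start == -1:
        return default
    pos = start + len(marker)
    digits = ""
    while pos < len(rtf) and rtf[pos].isdigit():
        digits += rtf[pos]
        pos += 1
    # Assumption: the code page number maps to a codec called "cp<number>".
    return "cp" + digits if digits else default


def decode_hex_escapes(fragment, codec):
    """Replace each \\'hh escape in fragment with the character it encodes."""
    out = []
    pending = bytearray()  # bytes collected from consecutive escapes
    i = 0
    while i < len(fragment):
        if fragment.startswith("\\'", i) and i + 4 <= len(fragment):
            pending.append(int(fragment[i + 2:i + 4], 16))
            i += 4
        else:
            if pending:
                out.append(pending.decode(codec))
                pending = bytearray()
            out.append(fragment[i])
            i += 1
    if pending:
        out.append(pending.decode(codec))
    return "".join(out)


sample = r"{\rtf1\ansi\ansicpg1252\deff0 \'e4\'f6\'fc\par}"
codec = declared_codepage(sample)
text = decode_hex_escapes(r"\'e4\'f6\'fc", codec)
print(codec, text)
```

This also answers the "which encoding?" part of the question for this kind of file: the header declares it, so guessing at `utf-16be` or `mac_roman` is unnecessary.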

To read RTF, you first need something that converts RTF to plain text (this has already been asked).
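Once the RTF has been converted to plain text, the matching step from the question becomes straightforward. The following is a minimal Python 3 sketch under that assumption: `plain_text` is hypothetical sample data standing in for the converter's output, and `entry_line_numbers` is a name chosen for this illustration.

```python
import re

# Same idea as the pattern in the question: three upper-case letters
# followed by a whitespace character at the start of the line.
ENTRY_PATTERN = re.compile(r"^[A-Z]{3}\s")


def entry_line_numbers(lines):
    """Return the indices of lines that start an entry (e.g. 'SUF 76,22')."""
    return [i for i, line in enumerate(lines) if ENTRY_PATTERN.match(line)]


# Hypothetical plain text, standing in for the output of an RTF-to-text converter.
plain_text = "SUF 76,22\nsome text \u00e4\u00f6\u00fc\n\nnot an entry\nABC 1,2\n"
lines = plain_text.splitlines()
matches = entry_line_numbers(lines)
print(matches)
```

In Python 3 the lines are already unicode strings, so they can be written to another file by opening it with `open(path, "w", encoding="utf-8")`.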

Don't parse RTF files with regular expressions.

OK, I didn't know that. Thanks a lot.