Python regular expression for mining file contents

python regex

I have a text file that looks like this:

<Author>Marilyn1949
<Content>great way of doing things.
can you provide more info.blah blah blah..  
<Date>Dec 1, 2008...
(file content continues in similar fashion for other authors)"

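I am trying to extract the content section using the code below. Can you help me find what I am missing? My output file just ends up filled with empty [] lists.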

text_file = open("output/out.txt", "w")
for file in os.listdir("./"):
    if glob.fnmatch.fnmatch(file, '*.txt'):
        with open(file, "r") as source:
            L= source.read()                
            pattern =  re.compile(r'<Content>*<Date>')              
            for match in L:
                result = re.findall(r'<Content>.*<Date>', match)
                text_file.write(str(result))
                text_file.write('\n')

The dot character matches any character except a newline. Use the re.DOTALL flag to make it match newlines as well:

result = re.findall(r'<Content>.*<Date>', match, flags=re.DOTALL)
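
As a quick check of that claim, here is a minimal sketch that runs the same pattern with and without the flag. The `sample` string is an assumption modeled on the excerpt in the question, not the asker's real file.

```python
import re

# Sample text modeled on the file excerpt in the question (an assumption).
sample = (
    "<Author>Marilyn1949\n"
    "<Content>great way of doing things.\n"
    "can you provide more info.blah blah blah..\n"
    "<Date>Dec 1, 2008\n"
)

pattern = r'<Content>.*<Date>'

# Without DOTALL, "." cannot cross a newline, so the pattern never reaches <Date>.
without_flag = re.findall(pattern, sample)

# With DOTALL, "." also matches "\n", so the multi-line span is found.
with_flag = re.findall(pattern, sample, flags=re.DOTALL)

print(without_flag)    # []
print(len(with_flag))  # 1
```

The empty list in the first case is the same `[]` that ended up in the asker's output file.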

Also, you probably don't want to capture the tags themselves:

result = re.findall(r'<Content>(.*)<Date>', match, flags=re.DOTALL)
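
To illustrate what the parentheses change, the sketch below compares the two patterns on the same text. When a pattern contains one capturing group, `re.findall` returns only the group's contents. The `sample` string is again an assumed stand-in for the real file.

```python
import re

# Assumed sample in the same shape as the question's file.
sample = (
    "<Content>great way of doing things.\n"
    "can you provide more info.\n"
    "<Date>Dec 1, 2008\n"
)

# No group: each result is the whole match, tags included.
whole = re.findall(r'<Content>.*<Date>', sample, flags=re.DOTALL)

# One group: each result is only the text between the tags.
inner = re.findall(r'<Content>(.*)<Date>', sample, flags=re.DOTALL)

print(whole[0].startswith("<Content>"))  # True
print(inner[0].startswith("<Content>"))  # False
```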

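And to tidy up your example a bit (search the whole file contents in one call, since looping over the string L yields one character at a time, and use the non-greedy .*? so that each match stops at the nearest <Date>):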

with open(file, "r") as source:
    results = re.findall(r'<Content>(.*?)<Date>', source.read(), flags=re.DOTALL)
    text_file.write('\n'.join(results))
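
Putting the pieces together, one possible end-to-end version might look like the sketch below. The function name `extract_contents` and its parameters are hypothetical, introduced only for illustration; the pattern is the one from the answer. It assumes each record has a `<Content>` tag followed later by a `<Date>` tag.

```python
import glob
import os
import re

# Non-greedy group so each match stops at the nearest <Date>.
CONTENT_RE = re.compile(r'<Content>(.*?)<Date>', flags=re.DOTALL)


def extract_contents(source_dir, out_path):
    """Write every <Content>...<Date> body from source_dir/*.txt to out_path.

    Hypothetical helper: returns the number of bodies written.
    """
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    count = 0
    with open(out_path, "w") as text_file:
        for path in sorted(glob.glob(os.path.join(source_dir, "*.txt"))):
            # Never read the output file back in as a source.
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue
            with open(path, "r") as source:
                for body in CONTENT_RE.findall(source.read()):
                    text_file.write(body.strip() + "\n")
                    count += 1
    return count
```

Calling `extract_contents(".", "output/out.txt")` would mirror the paths used in the question. Opening the output file in a `with` block also ensures it is closed, which the original script never did.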

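Thanks for the edit, Tichotraic.
TypeError: expected string or buffer. Do I need to convert to str before writing, or use a dictionary?
Awesome! It works really well. GRC.. thanks for your help. :)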
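
The comment does not say which line raised the TypeError, so the following is only a guess at one common cause: passing a list (for example the result of `readlines()`) to `re.findall`, which expects a single string. The `lines` value here is an assumed example. Joining the list into one string first avoids the error.

```python
import re

# Assumed input: a list of lines, as readlines() would return.
lines = ["<Content>a\n", "<Date>b\n"]

error = None
try:
    # re.findall needs a string, not a list, so this raises TypeError.
    re.findall(r'<Content>(.*?)<Date>', lines, flags=re.DOTALL)
except TypeError as exc:
    error = exc

# Joining the lines back into one string makes the search work.
text = "".join(lines)
results = re.findall(r'<Content>(.*?)<Date>', text, flags=re.DOTALL)

print(type(error).__name__)  # TypeError
print(results)               # ['a\n']
```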