Python: how to parse semi-structured text (cran.all.1400)

python parsing

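I need to process a text file.

It is a collection of article abstracts, and each article comes with a few conventional metadata fields. It has the following form: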

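.I 1
.T
experimental investigation of the aerodynamics of a wing in a slipstream .
.A
brenckman,m.
.B
j. ae. scs. 25, 1958, 324.
.W
//lots of text
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small viscosity .
.A
ting-yili
.B
department of aeronautical engineering, rensselaer polytechnic institute troy, n.y.
.W
//lots of text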

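and so on.

What I need is the data organized like this:

article 1: .T="whatever the title of article 1 is", .A="whoever the author is", .B="whatever the .B entry is", .W="all the text"
article 2: .T="whatever the title is", .A="whoever the author is", .B="whatever the .B entry is", .W="all the text"

How would I do this in Python?

Thank you for your time.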

Your idea, from the comments, of splitting on .I seems like a good start.

The following seems to work:

with open('crantest.txt') as f:
    articles = f.read().split('\n.I')

def process(i, article):
    article = article.replace('\n.T\n','.T=')
    article = '.T=' + article.split('.T=')[1] #strips off the article number, restored below
    article = article.replace('\n.A\n',',.A=')
    article = article.replace('\n.B\n',',.B=')
    article = article.replace('\n.W\n',',.W=')
    return 'article ' + str(i) + ':' + article

data = [process(i+1, article) for i,article in enumerate(articles)]

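I created a test file containing only the first 10 articles (discarding a small header and everything from .I 11 onward). When I run the code above, I get a list of length 10. It is important that the very first line of the file begins with .I (with no preceding newline), because I make no effort to test whether the first entry of the split is empty. The first entry in the list is a string that begins: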

article 1:.T=experimental investigation of the aerodynamics of a\nwing in a slipstream .,.A=brenckman,m.,.B=j. ae. scs. 25, 1958, 324.,.W=experimental investigation of the aerodynamics of a\nwing in a slipstream
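
If the file might not begin exactly with .I (for instance, if the header is left in place), one option is to split with a regular expression anchored at the start of each line and drop whatever comes before the first marker. The following is only a sketch under that assumption; the helper name split_articles and the sample text are made up for illustration. Note that this variant consumes the whole ".I <number>" line, so the article numbers would still have to be restored with enumerate as above.

```python
import re

def split_articles(text):
    """Split raw text into article chunks, one per '.I <number>' line.

    Hypothetical helper: anything before the first '.I' line
    (a header, blank lines) is discarded.
    """
    # MULTILINE makes ^ match at the start of every line, not just the string.
    chunks = re.split(r'^\.I[ \t]+\d+[ \t]*\n', text, flags=re.MULTILINE)
    return chunks[1:]  # chunks[0] is whatever came before the first marker

sample = "some header\n.I 1\n.T\ntitle one\n.A\nauthor one\n.I 2\n.T\ntitle two\n"
articles = split_articles(sample)
print(len(articles))
```

With this approach the first real article is always articles[0], whether or not the file starts with a header.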

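On edit: here is a dictionary version which uses partition to successively pull off the relevant chunks. It returns a dictionary of dictionaries rather than a list of strings: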

with open('crantest.txt') as f:
    articles = f.read().split('\n.I')

def process(article):
    article = article.split('\n.T\n')[1]
    T, _, article = article.partition('\n.A\n')
    A, _, article = article.partition('\n.B\n')
    B, _, W = article.partition('\n.W\n')
    return {'T':T, 'A':A, 'B':B, 'W':W}

data = {(i+1):process(article) for i,article in enumerate(articles)}

For example:

>>> data[1]
{'A': 'brenckman,m.', 'T': 'experimental investigation of the aerodynamics of a\nwing in a slipstream .', 'B': 'j. ae. scs. 25, 1958, 324.', 'W': 'experimental investigation of the aerodynamics of a\nwing in a slipstream .\n  an experimental study of a wing in a propeller slipstream was\nmade in order to determine the spanwise distribution of the lift\nincrease due to slipstream at different angles of attack of the wing\nand at different free stream to slipstream velocity ratios .  the\nresults were intended in part as an evaluation basis for different\ntheoretical treatments of this problem .\n  the comparative span loading curves, together with\nsupporting evidence, showed that a substantial part of the lift increment\nproduced by the slipstream was due to a /destalling/ or\nboundary-layer-control effect .  the integrated remaining lift\nincrement, after subtracting this destalling lift, was found to agree\nwell with a potential flow theory .\n  an empirical evaluation of the destalling effects was made for\nthe specific configuration of the experiment .'}

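s.partition(sep) returns a 3-tuple: the part of s before the first occurrence of sep, sep itself, and the part of s after it. The underscore (_) in the code is a Python idiom that emphasizes that the intent is to discard that part of the return value.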
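
To make the behaviour of partition concrete, here is a small illustration on a made-up record (the strings are placeholders, not data from the collection). It also shows what happens when the separator is absent, which is worth knowing if some article lacks one of the fields.

```python
record = "title line\n.A\nauthor line\n.B\nsource line"

# partition returns (before, separator, after) for the first occurrence.
title, sep, rest = record.partition('\n.A\n')
author, _, source = rest.partition('\n.B\n')  # "_" marks the separator as unused

# If the separator is missing, the whole string comes back as the first
# element and the other two elements are empty strings.
whole, missing_sep, tail = "no markers here".partition('\n.W\n')

print(repr(title), repr(author), repr(source))
print(repr(whole), repr(missing_sep), repr(tail))
```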
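What have you tried? It looks like your keywords consist of a dot, an uppercase letter and optional attributes, alone on a line, followed by regular lines. Just process the file line by line, and if you get stuck somewhere, come back here and ask a more precise question.

I tried reading the whole file as a single string (using read) and then splitting that string with .I as the delimiter. That gives me a list of articles (with an empty element at the start, but I can manage that). Now I need to break up the articles by the other tags/keywords while still knowing which element belongs to which article. I think I need a dictionary or a table/2D array. If I process the text line by line, I don't know how to put the lines in the right place.

This is very close to what I need. However, I'm afraid my question wasn't clear enough. I need to split the strings as well. I need some kind of data structure from which I can access, for example, the .W part of article 7. A list of lists or something like that. I'm still not sure I'm being clear. But "for i, article in enumerate(articles)" is a big help, and I think I can use it to get where I need to go. Thank you very much.

It sounds like you want a list of dictionaries: one dictionary per article, each with the keys 'T', 'A', 'B', and 'W'.

Yes, that sounds right! :) Later today I will try to modify your code to do this; I need to finish some other work first. Thanks for your help.

I have added a second, dictionary-based approach.

Great! This is exactly what I need, perfect! Thank you very much!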
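
For the line-by-line approach suggested in the comments, the difficulty raised there (knowing where each line belongs) can be handled by remembering the current article and the current field while reading. The following is a minimal sketch, not the method used in the answer: the function name parse_cran and the sample lines are made up, and it assumes every marker sits alone at the start of a line and that .I is followed by an integer article number.

```python
def parse_cran(lines):
    """Parse lines in the .I/.T/.A/.B/.W layout into {number: {field: text}}.

    Hypothetical helper: lines that appear before the first '.I' marker
    or before any field marker are ignored.
    """
    fields = {'.T': 'T', '.A': 'A', '.B': 'B', '.W': 'W'}
    data = {}
    current = None  # dict for the article currently being filled in
    field = None    # key of the field currently being collected
    for raw in lines:
        line = raw.rstrip('\n')
        if line.startswith('.I '):
            # A new article starts: create its record, keyed by its number.
            current = {name: [] for name in fields.values()}
            data[int(line.split()[1])] = current
            field = None
        elif line.strip() in fields and current is not None:
            field = fields[line.strip()]
        elif current is not None and field is not None:
            current[field].append(line)
    # Join the collected lines of each field back into one string.
    return {num: {k: '\n'.join(v) for k, v in art.items()}
            for num, art in data.items()}

sample = """.I 1
.T
first title
.A
first author
.B
first source
.W
first body line one
first body line two
.I 2
.T
second title
.A
second author
.B
second source
.W
second body
""".splitlines()

data = parse_cran(sample)
print(data[2]['W'])  # the .W part of article 2
```

Because a file object yields its lines when iterated, the same function could be called as parse_cran(f) inside a "with open('crantest.txt') as f:" block, after which data[7]['W'] would be the .W part of article 7.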