Python: Parsing HTML-style text annotations into a list of dictionaries

python html parsing nlp


I currently have the following problem:

Given a string

"<a>annotated <b>piece</b></a> of <c>text</c>"

My previous attempt looked something like this:

import re
from itertools import chain

# Assumed: word_tokenize is NLTK's tokenizer (the original does not show its imports)
from nltk.tokenize import word_tokenize

open_tag = r'<[a-z0-9_]+>'
close_tag = r'</[a-z0-9_]+>'
tag_def = "(" + open_tag + "|" + close_tag + ")"

def tokenize(text):
    """
    Takes a string and converts it to a list of words or tokens
    For example "<a>foo</a>, of" -> ['<a>', 'foo', '</a>', ',', 'of']
    """
    # The capturing group in tag_def makes re.split keep the tags in the result
    tokens_by_tag = re.split(tag_def, text)
    def split_piece(piece):
        # Tags stay whole; everything else goes through the word tokenizer
        if not re.match(tag_def, piece):
            return word_tokenize(piece)
        else:
            return [piece]
    return list(chain.from_iterable(split_piece(piece) for piece in tokens_by_tag))

def annotations(tokens):
    """
    Process tokens into a list with {word : [tokens]} items
    """
    mapping = []
    curr = []  # names of the tags that are currently open
    for token in tokens:
        if re.match(open_tag, token):
            curr.append(re.match(r'<([a-z0-9_]+)>', token).group(1))
        elif re.match(close_tag, token):
            tag = re.match(r'</([a-z0-9_]+)>', token).group(1)
            try:
                curr.remove(tag)
            except ValueError:
                # Closing tag without a matching open tag: ignore it
                pass
        else:
            mapping.append({token: list(curr)})
    return mapping


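Unfortunately, this has a flaw, since

(n=54)

resolves to

{"n=54": []}

whereas

(n=<n>52</n>)

resolves to

[{"n=": []}, {52: ["n"]}]

so the two lists differ in length, which makes it impossible to merge the two later.

Is there a good strategy for parsing HTML/SGML-style annotations so that two differently annotated (but otherwise identical) strings yield lists of the same size?

Note that I am well aware that regexps are not well suited to this kind of parsing, but that is not the problem in this case.

EDIT: Fixed an error in the example.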
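
One possible way to keep the lists the same length, sketched here under assumptions rather than taken from the question or its answer: strip the tags first while recording which tags cover each character, tokenize only the plain text, and then look up the tags for each token by character offset. Tokenization then never sees the tags, so an input such as `(n=<n>52</n>)`, where `52` carries an `n` annotation, splits exactly like `(n=54)`. The names `strip_tags` and `annotate_tokens` are hypothetical, and the whitespace pattern `TOKEN` is only a stand-in for whatever tokenizer is really used.

```python
import re

TAG = re.compile(r"<(/?)([a-z0-9_]+)>")
# Assumption: a plain whitespace tokenizer stands in for the real word tokenizer.
TOKEN = re.compile(r"\S+")


def strip_tags(annotated):
    """Return the text without tags, plus the open tags for each character."""
    plain = []
    tags_per_char = []
    open_tags = []
    pos = 0
    for match in TAG.finditer(annotated):
        # Copy the text before this tag, remembering which tags are open.
        for ch in annotated[pos:match.start()]:
            plain.append(ch)
            tags_per_char.append(tuple(open_tags))
        closing, name = match.group(1), match.group(2)
        if closing:
            if name in open_tags:
                open_tags.remove(name)
        else:
            open_tags.append(name)
        pos = match.end()
    # Copy whatever follows the last tag.
    for ch in annotated[pos:]:
        plain.append(ch)
        tags_per_char.append(tuple(open_tags))
    return "".join(plain), tags_per_char


def annotate_tokens(annotated):
    """Tokenize the tag-free text, then attach every tag that touches a token."""
    plain, tags_per_char = strip_tags(annotated)
    result = []
    for match in TOKEN.finditer(plain):
        tags = []
        for i in range(match.start(), match.end()):
            for tag in tags_per_char[i]:
                if tag not in tags:
                    tags.append(tag)
        result.append({match.group(): tags})
    return result


if __name__ == "__main__":
    print(annotate_tokens("(n=54)"))
    print(annotate_tokens("(n=<n>52</n>)"))
```

The trade-off in this sketch is that a token only partly covered by a tag receives that tag as a whole; whether that is acceptable depends on how the lists are merged afterwards.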

Your xml data (or html) is not well-formed.

Assuming the input file contains the following well-formed xml data:

<root><a>annotated <b>piece</b></a> of <c>text</c></root>

Run it like this:

python3 script.py xmlfile

This yields:

[
    {'annotated ': ['root', 'a']}, 
    {'piece': ['root', 'a', 'b']}, 
    {' of ': ['root']}, 
    {'text': ['root', 'c']}
]
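
The script itself does not appear above. As a minimal sketch only, and not necessarily what the answer used, the output shown can be reproduced with the standard-library `xml.etree.ElementTree` module by walking the tree and recording the chain of enclosing tags for each piece of text. The function name `annotate_xml` is hypothetical.

```python
import sys
import xml.etree.ElementTree as ET
from pprint import pprint


def annotate_xml(xml_text):
    """Return a list of {text: [enclosing tag names]} items in document order."""
    root = ET.fromstring(xml_text)
    result = []

    def walk(elem, path):
        path = path + [elem.tag]
        # Text directly after the opening tag belongs to this element.
        if elem.text:
            result.append({elem.text: path})
        for child in elem:
            walk(child, path)
            # Text after a child's closing tag belongs to the parent element.
            if child.tail:
                result.append({child.tail: path})

    walk(root, [])
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 script.py xmlfile
    with open(sys.argv[1], encoding="utf-8") as handle:
        pprint(annotate_xml(handle.read()))
```

Whitespace is kept exactly as it appears in the file, which is why the keys `'annotated '` and `' of '` retain their spaces; call `strip()` on the text if that is not wanted.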

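What happened to the word "annotated" in the result list?

Why is "some" bound to the "a" annotation?

I think you made a mistake... "some" is inside the "a" annotation.

Thanks, @frankieliuzzi and rob, I corrected the error.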
