Python regex to match a comma-separated key=value list where the value can contain HTML?

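I am trying to match a comma-separated list of key=value pairs, where the value may well contain a lot of things.

The pattern I am using is based on this:

split_up_pattern = re.compile(r'([^=]+)=([^=]+)(?:,|$)', re.X|re.M)

However, it causes problems when the value contains HTML.

Here is an example script: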

import re

text = '''package_contents=<p>The basic Super&nbsp;1050 machine includes the following:</p>
<p>&nbsp;</p>
<table style="" height: 567px;"" border=""1"">
<tbody>
<tr>
<td style=""width: 200px;"">
<ul>
<li>uper 1150 machine</li>
</ul>
</td>
<td>&nbsp;With dies fitted.
<ul>
<li>The Super 1050</li>
</ul>
</td>
</tr>
</tbody>
<table>,second_attribute=something else'''

split_up_pattern = re.compile(r'([\w_^=]+)=([^=]+)(?:,|$)', re.X|re.M)

matches = split_up_pattern.findall(text)

import ipdb; ipdb.set_trace()

print(matches)

matches[1] is:

('second_attribute', 'something else')
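
The failure can be reproduced on a much smaller input. The sketch below uses a shortened, made-up value (not the original export) and a simplified key group: because the value group `[^=]+` cannot cross an `=`, the `=` inside the HTML attribute ends the first candidate match, and the regex engine then picks up the attribute name as if it were a key.

```python
import re

# Shortened stand-in for the real data (hypothetical sample, not the original export).
sample = 'a=<td style="x">1</td>,b=2'

# Same shape as the pattern in the question: the value may not contain '='.
naive_pattern = re.compile(r'(\w+)=([^=]+)(?:,|$)')

naive_matches = naive_pattern.findall(sample)
print(naive_matches)
# [('style', '"x">1</td>'), ('b', '2')]
# The real key 'a' is lost, and the HTML attribute 'style' is reported as a key.
```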

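Instead of parsing purely on the delimiters (comma or equals sign), you can exploit the fact that the next key-value pair begins with the following:

,WORD=

Here is a sketch of the idea: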

import re

text = '''...your example...'''

# Start of the string or our ,WORD= pattern.
rgx_spans = re.compile(r'(\A|,)\w+=')

# Get the start-end positions of all matches.
spans = [m.span() for m in rgx_spans.finditer(text)]

# Use those positions to break up the string into parsable chunks.
for i, s1 in enumerate(spans):
    try:
        s2 = spans[i + 1]
    except IndexError:
        s2 = (None, None)

    start = s1[0]
    end = s2[0]
    key, val = text[start:end].lstrip(',').split('=', 1)

    print()
    print(s1, s2)
    print((key, val))
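
The same boundary idea can probably be folded into a single pattern with a lookahead. This is only a sketch, and it rests on two assumptions: keys consist solely of word characters, and no value contains a `,WORD=` sequence of its own. The function name `split_pairs` is made up for illustration.

```python
import re

# A pair starts at the beginning of the string or after a comma; its value runs
# lazily until the next ',WORD=' or the end of the string. re.S lets '.' cross
# newlines, which multi-line HTML values need.
PAIR = re.compile(r'(?:\A|,)(\w+)=(.*?)(?=,\w+=|\Z)', re.S)

def split_pairs(text):
    """Return a list of (key, value) tuples (hypothetical helper)."""
    return PAIR.findall(text)

example = 'a=<td style="x">1, 2</td>,b=2'
print(split_pairs(example))
# [('a', '<td style="x">1, 2</td>'), ('b', '2')]
```

Commas and equals signs inside a value are tolerated as long as they do not form a `,WORD=` sequence. If that sequence can occur inside a value, this approach misparses just as the span-based loop does.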

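This is a tricky one. How are you getting this data in the first place? Is there any way to mark it up at the source? If there is no guarantee that the delimiters will not appear in the values, all bets are off for regex parsing (and most other approaches). You need to find a delimiter that will never appear in the values.

You are using the re.M flag for a multi-line regex. I suspect this is a mistake. If commas do not appear anywhere in the HTML, change your search to split_up_pattern = re.compile(r'([^=]+)=([^,]+)(?:,|$)').

@Phylogenesis Unfortunately, commas do appear.

@fflegging I am exporting from an e-commerce website, so the output is in csv format. However, one column merges all of the additional attributes, as shown above.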
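
Since the data comes from a CSV export, one option is to let the `csv` module undo the CSV quoting first (which also turns the doubled `""` back into `"`), and only then split the merged column on the `,WORD=` boundaries. The following is a rough sketch: the column name `additional_attributes`, the sample row, and the helper `parse_attributes` are all assumptions made for illustration, not details taken from the actual export.

```python
import csv
import io
import re

# Hypothetical two-column export; the real file layout is not known.
raw = (
    'sku,additional_attributes\n'
    '1050,"package_contents=<table border=""1""><tr><td>a, b</td></tr></table>,'
    'second_attribute=something else"\n'
)

# A key starts at the beginning of the cell or right after a comma.
PAIR_START = re.compile(r'(?:\A|,)(\w+)=')

def parse_attributes(cell):
    """Split one merged cell into a dict (assumes values never contain ',WORD=')."""
    starts = list(PAIR_START.finditer(cell))
    result = {}
    for i, m in enumerate(starts):
        # The value ends where the next ',WORD=' begins, or at the end of the cell.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(cell)
        result[m.group(1)] = cell[m.end():end]
    return result

rows = [parse_attributes(row['additional_attributes'])
        for row in csv.DictReader(io.StringIO(raw))]
print(rows[0])
```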
