Unable to successfully split an HTML-like string using Python re with lookarounds

Tags: python, regex, regex-lookarounds

I think you should use a Python module for this.

It looks like I solved my own problem: by passing flags=re.DOTALL to re.split(), I get the expected output. So patt[4] it is.
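A minimal sketch of why the flag matters, assuming a shortened sample string (the names sample and split_fields are illustrative and not part of the question's code). By default `.` does not match a newline, so the lazy `.*?` in patt[4] cannot get from an opening `<t ...>` to its closing `</t>` when the segment contains `\n`. With re.DOTALL it can.

```python
import re

# patt[4] from the question, unchanged.
PATT4 = r'(<t(?:ag)?.*?(?<=/)(?:t(?:ag)?)?>)'

# Shortened, illustrative sample (an assumption, not the question's full text).
sample = ("start: <t siz=6>Size 6\nNew Line </t>"
          "<t bitmap=info/> and <t fg=red>red/and/yellow</t>")


def split_fields(text, flags=0):
    """Split on PATT4 and drop the empty strings left between adjacent matches."""
    return [f for f in re.split(PATT4, text, flags=flags) if f]


without_dotall = split_fields(sample)
with_dotall = split_fields(sample, flags=re.DOTALL)

# Without the flag, the segment containing '\n' stays glued to 'start: '.
print(len(without_dotall))  # 4
# With the flag, every segment is split out on its own.
print(len(with_dotall))     # 5
```

The same flag can be added to each re.split() call in the test class below; the pattern itself does not need to change.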

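Not all of the segments are wrapped in HTML tags, so I don't think BS4 would work.

You could combine it with what you are already doing. I think it would help you avoid your errors.
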
'''
Trying to split string 'text' below so that all segments are found and split.
There are three flavors of segments:
    A:    <tag attr1="one" attr2>some text</tag>
    B:    <tag attr1="one" attr2 text="some text"/>
    C:    some tagless text
I can't use an HTML parser, because C segments are in plain English.

I would like to be able to accept '/' as part of the text.
So, '/' is just a char unless it precedes '>' or comes before 'tag'.
In a perfect world, I would also like to accept '<' and '>', too.
So '<' is just a char unless it precedes 't' or '/'.

Currently I use patt[1] to split text with A and C segments, but now I want to be
able to split strings with B segments, too. Also, patt[1] does not allow '/' to be
passed in as a regular char without breaking the routine. Although it seems 
to work, I don't like that the regexp limits matching according to '/'.

The goal is to properly split the text string into the expected output strings. Can someone 
tell me what I'm doing wrong?  Running the regexp on https://pythex.org/ looks okay, but it 
still doesn't run properly under CPython 3.6 - 3.9.
'''
import pprint, re, sys, unittest

class TestClass(unittest.TestCase):

    sub = False # True #
    patt = {}
    patt[1] = r'(<t(?:ag)?\s*[^>]*>[^>]*</t(?:ag)?>)' # orig, works for A and C but not B

    patt[2] = r'(<t(?:ag)?.*?(?:/t(?:ag)?>|/>))' # misses one
    #           1  2    2    3    4    4     31
    patt[3] = r'(<t(?:ag)?[^/]*/t?>)' # misses one
    #           1  2    2          1
    patt[4] = r'(<t(?:ag)?.*?(?<=/)(?:t(?:ag)?)?>)' # with lookarounds. still misses one.
    patt[5] = r'(<t(?:ag)?(?!/>)(?!/t).*?(?<=/)(?:t(?:ag)?)?>)' # with lookarounds. tried adding negs
    patt[6] = r'(<t(?:ag)?.*?(?!/>)(?!/t)(?<=/)(?:t(?:ag)?)?>)'

    text = "start: <t s=10 B=1>Size 10 bold</t><t siz=6>Size 6\nNew Line </t>" \
           "<t u w=bold>Underlined and Bolded\n</t><t it>Italics</t><t>default</t>" \
           '<t fam="Courier New" siz=18>Courier 18</t>' \
           "<t bitmap=question/><t bitmap=info/> and <t fg=red>red/and/yellow</t>" \
           "<t fg=red>lesser < or greater > or both <></t><t>bye</t>"

    if sub:
        text = re.sub(r"/>", "></t>", text)
    expected_output = ['start: ', '<t s=10 B=1>Size 10 bold</t>', '<t siz=6>Size 6\nNew Line </t>',
         '<t u w=bold>Underlined and Bolded\n</t>', '<t it>Italics</t>', '<t>default</t>',
         '<t fam="Courier New" siz=18>Courier 18</t>',
         '<t bitmap=question/>', '<t bitmap=info/>', ' and ', '<t fg=red>red/and/yellow</t>',
         "<t fg=red>lesser < or greater > or both <></t>", '<t>bye</t>', ] 
    if sub:
        expected_output = expected_output[:7] + ['<t bitmap=question></t>', '<t bitmap=info></t>'] + expected_output[9:]
    exp_len = len(expected_output)

    def test_patt1(self, i=1):
        fields = [f for f in re.split(TestClass.patt[i], TestClass.text) if f]
        print(f'\nPATT[{i}] {TestClass.patt[i]!r} gives {len(fields)} of the {TestClass.exp_len} expected fields:')
        print(pprint.pformat(fields), '\n\n')
        self.assertEqual(fields, TestClass.expected_output)

    def test_patt2(self, i=2):
        fields = [f for f in re.split(TestClass.patt[i], TestClass.text) if f]
        print(f'\nPATT[{i}] {TestClass.patt[i]!r} gives {len(fields)} of the {TestClass.exp_len} expected fields:')
        print(pprint.pformat(fields), '\n\n')
        self.assertEqual(fields, TestClass.expected_output)

    def test_patt3(self, i=3):
        fields = [f for f in re.split(TestClass.patt[i], TestClass.text) if f]
        print(f'\nPATT[{i}] {TestClass.patt[i]!r} gives {len(fields)} of the {TestClass.exp_len} expected fields:')
        print(pprint.pformat(fields), '\n\n')
        self.assertEqual(fields, TestClass.expected_output)

    def test_patt4(self, i=4):
        fields = [f for f in re.split(TestClass.patt[i], TestClass.text) if f]
        print(f'\nPATT[{i}] {TestClass.patt[i]!r} gives {len(fields)} of the {TestClass.exp_len} expected fields:')
        print(pprint.pformat(fields), '\n\n')
        self.assertEqual(fields, TestClass.expected_output)

    def test_patt5(self, i=5):
        fields = [f for f in re.split(TestClass.patt[i], TestClass.text) if f]
        print(f'\nPATT[{i}] {TestClass.patt[i]!r} gives {len(fields)} of the {TestClass.exp_len} expected fields:')
        print(pprint.pformat(fields), '\n\n')
        self.assertEqual(fields, TestClass.expected_output)

    def test_patt6(self, i=6):
        fields = [f for f in re.split(TestClass.patt[i], TestClass.text) if f]
        print(f'\nPATT[{i}] {TestClass.patt[i]!r} gives {len(fields)} of the {TestClass.exp_len} expected fields:')
        print(pprint.pformat(fields), '\n\n')
        self.assertEqual(fields, TestClass.expected_output)


if __name__ == '__main__':
    unittest.main()