Splitting text on Roman numerals with a regular expression in Python

python regex

I need to split a text on Roman numerals.
Here is my text:

This is the part (a) of question number one. i. This is sub part one of part (a) question one ii. This is sub part two of part (a) question one iii. This is sub part three of part (a) question one

This is actually part of an exam question. I would like it to be broken up as follows:

This is the part (a) of question number one.
This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one

What I want is to split the sentence at each Roman numeral.
This is the regex I wrote:

import re

text = 'This is the part (a) of question number one. i. This is sub part one of part (a) question one ii. This is sub part two of part (a) question one iii. This is sub part three of part (a) question one'
for m in re.split(r' [a-z]+\. ', text):
    print(m)

This is what I get:

This is the part (a) of question number one.
i. This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one

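My expression works for Roman numerals two and three, but not for Roman numeral one.
So I need a general expression that works for any Roman numeral.
One important point: each Roman numeral is preceded by a space and followed by a full stop and then a space.
Can anyone help me solve this?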

Your regex also captures the substring ' one. '. Try changing it as follows:

text = 'This is the part (a) of question number one. i. This is sub part one of part (a) question one ii. This is sub part two of part (a) question one iii. This is sub part three of part (a) question one'

for m in re.split(r' [MDCLXVI]+\. ', text, flags=re.IGNORECASE):
    print(m)

That is not what I get. Check your first line again. I get

This is the part (a) of question number

because your regex matches "one".

Works for me.
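To see where the two patterns differ, it can help to list what each one actually matches instead of looking only at the split result. The following is a minimal sketch using re.findall on the text from the question; the variable names original_matches and roman_matches are just illustrative.

```python
import re

text = ('This is the part (a) of question number one. i. This is sub part one '
        'of part (a) question one ii. This is sub part two of part (a) question one '
        'iii. This is sub part three of part (a) question one')

# The original pattern accepts any run of lowercase letters,
# so the ordinary word in " one. " qualifies as a separator.
original_matches = re.findall(r' [a-z]+\. ', text)

# Restricting the character class to Roman numeral letters skips " one. ".
roman_matches = re.findall(r' [MDCLXVI]+\. ', text, flags=re.IGNORECASE)

print(original_matches)  # [' one. ', ' ii. ', ' iii. ']
print(roman_matches)     # [' i. ', ' ii. ', ' iii. ']
```

Because " one. " consumes the space that " i. " would need, and matches cannot overlap, the original pattern never gets a chance to match the first numeral.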

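If you want proper romanette numerals (lowercase Roman numerals are often called "romanettes"), they are easy to generate. Mark Pilgrim provides a variety of Roman numeral utilities in his book Dive Into Python.

This is the one that generates Roman numerals: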

class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

def toRoman(n):
    """convert integer to Roman numeral"""
    if not (0 < n < 5000):
        raise OutOfRangeError("number out of range (must be 1..4999)")
    if int(n) != n:
        raise NotIntegerError("decimals can not be converted")
    romanNumeralMap = (('M',  1000), ('CM', 900), ('D',  500), ('CD', 400), ('C',  100), ('XC', 90),
       ('L',  50), ('XL', 40), ('X',  10), ('IX', 9), ('V',  5), ('IV', 4), ('I',  1))
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

It can be used to generate a pattern for the Roman numerals up to 20 and put that into a regex:

>>> pat = r' (?:' + '|'.join([toRoman(i).lower() for i in range(1, 21)]) + r')\. '
>>> pat
' (?:i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\\. '

You can then split the text:

>>> print('\n'.join(re.split(pat, text)))
This is the part (a) of question number one.
This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one
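One reason to prefer an explicit list of numerals over a character class such as [MDCLXVI]+ is that some ordinary words consist only of those letters. The sketch below is an illustration under assumptions: the sample sentence is made up, and to_roman is a hypothetical stand-in for the converter shown above so that the block runs on its own.

```python
import re

def to_roman(n):
    """Hypothetical helper: convert a small positive integer to a Roman numeral."""
    pairs = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400), ('C', 100), ('XC', 90),
             ('L', 50), ('XL', 40), ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))
    result = ''
    for numeral, value in pairs:
        while n >= value:
            result += numeral
            n -= value
    return result

# Made-up sample: "did" contains only letters that are also Roman numeral letters.
sample = 'Here is what we did. i. First item ii. Second item'

class_pat = r' [MDCLXVI]+\. '
list_pat = r' (?:' + '|'.join(to_roman(i).lower() for i in range(1, 21)) + r')\. '

class_parts = re.split(class_pat, sample, flags=re.IGNORECASE)
list_parts = re.split(list_pat, sample)

print(class_parts)  # ['Here is what we', 'i. First item', 'Second item']
print(list_parts)   # ['Here is what we did.', 'First item', 'Second item']
```

The character class treats " did. " as a separator, which reproduces the same kind of problem as in the question, while the generated alternation only accepts i through xx.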

Alternatively, you can use a verbose Roman numeral pattern with re.split:

>>> pat=re.compile('''\
... [ ]                 # one space
... m{0,4}              # thousands - 0 to 4 M's
... (?:cm|cd|d?c{0,3})  # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
...                     #            or 500-800 (D, followed by 0 to 3 C's)
... (?:xc|xl|l?x{0,3})  # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
...                     #        or 50-80 (L, followed by 0 to 3 X's)
... (?:ix|iv|v?i{0,3})  # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
...                     #        or 5-8 (V, followed by 0 to 3 I's)
... [.][ ]                # full stop then a space''', re.X)
>>> print('\n'.join(pat.split(text)))
This is the part (a) of question number one.
This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one
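One caveat with the verbose pattern above: every numeral group is optional, so it also matches an empty numeral, that is, a bare " . ". If that matters for your text, one possible tweak (my assumption, not part of the pattern above) is a lookahead that requires at least one numeral letter after the space. The names NUMERAL, loose_pat and strict_pat are illustrative.

```python
import re

# Shared numeral body, same structure as the verbose pattern above.
NUMERAL = r'''
    m{0,4}              # thousands
    (?:cm|cd|d?c{0,3})  # hundreds
    (?:xc|xl|l?x{0,3})  # tens
    (?:ix|iv|v?i{0,3})  # ones
'''

# Same behaviour as the pattern above: the numeral may be empty.
loose_pat = re.compile(r'[ ]' + NUMERAL + r'[.][ ]', re.X)

# Assumption: the lookahead insists on at least one numeral letter.
strict_pat = re.compile(r'[ ](?=[mdclxvi])' + NUMERAL + r'[.][ ]', re.X)

print(loose_pat.split('wait . then'))                   # ['wait', 'then']
print(strict_pat.split('wait . then'))                  # ['wait . then']
print(strict_pat.split('intro. iv. four xii. twelve'))  # ['intro.', 'four', 'twelve']
```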

Wow, that works great. Can you tell me what was wrong with my code?

@Punuth Your regex also captures the substring 'one'.
