How to do CamelCase split in Python

What I want to achieve is something like this:

>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']
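So I searched and found this:

In summary, one can say that @kalefranz's solution does not match the question (see the last case), and @casimir et hippolyte's solution eats a single space and thereby violates the idea that a split should not change the individual parts. The only difference between the remaining two alternatives is that my solution returns a list containing an empty string for empty-string input, whereas @200_success's solution returns an empty list. I don't know where the Python community stands on that issue, so I am fine with either one. Since 200_success's solution is simpler, I accepted it as the correct answer.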

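The Python documentation for re.split says (this describes the behavior before Python 3.7):

Note that split will never split a string on an empty pattern match.

See this: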

>>> re.findall("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['', '']
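
For comparison, a minimal sketch of what happens on newer interpreters: since Python 3.7, re.split() does split on empty matches, so the lookaround pattern from the question can be used directly. The helper name split_on_boundaries is made up for this illustration.

```python
import re

# Zero-width boundary pattern from the question: a lowercase-to-uppercase
# transition, or an uppercase letter followed by an uppercase+lowercase pair.
BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_on_boundaries(identifier):
    """Hypothetical helper: split at the zero-width boundaries.

    Assumes Python 3.7 or later, where re.split() honors empty matches.
    """
    return BOUNDARY.split(identifier)


if __name__ == "__main__":
    for sample in ("CamelCaseXYZ", "XYZCamelCase"):
        print(sample, "->", split_on_boundaries(sample))
```

Note that this variant returns a list with one empty string for empty input, because re.split() always returns at least one element.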

Here is another solution that requires less code and no complicated regular expressions:

def camel_case_split(string):
    bldrs = [[string[0].upper()]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]
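Edit: The code above contains an optimization that avoids rebuilding the entire string with every appended character. Leaving out that optimization, a simpler version (with comments) might look like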

def camel_case_split2(string):
    # set the logic for creating a "break"
    def is_transition(c1, c2):
        return c1.islower() and c2.isupper()

    # start the builder list with the first character
    # enforce upper case
    bldr = [string[0].upper()]
    for c in string[1:]:
        # get the last character in the last element in the builder
        # note that strings can be addressed just like lists
        previous_character = bldr[-1][-1]
        if is_transition(previous_character, c):
            # start a new element in the list
            bldr.append(c)
        else:
            # append the character to the last string
            bldr[-1] += c
    return bldr
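
As a sketch of how this character-by-character approach behaves on the inputs from the question, here is a variant that keeps the first letter unchanged and guards against empty input. The function name is hypothetical and the changes are assumptions of this example, not part of the answer above.

```python
def split_on_lower_to_upper(string):
    """Hypothetical variant: break only where a lowercase letter is
    followed by an uppercase one, leaving the first character as-is."""
    if not string:
        # string[0] would raise IndexError on empty input
        return []
    parts = [string[0]]
    for c in string[1:]:
        if parts[-1][-1].islower() and c.isupper():
            parts.append(c)   # start a new word
        else:
            parts[-1] += c    # extend the current word
    return parts


if __name__ == "__main__":
    for sample in ("CamelCaseXYZ", "XYZCamelCase", "camelCase"):
        print(sample, "->", split_on_lower_to_upper(sample))
```

Because the only break condition is lowercase followed by uppercase, a leading run of capitals stays attached to the next word, so "XYZCamelCase" is not split the way the question asks.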

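As @AplusKminus has explained, re.split() never splits on an empty pattern match. Therefore, instead of splitting, you should try finding the components you are interested in.

Here is a solution using re.finditer() that emulates splitting: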

from re import finditer

def camel_case_split(identifier):
    matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]
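
A quick way to check this approach against the examples from the question. The pattern and logic are the same as above, only precompiled; the constant name CHUNK is an assumption of this sketch.

```python
import re

# Lazily consume characters up to the next word boundary or end of string.
CHUNK = re.compile(r".+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)")


def camel_case_split(identifier):
    return [m.group(0) for m in CHUNK.finditer(identifier)]


if __name__ == "__main__":
    for sample in ("CamelCaseXYZ", "XYZCamelCase", "camelCase", ""):
        print(repr(sample), "->", camel_case_split(sample))
```

Since `.+?` needs at least one character, empty input yields an empty list.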
Most of the time, when you don't need to check the format of a string, a global search is simpler than a split, and it gives the same result:

['Camel', 'Case', 'XYZ']
To deal with dromedaryCase (a lowercase first word) as well, you can use:

re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')
Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
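
A small sketch comparing the two lookahead spellings on a few inputs; the constant and function names are made up for this example.

```python
import re

# Same word pattern, written with the alternation lookahead ...
WITH_ALTERNATION = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)")
# ... and with the shorter double-negation lookahead.
WITH_DOUBLE_NEGATION = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])")


def compare(sample):
    """Return the matches of both spellings for one input string."""
    return WITH_ALTERNATION.findall(sample), WITH_DOUBLE_NEGATION.findall(sample)


if __name__ == "__main__":
    for sample in ("camelCaseXYZ", "CamelCaseXYZ", "XYZCamelCase"):
        first, second = compare(sample)
        print(sample, "->", first, "| same result:", first == second)
```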
Using re.sub() and split(), the results are:

'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
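
One way to obtain results like those listed above with re.sub() and split(), as a sketch. The function name split_with_sub is hypothetical, and this is not necessarily the exact expression the answer used.

```python
import re


def split_with_sub(name):
    """Hypothetical helper: insert spaces before words, then split."""
    # Put a space in front of every run of capitals ...
    spaced = re.sub(r"([A-Z]+)", r" \1", name)
    # ... and in front of every capitalized word, which separates the last
    # capital of an acronym from the acronym itself.
    spaced = re.sub(r"([A-Z][a-z]+)", r" \1", spaced)
    # split() without arguments drops the surplus whitespace.
    return spaced.split()


if __name__ == "__main__":
    for sample in ("CamelCaseTest123", "CamelCaseXYZ", "XYZCamelCase", "XYZ", "IPAddress"):
        print(repr(sample), "->", split_with_sub(sample))
```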

I stumbled upon this case and wrote a regular expression to solve it. It should actually work for any group of words.

RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |  # Capitalized words / all lower case
    [A-Z]+ |  # All upper case
    \d+  # Numbers
''', re.VERBOSE)
The key here is the lookahead in the first alternative. It will match (and preserve) uppercase words before capitalized ones:

assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
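
A short sketch that exercises the same pattern on a few more inputs, including digits and underscores. The wrapper function words and the sample strings are assumptions of this example.

```python
import re

RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |          # Capitalized words / all lower case
    [A-Z]+ |                # All upper case
    \d+                     # Numbers
''', re.VERBOSE)


def words(text):
    """Hypothetical wrapper around RE_WORDS.findall()."""
    return RE_WORDS.findall(text)


if __name__ == "__main__":
    for sample in ("FOOBar", "CamelCaseTest123", "snake_case_2go", "IPAddress"):
        print(sample, "->", words(sample))
```

Unlike some of the other answers, this pattern returns digits as separate words, and characters that match none of the alternatives (such as underscores) are simply skipped.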

I think the following is the best option:

import re

def count_word():
    return re.findall('[A-Z]?[a-z]+', input('please enter your string'))

print(count_word())

I know that the question is tagged regex. Still, I always try to stay as far away from regex as possible. So here is my solution without regex:

from functools import reduce  # reduce is not a builtin in Python 3

def split_camel(text, char):
    if len(text) <= 1: # To avoid adding a wrong space in the beginning
        return text+char
    if char.isupper() and text[-1].islower(): # Regular Camel case
        return text + " " + char
    elif text[-1].isupper() and char.islower() and text[-2] != " ": # Detect Camel case in case of abbreviations
        return text[:-1] + " " + text[-1] + char
    else: # Do nothing part
        return text + char

text = "PathURLFinder"
text = reduce(split_camel, text, "")  # fold the characters into one spaced string
print(text)
# prints "Path URL Finder"
print(text.split(" "))
# prints "['Path', 'URL', 'Finder']"

Using a more comprehensive approach. It takes care of several issues such as numbers, strings starting with a lowercase letter, single-letter words, etc.

def camel_case_split(identifier, remove_single_letter_words=False):
    """Parses CamelCase and Snake naming"""
    concat_words = re.split('[^a-zA-Z]+', identifier)

    def camel_case_split(string):
        bldrs = [[string[0].upper()]]
        string = string[1:]
        for idx, c in enumerate(string):
            if bldrs[-1][-1].islower() and c.isupper():
                bldrs.append([c])
            elif c.isupper() and (idx+1) < len(string) and string[idx+1].islower():
                bldrs.append([c])
            else:
                bldrs[-1].append(c)

        words = [''.join(bldr) for bldr in bldrs]
        words = [word.lower() for word in words]
        return words
    words = []
    for word in concat_words:
        if len(word) > 0:
            words.extend(camel_case_split(word))
    if remove_single_letter_words:
        subset_words = []
        for word in words:
            if len(word) > 1:
                subset_words.append(word)
        if len(subset_words) > 0:
            words = subset_words
    return words

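My requirements were a bit more specific than the OP's. In particular, in addition to handling all of the OP's cases, I needed the following, which the other solutions do not provide:
- treat all non-alphanumeric input (e.g. !@$%^&*() etc.) as a word separator
- handle digits as follows:
  - they cannot be in the middle of a word
  - they cannot be at the beginning of a word unless the phrase starts with a digit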

def splitWords(s):
    new_s = re.sub(r'[^a-zA-Z0-9]', ' ',                  # not alphanumeric
        re.sub(r'([0-9]+)([^0-9])', '\\1 \\2',            # digit followed by non-digit
            re.sub(r'([a-z])([A-Z])','\\1 \\2',           # lower case followed by upper case
                re.sub(r'([A-Z])([A-Z][a-z])', '\\1 \\2', # upper case followed by upper case followed by lower case
                    s
                )
            )
        )
    )
    return [x for x in new_s.split(' ') if x]
Test loop:

for test in ['', ' ', 'lower', 'UPPER', 'Initial', 'dromedaryCase', 'CamelCase', 'ABCWordDEF', 'CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf']:
    print(test + ':' + str(splitWords(test)))

Working solution, without regexp

I am not that good at regexp. I like to use them for search/replace in my IDE, but I try to avoid them in programs.

Here is a quite straightforward solution in pure Python:

def camel_case_split(s):
    idx = list(map(str.isupper, s))
    # mark change of case
    l = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:  # "Ul"
            l.append(i)
        elif not x and y:  # "lU"
            l.append(i+1)
    l.append(len(s))
    # for "lUl", index of "U" will pop twice, have to filter it
    return [s[x:y] for x, y in zip(l, l[1:]) if x < y]

This solution also supports numbers and spaces, and automatically removes underscores:

def camel_terms(value):
    return re.findall('[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]', value)
Some tests:

tests = [
    "XYZCamelCase",
    "CamelCaseXYZ",
    "Camel_CaseXYZ",
    "3DCamelCase",
    "Camel5Case",
    "Camel5Case5D",
    "Camel Case XYZ"
]

for test in tests:
    print(test, "=>", camel_terms(test))
Results:

XYZCamelCase => ['XYZ', 'Camel', 'Case']
CamelCaseXYZ => ['Camel', 'Case', 'XYZ']
Camel_CaseXYZ => ['Camel', 'Case', 'XYZ']
3DCamelCase => ['3D', 'Camel', 'Case']
Camel5Case => ['Camel', '5', 'Case']
Camel5Case5D => ['Camel', '5', 'Case', '5D']
Camel Case XYZ => ['Camel', 'Case', 'XYZ']
Simple solution:

re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
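
The substitution above returns a single spaced string, so a list of parts needs one more step. A minimal sketch, assuming a space is an acceptable separator; the function name split_simple is hypothetical.

```python
import re


def split_simple(text):
    """Hypothetical helper: space out the words, then split on the space."""
    # Insert a space wherever a lowercase letter or digit is followed by a capital.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
    return spaced.split(" ")


if __name__ == "__main__":
    for sample in ("CamelCaseXYZ", "XYZCamelCase", "Camel5Case"):
        print(sample, "->", split_simple(sample))
```

This only breaks after a lowercase letter or digit, so a leading acronym stays glued to the following word ("XYZCamelCase" gives "XYZCamel" and "Case").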
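
Comments:

Another question already does what you want, and I'm pretty sure there are others.

How is ABC CamelCase?!

@Mihai I do not understand your question. If you wonder how the regex performs on "ABCCamelCase", it works as expected: ['ABC', 'Camel', 'Case']. If you interpreted ABC as standing for something, then I am sorry for the confusion, as ABC is just three arbitrary uppercase letters in my question.

Please read. That is also a good answer, but I did not find that question because its wording was too specific for my search. Also, your answer does not quite do what is asked for here, as it produces a converted string with an arbitrary separation character, which you would still need to split with str.split(' '), instead of a (more versatile) list of its parts.

@SheridanVespo I think the first version may have had an extraneous character that you caught and corrected for me :)

@SheridanVespo Apparently there are several definitions of camel case. Some definitions (and the one I was originally assuming) require the first letter to be capitalized. No worries; the "bug" is an easy fix. Just remove the .upper() call when initializing the list.

Can you create a version that handles those cases? Also, is it…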

Output of the splitWords test loop:

:[]
 :[]
lower:['lower']
UPPER:['UPPER']
Initial:['Initial']
dromedaryCase:['dromedary', 'Case']
CamelCase:['Camel', 'Case']
ABCWordDEF:['ABC', 'Word', 'DEF']
CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf:['Camel', 'Case', 'XY', 'Zand123', 'how23', 'ar23', 'e', 'you', 'doing', 'And', 'ABC123', 'XY', 'Zdf']

Tests for the pure Python camel_case_split(s) solution:

def test():
    TESTS = [
        ("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
        ("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
        ("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
        ("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
        ("Ta", ['Ta']),
        ("aT", ['a', 'T']),
        ("a", ['a']),
        ("T", ['T']),
        ("", []),
        ("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
        ("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
        ("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
    ]
    for (q,a) in TESTS:
        assert camel_case_split(q) == a

if __name__ == "__main__":
    test()

Another short option, using re.sub() with lookarounds followed by split():

import re

re.sub('(?<=[a-z])(?=[A-Z])', ' ', 'camelCamelCAMEL').split(' ')
# ['camel', 'Camel', 'CAMEL'] <-- result

# '(?<=[a-z])' --> means preceding lowercase char (A)
# '(?=[A-Z])'  --> means following UPPERCASE char (B)
# '(A)(B)'     --> 'aA' or 'aB' or 'bA' and so on