Using regular expressions in Python to extract a string and a number from strings in multiple formats?

python regex


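I am trying to parse a string with a regular expression. The string follows a particular format, and I want to extract details from it. My strings come in two formats.

First format

The first format is `foldername-version.tgz`. Here `foldername` can be any string in any form. It can contain one or more additional `-` characters, or anything else.

For example: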

- `hello-1234.tgz`: this should give me `FolderName` as `hello` and `Version` as `1234`
- `world-12345.tgz`: this should give me `FolderName` as `world` and `Version` as `12345`
- `hello-21234-12345.tgz`: this should give me `FolderName` as `hello-21234` and `Version` as `12345`
- `hello-21234-a-12345.tgz`: this should give me `FolderName` as `hello-21234-a` and `Version` as `12345`

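Second format

The second format is `foldername-version-environment.tgz`. Here, too, `foldername` can be any string in any form. The environment string can only be `dev`, `stage`, `prod` and so on, so I also need to add a check for that.

For example: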

- `hello-1234-dev.tgz`: this should give me `FolderName` as `hello` and `Version` as `1234`
- `world-12345-stage.tgz`: this should give me `FolderName` as `world` and `Version` as `12345`
- `hello-21234-12345-prod.tgz`: this should give me `FolderName` as `hello-21234` and `Version` as `12345`
- `hello-21234-a-12345-prod.tgz`: this should give me `FolderName` as `hello-21234-a` and `Version` as `12345`

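Problem statement

So for both formats above, I need to extract `FolderName` and `Version` from the string. I tried the regular expression below, but it does not work for strings in the second format, and I want my code to work for both formats.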

import re

# Sample string, which can be in the first or the second format
exampleString = 'hello-21234-12345-prod.tgz'
# For this input the search returns None, because [\d.-] cannot match "prod"
build_found = re.search(r'[\d.-]+.tgz', exampleString)
version = build_found.group().replace(".tgz", "")
folderName = exampleString.split(version)[0]

What am I doing wrong here?

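You need to capture the components you are looking for with the regular expression, and then use `.groups()` to extract the captures. This worked in my tests:

re.search(r'^(.+)-(\d+)\D*$', exampleString)

Example in IPython: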

In [1]: import re

In [2]: s1 = 'hello-21234-12345-prod.tgz'

In [3]: s2 = 'hello-1234.tgz'

In [4]: re.search(r'^(.+)-(\d+)\D*$', s1).groups()
Out[4]: ('hello-21234', '12345')

In [5]: re.search(r'^(.+)-(\d+)\D*$', s2).groups()
Out[5]: ('hello', '1234')

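The trick is in the capture groups (`(...)`) of the regular expression `r'^(.+)-(\d+)\D*$'`. There are two groups, and the pattern is actually easier to decode if you look at the second capture group first and then the first one.

The second part of the regular expression, `(\d+)\D*$`, matches the last run of `\d` digits. You know it is the last run of digits because the `\D*$` part matches and swallows all non-digit characters at the end of the string.

The first part of the regular expression matches everything before the second part. It captures everything up to, but not including, the `-` that precedes the version, and that gives you the `FolderName`.

Note that if there are any digit characters in the environment or in the file extension (for example, when compressing with bzip2), you will need something more complex.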
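
One way to handle that caveat is to spell out the allowed suffixes instead of relying on `\D*$`. This is only a minimal sketch: the environment list comes from the question, while the extension list (`.tgz`, `.tar.bz2`) and the names `NAME_RE` and `split_name` are assumptions made for illustration.

```python
import re

# Assumed suffixes: the environments come from the question; the
# extension list is hypothetical and only shows a suffix containing digits.
NAME_RE = re.compile(
    r'^(.+)-(\d+)'               # folder name, then the version digits
    r'(?:-(?:dev|stage|prod))?'  # optional environment
    r'\.(?:tgz|tar\.bz2)$'       # explicit extension, which may contain digits
)

def split_name(filename):
    """Return (folder_name, version), or None if the name does not fit."""
    m = NAME_RE.match(filename)
    return m.groups() if m else None

print(split_name('hello-21234-12345-prod.tar.bz2'))
print(split_name('hello-1234.tgz'))
```

Because the suffix is matched explicitly, a digit in the extension no longer competes with the version number.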
I would use:
inp = "some text hello-21234-a-12345.tgz some more text"
parts = re.findall(r'\b([^\s-]+(?:-[^-]+)*)-(\d+)(?:-[^-]+)*\.\w+\b', inp)
print("FolderName: " + parts[0][0])
print("Version: " + parts[0][1])

This prints:
FolderName: hello-21234-a
Version: 12345
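
One thing to watch with this pattern: `[^-]+` also matches whitespace, so when the text contains several archive names a single match can span two of them. The following is a hedged variant, not the answer's own code, that uses `[^\s-]` in every segment. The name `find_archives` and the input text are made up for illustration.

```python
import re

# Variant of the pattern above with [^\s-] in every segment, so a single
# match cannot run across whitespace into the next file name.
ARCHIVE_IN_TEXT = re.compile(
    r'\b([^\s-]+(?:-[^\s-]+)*)-(\d+)(?:-[^\s-]+)*\.\w+\b'
)

def find_archives(text):
    """Return a list of (folder_name, version) tuples found in text."""
    return ARCHIVE_IN_TEXT.findall(text)

# Made-up input containing one name in each format
text = "got hello-1234.tgz and world-12345-stage.tgz today"
for folder, version in find_archives(text):
    print("FolderName: " + folder + ", Version: " + version)
```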

Use groups to specify the different parts of the string. You can also name them, which makes them easier to extract later:
import re

pattern = re.compile(r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz")

ex = "hello-21234-12345-prod.tgz"  # With environment
m = pattern.match(ex)
print(m.groups())
# ('hello-21234', '12345', 'prod')
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 12345 prod

ex2 = "hello-21234-1234.tgz"  # No environment
m = pattern.match(ex2)
print(m.groups())
# ('hello-21234', '1234', None)
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 1234 None
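
The named-group pattern can be wrapped in a small helper that returns a dictionary. This is a sketch under assumptions: the helper name `parse_archive` is made up, `fullmatch` is used instead of `match` so that trailing text after `.tgz` is rejected, and the environment list is taken to be exactly `dev`, `stage` and `prod`.

```python
import re

# Same named-group idea as above; the environment list is assumed to be
# exactly dev, stage and prod.
ARCHIVE_RE = re.compile(
    r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz"
)

def parse_archive(name):
    """Return a dict with FolderName, Version (as int) and Env, or None."""
    m = ARCHIVE_RE.fullmatch(name)  # the whole name must fit the pattern
    if m is None:
        return None
    info = m.groupdict()
    info["Version"] = int(info["Version"])
    return info

for name in ["hello-1234.tgz", "hello-21234-a-12345.tgz",
             "world-12345-stage.tgz", "hello-21234-a-12345-prod.tgz"]:
    print(name, "->", parse_archive(name))
```

A name with an unlisted environment, such as `hello-1234-qa.tgz`, returns `None`, which covers the check on the environment string that the question asks for.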

See whether this pattern works:
import re
exampleString = 'hello-21234-12345-prod.tgz'
# "-env" is optional as a whole, so names in the first format match as well
build_found = re.search(r'([\w-]+)-(\d+)(?:-(dev|stage|prod))?\.tgz$', exampleString)

folder_name = build_found[1]
version = build_found[2]
environment = build_found[3]

print(folder_name)
print(version)
print(environment)

Output:
hello-21234
12345
prod

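Certainly not the best way, but here is an idea.
First, determine whether you have the first or the second case:
-(dev|stage|prod)\.tgz$

This regular expression tells you whether you have case 1 or case 2.
If it is case 1, you can extract the foldername with:
.*-

and the version with:
-\d+\.tgz$

If it is case 2, you can first extract the combined foldername/version part with:
.*-

From there, you can (again) extract the foldername with:
.*-

and the version number with:
-\d+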
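
The two-step idea above could be sketched roughly as follows. It is not the answer's exact sequence of patterns: the suffix is stripped first, and then a single anchored pattern splits folder name and version. The names `ENV_SUFFIX` and `two_step_parse` are made up for illustration.

```python
import re

# Step 1 pattern: detects case 2 (an environment before ".tgz")
ENV_SUFFIX = re.compile(r'-(dev|stage|prod)\.tgz$')

def two_step_parse(name):
    """Strip the suffix first, then split folder name and version."""
    if ENV_SUFFIX.search(name):
        stem = ENV_SUFFIX.sub('', name)   # case 2: drop "-env.tgz"
    elif name.endswith('.tgz'):
        stem = name[:-len('.tgz')]        # case 1: drop ".tgz"
    else:
        return None
    # Step 2: the last "-digits" run is the version, the rest is the folder
    m = re.match(r'(.*)-(\d+)$', stem)
    return (m.group(1), m.group(2)) if m else None

print(two_step_parse('hello-21234-12345-prod.tgz'))
print(two_step_parse('hello-1234.tgz'))
```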

Is the version always an integer?
Yes, the version is always an integer.