Python: using startswith and index to find a substring in a structured string
I'm trying to write code that finds a substring (a number or anything else) in a structured string. The string has one of two possible structures:
string = "1x substring 3x 4x"
string = "4x 3x substring 1x"
x can be any character. The substring is referred to by its position, e.g. 'pos. 2'.
Normally the code below works, but now I also have to consider special cases. I have tried `i.startswith(('3', '4'))`, but it didn't help. Strings 1-8 explain the logic with a simple example; strings 9-10 show a complex example, where the code should extract the substrings at positions 2, 5 and 7. I hope you can help me find a solution for all strings/special cases, so that the output is `clean: 80` in all cases. :-)
String 9:
clean: pos2 = '$80' pos5 = '75.000 kg' pos7 = '22 sec'
# str1-8: easy example strings
# normal string
str1 = '1x 80 3x 4x'
str2 = '4x 3x 80 1x'
# missing number/pos. 3
str3 = '1x 80 4x '
str4 = '4x 80 1x'
str3a = '1. A $67 4. A 69.000kg 6. A 12sec 8. B 9. B'
# result: clean: pos2 = '$67' pos5 = '69.000kg' pos7 = '12sec'
str4a = '9B 8B 22 sec 6A 75.000kg 4b $80 1b'
# result: clean: pos2 = '$80' pos5 = '75.000 kg' pos7 = '22 sec'
# missing number/pos. 1 => number is at the start or end of the string
str5 = '80 3x 4x'
str6 = '4x 3x 80'
str5a = '10 Mrd 3: A 4: A 50 .379 6: A 7:19 8: B 9: D '
# result: clean: pos2 = 10 Mrd, pos5 = 50,379 pos7 = 7:19 (or just 19 in raw string without 7: if it's easier)
str6a = ' 9a 8b 10 6b 60000 4a 3 b 50 '
# result: clean: pos2 = 50, pos5 = 60000 pos7 = 10
# Optional (rare case)
# missing number/pos. 1 and 3
str7 = '80 4x'
str8 = '4x 80'
str7a = '10 Mrd 4: A 50 .379 6: A 7:19 8: B 9: D '
# result: clean: pos2 = 10 Mrd, pos5 = 50,379 pos7 = 7:19 (or just 19 in raw string without 7: if it's easier)
str8a = ' 9a 8b 10 6b 60000 4a 50 '
# result: clean: pos2 = 50, pos5 = 60000 pos7 = 10
# complex realistic strings
str9 = '9B 8B 22 sec 6A 75.000kg 4b 3b $80 1b'
str10 = '1. A $67 3. A 4. A 69.000kg 6. A 12sec 8. B 9. B'
# missing number/pos. 4 or 6 (pos. 6 optional, because that's difficult I guess)
str11 = '1. A $67 3. A 69.000kg 6. A 12sec 8. B 9. B'
# result: clean: pos2 = '$67' pos5 = '69.000kg' pos7 = '12sec'
str12 = '1. A $67 3. A 4a 69.000kg 12sec 8. B 9. B'
# result: clean: pos2 = '$67' pos5 = '69.000kg' pos7 = '12sec'
x_list = [str1, str2, str3, str4, str5, str6, str7, str8, str9, str10, str11, str12]
for x in x_list:
    print("raw " + x)
    values = ['1x', '3x', '4x']
    try:
        for i in values:
            if i.startswith('3'):
                foo = i
            if i.startswith("1"):
                baa = i
        start = x.index(foo) + len(foo)
        end = x.index(baa)
        if start < end:
            pass
            number = x[start:end].strip(' ')
        else:
            start = x.index(baa) + len(baa)
            end = x.index(foo)
            number = x[start:end].strip(' ')
    except:
        number = '0'
    print("clean " + number)
String 10:
clean: pos2 = '$67' pos5 = '69.000kg' pos7 = '12sec'
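If I understand your goal correctly, it seems to me that you are overcomplicating this. I wrote a function that splits the input string into a list and checks whether each segment matches the format 1x, 3x or 4x. Check it out and let me know if it's not what you need.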
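# we use regex to check for a match with the format
import re

def find_substr(x):
    # split on whitespace into a list
    seg = x.split()
    # check each word for the desired format
    for i in range(len(seg)):
        # a word is a marker if it matches any of 1x, 3x or 4x
        if not any(re.search(str(j) + "x", seg[i]) for j in [1, 3, 4]):
            # this word doesn't fit the format, so it is the substring
            return seg[i]
    # every word was a marker
    return None

# list of strings
x_list = ["1x 80 3x 4x", "4x 3x 80 1x"]
for x in x_list:
    print(find_substr(x))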
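I modified the code a little to improve readability. I don't know if this is what you want, but it works. If you have any questions, let me know. I'm happy to help.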
# x_list is the list of strings defined in the question
for x in x_list:
    print("raw " + x)
    # splits the string into a list, separating on spaces (e.g. ['1x', '80', '3x', '4x'])
    y = x.split(" ")
    # a is the substring that you are checking in the list
    a = '80'
    # fall back to '0' when the substring is not in the list
    number = '0'
    if a in y:
        index = y.index(a)
        number = y[index]
    print("clean " + number)
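This looks like a job for regular expressions.
So the main thing here is to identify the position markers, whose structure is known. For the regular expression, we check that it is a single digit (`[1-9]`), then optionally a `.`, a `:` or a space (`(?:[\.\:]?)`), then a letter (`[a-zA-Z]`), then another space or the end of the string (`(?: |$)`). `(?:…)` means the group is a non-capturing group; for more details about these, check the documentation of the `re` module.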
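The pattern itself is not shown here, so the following is only a guess at how the pieces described above might fit together. The name `MARKER_RE` is hypothetical, and the optional space between the punctuation and the letter is an assumption made so that markers such as `1. A` and `3 b` from the examples match.

```python
import re

# Hypothetical reconstruction; the answer's own POSRE may differ.
MARKER_RE = re.compile(
    r"([1-9]"      # a single digit from 1 to 9
    r"[.:]?"       # optionally "." or ":"
    r" ?"          # optionally one space (assumption)
    r"[a-zA-Z]"    # one letter
    r"(?: |$))"    # a space or the end of the string, non-capturing
)

def is_marker(token):
    """True if the whole token looks like a position marker."""
    return MARKER_RE.fullmatch(token) is not None

for token in ["1. A", "3: A", "4a", "3 b", "9B", "22 sec", "75.000kg", "7:19"]:
    print(token, is_marker(token))
```

The outer parentheses form a capturing group on purpose: with one, `re.split` keeps the markers in its result instead of discarding them.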
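We use it in `re.split` to split the text into matching and non-matching parts, then strip the surrounding whitespace from each part and filter out the ones that turn out to be empty.
Each part is then tagged with its position if it is a marker string, or with `None` if it is not.
Then come a few simple checks, like the order in which the markers appear, reversing it if needed so that we always return in the same order, and in `final` we extract what we need, check the final cases, and adjust and complete accordingly.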
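The split, strip, filter, tag and reverse steps can be sketched as follows. This is a minimal illustration, not the answer's code: `tokenize` and `MARKER_RE` are hypothetical names, and the pattern is an assumed stand-in for the answer's `POSRE`.

```python
import re

# Assumed stand-in for the answer's POSRE (not shown in the text)
MARKER_RE = r"([1-9][.:]? ?[a-zA-Z](?: |$))"

def tokenize(rawtext):
    """Return (text, marker_digit_or_None) pairs in ascending marker order."""
    # the capturing group makes re.split keep the markers in its output
    parts = filter(None, map(str.strip, re.split(MARKER_RE, rawtext)))
    # tag markers with their digit, data with None
    tagged = [(p, int(p[0]) if re.match(MARKER_RE, p) else None) for p in parts]
    digits = [d for _, d in tagged if d is not None]
    # descending markers mean the string is written back to front
    if digits != sorted(digits):
        tagged.reverse()
    return tagged

print(tokenize("4x 3x 80 1x"))
# [('1x', 1), ('80', None), ('3x', 3), ('4x', 4)]
```

Once every string is normalized to ascending order, the data points are simply the entries tagged `None`, in the order pos. 2, pos. 5, pos. 7.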
And a small test:
text="""1. A $67 4. A 69.000kg 6. A 12sec 8. B 9. B
9B 8B 22 sec 6A 75.000kg 4b $80 1b
10 Mrd 3: A 4: A 50 .379 6: A 7:19 8: B 9: D
9a 8b 10 6b 60000 4a 3 b 50
10 Mrd 4: A 50 .379 6: A 7:19 8: B 9: D
9a 8b 10 6b 60000 4a 50
9B 8B 22 sec 6A 75.000kg 4b 3b $80 1b
1. A $67 3. A 69.000kg 6. A 12sec 8. B 9. B
1. A $67 3. A 4a 69.000kg 12sec 8. B 9. B
9a 8b 6b 4a 3 b 50 1b""".splitlines()
for t in text:
    print(f"raw: {t!r}\nresult: ",extrator(t) )
    print()
This gives us:
raw: '1. A $67 4. A 69.000kg 6. A 12sec 8. B 9. B'
result: ['$67', '69.000kg', '12sec']
raw: '9B 8B 22 sec 6A 75.000kg 4b $80 1b'
result: ['$80', '75.000kg', '22 sec']
raw: '10 Mrd 3: A 4: A 50 .379 6: A 7:19 8: B 9: D'
result: ['10 Mrd', '50 .379', '7:19']
raw: '9a 8b 10 6b 60000 4a 3 b 50'
result: ['50', '60000', '10']
raw: '10 Mrd 4: A 50 .379 6: A 7:19 8: B 9: D'
result: ['10 Mrd', '50 .379', '7:19']
raw: '9a 8b 10 6b 60000 4a 50'
result: ['50', '60000', '10']
raw: '9B 8B 22 sec 6A 75.000kg 4b 3b $80 1b'
result: ['$80', '75.000kg', '22 sec']
raw: '1. A $67 3. A 69.000kg 6. A 12sec 8. B 9. B'
result: ['$67', '69.000kg', '12sec']
raw: '1. A $67 3. A 4a 69.000kg 12sec 8. B 9. B'
result: ['$67', '69.000kg', '12sec']
raw: '9a 8b 6b 4a 3 b 50 1b'
result: ['50', None, None]
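Update 2
Here is a version that identifies which data point is which, given some assumptions, such as:
- there are only position markers and data, and the data is only at positions 2, 5 and 7
- the previous regular expression can identify the position markers
- any of them may be missing
- the data has no whitespace in it, so if a relevant position marker is missing and fewer data points than expected are found, two or more of them may have been grouped into one extracted data point, which can then be split safely; if that is not the case, adjust those parts accordingly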
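The last assumption can be illustrated with a small sketch. When the marker between two values is missing, the values come back as one chunk, and a plain `str.split` separates them again. The helper name `split_merged` and its padding behaviour are hypothetical; padding at the end mirrors the choice to treat the later position as the missing one.

```python
def split_merged(chunk, expected):
    """Split a chunk that may hold several whitespace-free values.

    Returns a list of length `expected`, padded with None at the end.
    """
    parts = chunk.split()
    if len(parts) > expected:
        # more pieces than positions: the no-whitespace assumption is violated
        raise ValueError(f"cannot split {chunk!r} into {expected} values")
    return parts + [None] * (expected - len(parts))

# marker 6 missing: positions 5 and 7 arrive as a single chunk
print(split_merged("69.000kg 12sec", 2))  # ['69.000kg', '12sec']
# only one value present: the later position is reported as missing
print(split_merged("69.000kg", 2))        # ['69.000kg', None]
```

Values that legitimately contain a space, such as `22 sec`, would be split wrongly by this, which is why the assumption only holds when a marker is actually missing.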
import re

# POSRE is the position-marker regular expression described earlier;
# its definition is not shown here and must exist before calling extrator()

def extrator(rawtext):
    fil = filter(None,map(str.strip,re.split(POSRE,rawtext)))
    proc = [(x,int(x[0]) if re.match(POSRE,x) else None) for x in fil] #process raw data
    pos = [p for x,p in proc if p is not None ] #position markers present
    if sorted(pos)!=pos:
        proc = list(reversed(proc))
    data = [x for x,p in proc if p is None]
    pos = {p:i for i,(x,p) in enumerate(proc) if p is not None } #pos marker:index of it
    #print(f"{proc=}")
    if len(data)==3:
        return dict(zip((2,5,7),data))
    #from here, a,b,c will represent data in position 2,5 and 7 respectively
    elif len(data)==2:
        a,b = data
        #c = None
        if 3 in pos or 4 in pos:
            if 6 in pos:
                #one of 2, 5 or 7 is missing
                i = proc.index( (a,None) )
                i34 = pos[3] if 3 in pos else pos[4]
                if i < i34:
                    #a is 2, b is 5 or 7
                    j = proc.index( (b,None) )
                    if j < pos[6]:
                        #7 is missing
                        c = None
                    else:
                        #5 is missing
                        b,c = None,b
                else:
                    #2 is missing, a is 5 thus b is 7
                    a,b,c = None,a,b
            else:
                #a is 2, b may be 5 or 7 or both
                t = b.split()
                if len(t) == 2:
                    #b was both
                    b,c = t
                elif len(t) == 1:
                    #b is 5 or 7
                    print("either 5 or 7 is missing, picked 7 as missing")
                    c = None
                else:
                    #b was split into more than 2 parts
                    raise RuntimeError("unknown case 1")
        else:
            #3 and 4 are missing
            if 6 in pos:
                #a may be 2 or 5 or both, b is 7
                c = b
                t = a.split()
                if len(t) == 2:
                    #a was both
                    a,b = t
                elif len(t) == 1:
                    print("either 2 or 5 is missing, picked 5 as missing")
                    b = None
                else:
                    #a was split into more than 2 parts
                    raise RuntimeError("unknown case 2")
            else:
                raise RuntimeError("Fatal error: 2 data points with no marker in between")
        return dict(zip((2,5,7),(a,b,c)))
    elif len(data)==1:
        a = data[0]
        i = proc.index( (a,None) )
        #b,c = None, None
        if 3 in pos or 4 in pos:
            i34 = pos[3] if 3 in pos else pos[4]
            if 6 in pos:
                #only one of 2,5 or 7 is present
                if i < i34:
                    #a is 2 the rest is missing
                    b,c = None, None
                elif i < pos[6]:
                    #a is 5
                    a,b,c = None, a, None
                else:
                    #a is 7
                    a,b,c = None, None, a
            else:
                #a is 2 or a is 5 or 7 or both
                if i < i34:
                    #a is 2, the rest is missing
                    b,c = None, None
                else:
                    #2 is missing, a is 5 or 7 or both
                    a,b = None, a
                    t = b.split()
                    if len(t) == 2:
                        b,c = t
                    elif len(t) == 1:
                        print("either 5 or 7 is missing, picked 7 as missing")
                        c = None
                    else:
                        raise RuntimeError("unknown case 3")
        else:
            #3 and 4 are missing
            if 6 in pos:
                if pos[6] < i:
                    #a is 7, the rest is missing
                    a,b,c = None, None, a
                else:
                    #7 is missing, a is 2 or 5 or both
                    c = None
                    t = a.split()
                    if len(t) == 2:
                        a,b = t
                    elif len(t) == 1:
                        print("either 2 or 5 is missing, picked 5 as missing")
                        b = None
                    else:
                        raise RuntimeError("unknown case 4")
            else:
                #a is 2, 5 or 7 or any combination of them
                t = a.split()
                if len(t) == 3:
                    a,b,c = t
                elif len(t) == 2:
                    print("one of 2, 5 or 7 is missing, picked 7 as missing")
                    a,b = t
                    c = None
                elif len(t) == 1:
                    print("only one of 2, 5 or 7 is present, picked 2 as present")
                    b,c = None, None
                else:
                    raise RuntimeError("unknown case 5")
        return dict(zip((2,5,7),(a,b,c)))
    elif len(data) == 0:
        return dict.fromkeys( (2,5,7) )
    else:
        raise RuntimeError("unknown case 6: more than 3 data points")

def test():
    text="""1. A $67 4. A 69.000kg 6. A 12sec 8. B 9. B
9B 8B 22 sec 6A 75.000kg 4b $80 1b
10 Mrd 3: A 4: A 50 .379 6: A 7:19 8: B 9: D
9a 8b 10 6b 60000 4a 3 b 50
10 Mrd 4: A 50 .379 6: A 7:19 8: B 9: D
9a 8b 10 6b 60000 4a 50
9B 8B 22 sec 6A 75.000kg 4b 3b $80 1b
1. A $67 3. A 69.000kg 6. A 12sec 8. B 9. B
1. A $67 3. A 4a 69.000kg 12sec 8. B 9. B
9a 8b 6b 4a 3 b 50 1b
9 a 8b 6 b 55 4a 3 b 1b
9a 8:b 777 6 b 4.a 3 b 1b
9a 8:b 777 6 b 4.a 3 b 55 1b
""".splitlines()
    for t in text:
        print(f"raw: {t!r}\nresult: ",extrator(t) )
        print()
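Thanks for your prompt reply. That is a fair objection. But I made the code this complicated because the actual strings are also much more complicated. Meaning: there are several substrings at different but defined positions. Also, the structure of the indicators varies, i.e. 1x can be "1a", "1 a" or "1:a". Only the number is the same.
Thanks for your quick reply. Please see my comment on the other answer. Your code is fine for this example, but unfortunately my real strings are complex. I hope you can still find a solution. Thanks for your effort.
Thanks for your work. My example seemed to be too easy to show what I need. Sorry. I added strings 9 and 10 to the question as realistic strings to make it clearer. I extracted `values = ['1x', '3x', '4x']` in an earlier step. That's why I use startswith() to compare it with the string. Maybe that's not the best way, but to me it was close to the final solution, considering the realistic strings. :-)
OK, and what result do you want from the realistic examples?
OK, I added the results for the realistic examples. Remember, it should work like this. For the string at position 2: if position 3 is missing, use position 4 as the indicator; or, if position 1 is missing, use the start/end of the string. Optional (if possible), for the string at position 5, similarly: if position 4 is missing, use position 3 as the indicator, and if position 6 is missing, use position 7. (But I guess that's hard to solve.) Thanks for your time.
Can you add realistic examples for each of the other cases, with their expected results? I'll try again later.
Yes, you're right. I need to be sure which one each is. A result like ($50 0 75.000kg 0) if values are missing. But I see your point. I have to go and analyze…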
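The fallback rule described in these comments (use marker 4 when marker 3 is missing, and use the start or end of the string when marker 1 is missing) can be sketched with `startswith` and `index` for the simple strings 1-8. This is only an illustration under that restriction: `extract_pos2` is a hypothetical helper, it assumes the markers were already extracted into a list as the asker describes, and it does not cover the realistic strings.

```python
def extract_pos2(text, markers=('1x', '3x', '4x')):
    """Return the value at position 2 of a simple string, or None."""
    tokens = text.split()
    present = [m for m in markers if m in tokens]
    # startswith(('3', '4')) cannot express "prefer 3 over 4",
    # so the two digits are tried one after the other
    right = next((m for d in ('3', '4') for m in present if m.startswith(d)), None)
    left = next((m for m in present if m.startswith('1')), None)
    if right is None:
        return None
    r = text.index(right)
    if left is not None:
        l = text.index(left)
        if l < r:
            # forward order: 1x value 3x
            return text[l + len(left):r].strip()
        # reversed order: 3x value 1x
        return text[r + len(right):l].strip()
    # marker 1 is missing: the value sits at the start or end of the string
    before = text[:r].strip()
    after = text[r + len(right):].strip()
    return before if before and before not in markers else after

for s in ['1x 80 3x 4x', '4x 3x 80 1x', '1x 80 4x ', '4x 80 1x',
          '80 3x 4x', '4x 3x 80', '80 4x', '4x 80']:
    print(repr(s), '->', extract_pos2(s))
```

For markers whose form varies ("1a", "1 a", "1:a"), a regular expression is a better fit than exact string lookups, which is the direction the last answer takes.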
>>> test()
raw: '1. A $67 4. A 69.000kg 6. A 12sec 8. B 9. B'
result: {2: '$67', 5: '69.000kg', 7: '12sec'}
raw: '9B 8B 22 sec 6A 75.000kg 4b $80 1b'
result: {2: '$80', 5: '75.000kg', 7: '22 sec'}
raw: '10 Mrd 3: A 4: A 50 .379 6: A 7:19 8: B 9: D'
result: {2: '10 Mrd', 5: '50 .379', 7: '7:19'}
raw: '9a 8b 10 6b 60000 4a 3 b 50'
result: {2: '50', 5: '60000', 7: '10'}
raw: '10 Mrd 4: A 50 .379 6: A 7:19 8: B 9: D'
result: {2: '10 Mrd', 5: '50 .379', 7: '7:19'}
raw: '9a 8b 10 6b 60000 4a 50'
result: {2: '50', 5: '60000', 7: '10'}
raw: '9B 8B 22 sec 6A 75.000kg 4b 3b $80 1b'
result: {2: '$80', 5: '75.000kg', 7: '22 sec'}
raw: '1. A $67 3. A 69.000kg 6. A 12sec 8. B 9. B'
result: {2: '$67', 5: '69.000kg', 7: '12sec'}
raw: '1. A $67 3. A 4a 69.000kg 12sec 8. B 9. B'
result: {2: '$67', 5: '69.000kg', 7: '12sec'}
raw: '9a 8b 6b 4a 3 b 50 1b'
result: {2: '50', 5: None, 7: None}
raw: '9 a 8b 6 b 55 4a 3 b 1b'
result: {2: None, 5: '55', 7: None}
raw: '9a 8:b 777 6 b 4.a 3 b 1b'
result: {2: None, 5: None, 7: '777'}
raw: '9a 8:b 777 6 b 4.a 3 b 55 1b'
result: {2: '55', 5: None, 7: '777'}
>>>