Python 3.x pandas: get the substring between a start character and an end character

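I am trying to get the substring that lies between a given start character and a given end character. I have tried several different regex patterns and I am close to the output I need, but it is not quite right. What can I do to fix this?

Data (CSV):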

ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling

Code:

Attempt:

Output from the above code: it does not pick up the "1#USA" at the end of the line.

Desired output:

1#London
1#England
1#USA
NaN
NaN
1#China

You can try:

# capture all characters that are neither `#` nor digits
# following 1#
df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)

Output:

0     1#London
1    1#England
2        1#USA
3          NaN
4          NaN
5      1#China
Name: TEST, dtype: object
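
To reproduce the result above end to end, here is a minimal sketch. It assumes the sample data from the question is read from an in-memory string; the names `csv_text` and `city` are only illustrative and are not part of the original post.

```python
import io
import pandas as pd

# Sample data from the question, loaded from an in-memory CSV string.
csv_text = """ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
"""
df = pd.read_csv(io.StringIO(csv_text))

# `1#` followed by one or more characters that are neither `#` nor a digit.
# Rows without a match (including the empty TEST value in row `ghi`) give NaN.
city = df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)
print(city)
```

With `expand=False` and a single capture group the result is a Series; leaving `expand` at its default returns a one-column DataFrame instead.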

You can do this:

>>> df.TEST.str.extract("(1#[a-zA-Z]*)")
           0
0   1#London
1  1#England
2      1#USA
3        NaN
4        NaN
5    1#China
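
One detail of this pattern worth checking against your own data: `[a-zA-Z]*` also accepts zero letters, so a bare `1#` followed by a digit would still match. The sketch below compares `*` with `+` on a small Series whose values are hypothetical and chosen only to show the difference.

```python
import pandas as pd

# Hypothetical values, not from the question: the second one has `1#`
# followed directly by a digit, the third one is missing.
s = pd.Series(['1#London4#Harry', '21#5Rowling', None])

# `*` allows zero letters, so the bare `1#` in the second value still matches.
star = s.str.extract(r'(1#[a-zA-Z]*)', expand=False)

# `+` requires at least one letter after `1#`, so the second value gives NaN.
plus = s.str.extract(r'(1#[a-zA-Z]+)', expand=False)

print(star.tolist())
print(plus.tolist())
```

For the sample data in the question both quantifiers give the same result, because every `1#` there is followed by a name.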

How about:

# lazy `.*?` stops at the first `#`, digit, or end of string after 1#
df['TEST'].astype(str).str.extract(r'(1#.*?(?=#|$|\d))')
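
The lookahead approach depends on whether the quantifier in front of it is greedy or lazy. As a rough illustration using the standard `re` module (the variable names are illustrative only), a greedy `.*` runs to the end of the string, where `$` satisfies the lookahead, while a lazy `.*?` stops at the first `#`, digit, or end of string:

```python
import re

text = '1#London4#Harry Potter#5Rowling##'

# Greedy: `.*` consumes the rest of the line, then `$` satisfies the lookahead.
greedy = re.search(r'(1#.*(?=#|$|\d))', text).group(1)

# Lazy: `.*?` grows one character at a time and stops before the digit `4`.
lazy = re.search(r'(1#.*?(?=#|$|\d))', text).group(1)

# A value that ends the line is picked up through the `$` alternative.
tail = re.search(r'(1#.*?(?=#|$|\d))', '4#Harry Potter#5Rowling##1#USA').group(1)

print(greedy)  # 1#London4#Harry Potter#5Rowling##
print(lazy)    # 1#London
print(tail)    # 1#USA
```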