Python 3.x pandas: get the substring between start and end characters
I am trying to get the substring between different start and end characters. I have tried several regex patterns and I am close to the output I need, but it is not quite right. What can I do to fix this?
CSV data:
ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
Code:
Attempt:
Output from the above code: it does not pick up "1#USA" at the end of the line.
Desired output:
1#London
1#England
1#USA
NaN
NaN
1#China
You can try:
# capture all characters that are neither `#` nor digits
# following 1#
# (raw string so that \d reaches the regex engine unchanged)
df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)
Output:
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
Name: TEST, dtype: object
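For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes the sample rows from the question are loaded through `io.StringIO` rather than from a file on disk; the variable names `CSV_TEXT` and `result` are only illustrative.

```python
import io

import pandas as pd

# Sample data copied from the question.
CSV_TEXT = """ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
"""

# The empty TEST cell in row "ghi" is read as NaN.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# "1#" followed by one or more characters that are neither "#" nor a digit.
# expand=False returns a Series instead of a one-column DataFrame.
result = df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)
print(result)
```

Rows without a `1#` marker, and the empty row, come back as NaN, which is what the desired output asks for.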
You can do this:
>>> df.TEST.str.extract("(1#[a-zA-Z]*)")
0
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
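Both character classes give the same result on the question's data, but they are not equivalent in general. The sketch below uses the standard `re` module and two made-up strings (they are not from the question) to show where `[a-zA-Z]*` and `[^#\d]+` may diverge: values containing a space, and a `1#` with nothing after it.

```python
import re

letters_only = re.compile(r'(1#[a-zA-Z]*)')   # zero or more ASCII letters
not_hash_digit = re.compile(r'(1#[^#\d]+)')   # one or more non-"#", non-digit chars


def first_match(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical inputs, chosen only to illustrate the difference.
with_space = "4#Harry Potter1#New York#5Rowling"
empty_value = "4#Harry Potter1##5Rowling"

# A space ends the letters-only class but not the negated class.
space_letters = first_match(letters_only, with_space)      # '1#New'
space_negated = first_match(not_hash_digit, with_space)    # '1#New York'

# "*" accepts an empty value after "1#"; "+" requires at least one character.
empty_letters = first_match(letters_only, empty_value)     # '1#'
empty_negated = first_match(not_hash_digit, empty_value)   # None
```

Which behavior is preferable depends on whether values can contain spaces and whether a bare `1#` should count as a match.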
How about:
# lazily capture everything after 1# up to the next `#`, digit, or end of string
df['TEST'].astype(str).str.extract(r'(1#.*?(?=#|$|\d))')
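With a lookahead pattern of this shape, the quantifier probably needs to be lazy (`.*?`): a greedy `.*` runs to the end of the string, where `$` already satisfies the lookahead. The following sketch checks both variants against the question's rows using the standard `re` module; the helper `first_match` is only illustrative.

```python
import re

# Non-empty TEST values copied from the question.
rows = [
    "1#London4#Harry Potter#5Rowling##",
    "6#Harry Potter1#England#5Rowling",
    "4#Harry Potter#5Rowling##1#USA",
    "4#Harry Potter5#Rowling",
    "4#Harry Potter1#China#5Rowling",
]

# Lazy: stop at the first position followed by "#", a digit, or end of string.
lazy = re.compile(r'(1#.*?(?=#|$|\d))')
# Greedy: consume as much as possible, so the match extends to the end.
greedy = re.compile(r'(1#.*(?=#|$|\d))')


def first_match(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


lazy_results = [first_match(lazy, row) for row in rows]
greedy_results = [first_match(greedy, row) for row in rows]

print(lazy_results)
print(greedy_results)
```

On the first row the lazy version yields `1#London`, while the greedy version returns the whole string.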