Python: select partial strings from a list within a DataFrame column
I have a DataFrame:
import pandas as pd
d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)
df
   fruit                               values
0  apple                  apple_1_0,peach_1_5
1   pear                             pear_1_3
2  peach  mango_1_0,banana_1_0,pineapple_1_10
The strings in the 'values' column are comma-separated, and I want to keep only the items that contain the substring "_1_0".
Expected output:
Something like this gets close to what I'm after, but it is very slow on ~100,000 rows:
for row in range(len(df)):
    print([zero for zero in df['values'].str.split(',', expand=False)[row] if "_1_0" in zero])
['apple_1_0']
[]
['mango_1_0', 'banana_1_0']
Let's try explode.
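The explode suggestion comes without code, so here is a minimal sketch of how it might look. It is an assumption about what the answer intended, not the answerer's own code: split each string into a list, explode to one item per row, filter on the substring, then join the survivors back per original row. The names items and kept are made up for illustration.

```python
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'],
     'values': ['apple_1_0,peach_1_5', 'pear_1_3',
                'mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)

# One row per comma-separated item; the original row label repeats in the index.
items = df['values'].str.split(',').explode()

# Keep items containing the literal substring "_1_0" (regex=False: plain text match),
# then join what is left back together for each original row label.
kept = items[items.str.contains('_1_0', regex=False)].groupby(level=0).agg(','.join)

# Rows where nothing matched are missing from `kept`; reindex brings them back as NaN.
df['values'] = kept.reindex(df.index)
print(df)
```

Because the filter runs once over the whole exploded Series rather than once per row in a Python loop, this should scale better than the loop in the question, though I have not timed it on 100,000 rows.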
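A simple solution:
import numpy as np
import pandas as pd
d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)
new_data = df['values'].str.split(',')
new_data = new_data.apply(lambda lst: [element for element in lst if "_1_0" in element])
new_data = new_data.str.join(',')
new_data = new_data.replace('', np.nan)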
Here is an alternative, as a list comprehension:
# Filter inside the join so that dropped items leave no stray commas behind
df["values"] = [",".join(entry
                         for entry in val.split(",")
                         if entry.endswith("_1_0"))
                for val in df["values"]
                ]
df = df.replace({"": np.nan})
df
fruit values
0 apple apple_1_0
1 pear NaN
2 peach mango_1_0,banana_1_0
With findall, you can do the following:
import numpy as np
import pandas as pd
d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)
df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(','.join).replace('', np.nan)
print(df)
The regex [^,]*_1_0(?=,|$) matches a run of non-comma characters that ends with _1_0 and is followed by either a comma or the end of the string.
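To see what the lookahead buys you, here is a small sketch using the standard re module directly on the sample strings. The string 'kiwi_1_05' and the variable names are made up for illustration; they are not part of the question's data.

```python
import re

with_lookahead = re.compile(r'[^,]*_1_0(?=,|$)')
without_lookahead = re.compile(r'[^,]*_1_0')

# The three strings from the question
row0 = with_lookahead.findall('apple_1_0,peach_1_5')
row1 = with_lookahead.findall('pear_1_3')
row2 = with_lookahead.findall('mango_1_0,banana_1_0,pineapple_1_10')
print(row0, row1, row2)

# Hypothetical item where "_1_0" is followed by another digit
strict = with_lookahead.findall('kiwi_1_05')     # lookahead rejects it
loose = without_lookahead.findall('kiwi_1_05')   # matches the prefix 'kiwi_1_0'
print(strict, loose)
```

Without the (?=,|$) part, the pattern would happily cut an item short at _1_0, which is probably not what you want when the suffix is a number.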
We can also use a lambda:
df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(lambda items: ','.join(items) if len(items) > 0 else np.nan)
You're right, I forgot the split part. Nice. Do you need the word boundary \b in your regex?
I've changed it to [^,]* so that it also matches hyphenated words such as banana-fruit_1_0.
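For anyone wondering what difference that makes, here is a rough comparison. The earlier version of the regex is not shown in this thread, so the \b\w+_1_0\b pattern below is only a guess at what a word-boundary version might have looked like.

```python
import re

text = 'banana-fruit_1_0,pear_1_3'

# Hypothetical word-boundary pattern: \w does not match '-', so the item is cut at the hyphen
word_based = re.findall(r'\b\w+_1_0\b', text)

# Pattern from the answer: everything between commas belongs to the item
comma_based = re.findall(r'[^,]*_1_0(?=,|$)', text)

print(word_based)   # ['fruit_1_0']
print(comma_based)  # ['banana-fruit_1_0']
```

Since the items are delimited by commas, matching on "anything that is not a comma" seems the safer choice than relying on what counts as a word character.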