Python: replacing parts of strings in messy data (a faster alternative to repeated string replacement)?
I want to replace many values in the product variants:
Big Ben Personalized Products AVENGERS – Stark / 2 set 2
BigBen Personalized Products Expendables – Statham / 2 set 2
BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 5 set 2
BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 3 set 1
Personalized Toy 5 set 1
BIG BEN Personalized Machine 20.00% Off Auto renew (Versand jeden 3 Monate) Kids Toy / 3 set 1
BigBen Personalized Toy 20.00% Off Auto renew (Versand jeden 2 Monate) Kids Toy / 5 set 1
BigBen Personalized Toy 20.00% Off Auto renew (Versand jeden 2 Monate) Adults Toy / 5 set 1
BigBen Personalized Products 20.00% Off Auto renew (Versand jeden 5 Monate) Adults Toy / 5 set
Many of the product variants actually refer to the same value.
I would like to know whether there is a faster way than using:
df["product_variant"]= df["product_variant"].str.replace('BigBen Personalized', '',case = False)
df["product_variant"]= df["product_variant"].str.replace('Big Ben Personalized ', '',case = False)
df["product_variant"]= df["product_variant"].str.replace('BigBen Personalized', '',case = False)
df["product_variant"]= df["product_variant"].str.replace('Auto renew', '',case = False)
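One option is to create a specific pattern for these examples with 2 capture groups. For most of the items:
- Capture in group 1 the part after Products, or beginning at Adult or Kids, and before the / (if it is present)
- Capture in group 2 one or more digits followed by set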
^(?:big\s*ben personalized (?:products\s+)?(?:.*?(?=Adult|Kids))?|personalized\s+)(\w+(?: \w+)*(?: – \w+(?: \w+)*)?)(?: /)? (\d+ set)\b.*
In the replacement, use the 2 capture groups: \1 (\2)
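To see what each group captures, here is a minimal sketch that applies the pattern above with the standard re module to two of the sample strings. The names PATTERN and shorten are illustrative only, and re.IGNORECASE is an assumption made because the brand name appears with varying case in the data.

```python
import re

# The pattern from above, split over two raw strings for readability.
# Assumption: matching should be case-insensitive ("BIG BEN", "BigBen", ...).
PATTERN = re.compile(
    r"^(?:big\s*ben personalized (?:products\s+)?(?:.*?(?=Adult|Kids))?|personalized\s+)"
    r"(\w+(?: \w+)*(?: – \w+(?: \w+)*)?)(?: /)? (\d+ set)\b.*",
    re.IGNORECASE,
)


def shorten(text: str) -> str:
    """Return 'group 1 (group 2)'; the input is returned unchanged if nothing matches."""
    return PATTERN.sub(r"\1 (\2)", text)


samples = [
    "Big Ben Personalized Products AVENGERS – Stark / 2 set 2",
    "BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 5 set 2",
]
for s in samples:
    m = PATTERN.match(s)
    # group(1) is the variant name, group(2) is the "<n> set" part
    print(m.group(1), "|", m.group(2), "->", shorten(s))
```

Strings that do not fit the pattern pass through untouched, so unexpected rows are easy to spot afterwards.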
Example code:
import pandas as pd

# Sample data from the question
items = [
    "Big Ben Personalized Products AVENGERS – Stark / 2 set 2",
    "BigBen Personalized Products Expendables – Statham / 2 set 2",
    "BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 5 set 2",
    "BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 3 set 1",
    "Personalized Toy 5 set 1",
    "BIG BEN Personalized Machine 20.00% Off Auto renew (Versand jeden 3 Monate) Kids Toy / 3 set 1",
    "BigBen Personalized Toy 20.00% Off Auto renew (Versand jeden 2 Monate) Kids Toy / 5 set 1",
    "BigBen Personalized Toy 20.00% Off Auto renew (Versand jeden 2 Monate) Adults Toy / 5 set 1",
    "BigBen Personalized Products 20.00% Off Auto renew (Versand jeden 5 Monate) Adults Toy / 5 set "
]

df = pd.DataFrame(items, columns=["product_variant"])

# (?i) makes the match case-insensitive; \1 is the variant name, \2 the "<n> set" part
df["product_variant"] = df["product_variant"].replace(
    r'(?i)^(?:big\s*ben personalized (?:products\s+)?(?:.*?(?=Adult|Kids))?|personalized\s+)(\w+(?: \w+)*(?: – \w+(?: \w+)*)?)(?: /)? (\d+ set)\b.*',
    r'\1 (\2)',
    regex=True
)
print(df)
Output:
                 product_variant
0       AVENGERS – Stark (2 set)
1  Expendables – Statham (2 set)
2             Adults Toy (5 set)
3             Adults Toy (3 set)
4                    Toy (5 set)
5               Kids Toy (3 set)
6               Kids Toy (5 set)
7             Adults Toy (5 set)
8             Adults Toy (5 set)
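For comparison, if the goal is only to drop the literal fragments listed in the question, the chained calls can be collapsed into one alternation so the column is scanned once. This is a rough sketch, not a measured benchmark: the fragments list and the whitespace clean-up step are assumptions added for illustration.

```python
import re

import pandas as pd

# Literal fragments to drop (taken from the question). Longest first, so the
# more specific spelling wins if alternatives ever overlap.
fragments = ["Big Ben Personalized", "BigBen Personalized", "Auto renew"]
pattern = "|".join(re.escape(f) for f in sorted(fragments, key=len, reverse=True))

df = pd.DataFrame(
    {
        "product_variant": [
            "BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 5 set 2",
            "BIG BEN Personalized Machine 20.00% Off Auto renew Kids Toy / 3 set 1",
        ]
    }
)

cleaned = (
    df["product_variant"]
    .str.replace(pattern, "", case=False, regex=True)  # one pass for all fragments
    .str.replace(r"\s+", " ", regex=True)              # collapse leftover whitespace
    .str.strip()
)
print(cleaned.tolist())
```

Unlike the capture-group pattern, this only deletes text. It leaves the discount and the trailing numbers in place, so it suits cases where the remaining string is already the value you want.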