Python: splitting a string with exceptions
I split strings into rows, using a comma as the delimiter:
for col in df.columns[df.columns.str.contains(">")]:  # only column names containing ">"
    df[col] = df[col].str.split(", ")
    df = df.explode(col).reset_index(drop=True)
However, there are three substrings in which the comma occurs "naturally" and should not cause a split:
df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})
This is what the output should look like:
+-------------------------------------------------------------------------+
| col > 1                                                                 |
+-------------------------------------------------------------------------+
| Personals                                                               |
| Financials                                                              |
| Data related to sexual preferences, sex life, and/or sexual orientation |
| Personals                                                               |
| Financials                                                              |
| Vendors                                                                 |
| Procurement, subcontracting and vendor management                       |
+-------------------------------------------------------------------------+
You can pass a regex pattern with several negative lookbehind assertions to df[col].str.split(). In effect it says: "split the row on , unless the , is preceded by ...".
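To do this in Python, it is best to chain several negative lookbehind assertions. Python's re module requires lookbehinds to be fixed-width, so a single negative lookbehind containing alternatives of different lengths separated by | does not work.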
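The fixed-width restriction can be checked directly with the standard library. The following is a minimal sketch; the helper name `compiles` is made up for this illustration and is not part of any library.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised, for example, when a lookbehind is not fixed-width.
        return False

# One lookbehind whose alternatives differ in length (11 vs. 8 characters).
single = r"(?<!preferences|sex life),"

# The same exclusions written as separate lookbehinds, each fixed-width.
chained = r"(?<!preferences)(?<!sex life),"

print(compiles(single))
print(compiles(chained))
```

Each chained assertion is checked independently at the position of the comma, so together they behave like "not preceded by any of these phrases".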
To split on , unless it is preceded by any of the listed phrases, using the phrases from your example, you can use:
r"(?<!preferences)(?<!sex life)(?<!Contract)(?<!Procurement),"
Splitting and exploding the example column on this pattern gives:
col > 1
0 Personals
1 Financials
2 Data related to sexual preferences, sex life,...
3 Personals
4 Financials
5 Vendors
6 Procurement, subcontracting and vendor manage...
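When the list of protected phrases grows, the pattern can be generated instead of typed by hand. The sketch below uses only the standard library. The helper names `build_split_pattern` and `split_protected` are hypothetical, and the trailing `\s*` (which also consumes the whitespace after the comma) is an addition that is not in the pattern above.

```python
import re

def build_split_pattern(protected_phrases):
    """Build a pattern matching a comma that follows none of the given phrases."""
    # One fixed-width negative lookbehind per phrase; re.escape guards
    # against phrases that contain regex metacharacters.
    lookbehinds = "".join(f"(?<!{re.escape(p)})" for p in protected_phrases)
    # Assumption: also swallow whitespace after the comma so that the
    # resulting pieces carry no leading space.
    return lookbehinds + r",\s*"

def split_protected(text, protected_phrases):
    """Split text on commas, except those directly after a protected phrase."""
    return re.split(build_split_pattern(protected_phrases), text)

phrases = ["preferences", "sex life", "Contract", "Procurement"]

first = split_protected(
    "Personals, Financials, Data related to sexual preferences, sex life, "
    "and/or sexual orientation",
    phrases,
)
second = split_protected(
    "Vendors, Procurement, subcontracting and vendor management",
    phrases,
)
print(first)
print(second)
```

The generated string should also work as the argument to df[col].str.split(), since pandas treats a pattern longer than one character as a regular expression by default. Note that a lookbehind such as (?<!preferences) protects the comma after any text ending in "preferences", not only the exact phrase from the example.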