Python: return rows with no word beyond a certain maximum length, while keeping and filtering words with specific content
Tags: python, regex, pandas, string
Here is my dataframe.
Input
qid question_stemmed target question_length total_words
443216 56da6b6875d686b48fde mathfracint1x53x5 tantanboxedint1x01x2 sumvarp... 1 589 40
163583 1ffca149bd0a19cd714c mathoverbracesumvartheta8infty vecfracsumkappa... 1 498 31
522266 663c7523d48f5ee66a3e httpgooglecom check out the content of the www.. 0 449 66
522379 756678d3d48f5ee66a3e mark had a great day he plans to go fishing with. 0 310 23
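I am using the following logic to return only those records from df whose question_stemmed column satisfies:
- no word is longer than 15 characters (note: word length, not string length) (using negation)
- when the above condition is true, no word contains numeric values (using negation)
- words containing http or www are retained (the two conditions above still hold)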
df = df[(~df['question_stemmed'].str.len() > 15) & (~df['question_stemmed'].str.contains(r'[0-9]')) & (df.question_stemmed.str.match('^[^\http]*$'))]
I get this error:
error: bad escape \h at position 3
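The message comes from Python's re module rather than from pandas itself. As a minimal sketch (the variable names are illustrative, not from the question), compiling the same pattern on its own should reproduce it. Inside a character class, \h is not a recognised escape, and [^\http] would in any case mean "any character other than h, t or p" rather than "does not contain http".

```python
import re

# Same pattern as in the question; the raw-string prefix only avoids
# Python's own warning about the unknown string escape \h.
pattern = r'^[^\http]*$'

message = None
try:
    re.compile(pattern)
except re.error as exc:
    # The regex parser rejects the unknown escape \h inside [...]
    message = str(exc)

print(message)
```

A "contains http or www" test is usually written as a plain alternation such as http|www passed to str.contains, not as a negated character class.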
Expected output
qid question_stemmed target question_length total_words
522266 663c7523d48f5ee66a3e httpgooglecom check out the content of the www.. 0 449 66
522379 756678d3d48f5ee66a3e mark had a great day he plans to go fishing with. 0 310 23
Also, I would like to know whether the logic above satisfies all 3 conditions.
Thanks for your help.
I suggest using
df = df[~df['question_stemmed'].str.contains(r'(?<!\S)(?!\S*(?:http|www\.))\S{15}')]
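To see what this pattern does, here is a small sketch on made-up data. The first four strings are shortened from the question's table, and the fifth row is hypothetical, added only to exercise the URL exception. Note that, as written, the pattern tests length only: it flags any whitespace-delimited word of 15 or more characters that does not contain http or www. (with a literal dot), and it does not look for digits.

```python
import pandas as pd

df = pd.DataFrame({
    "question_stemmed": [
        "mathfracint1x53x5 tantanboxedint1x01x2 sumvarp",
        "mathoverbracesumvartheta8infty vecfracsumkappa",
        "httpgooglecom check out the content of the www",
        "mark had a great day he plans to go fishing with",
        "httpgooglecomsearchresults is a long url word",  # hypothetical row
    ]
})

# (?<!\S)                 start of a whitespace-delimited word
# (?!\S*(?:http|www\.))   the word must not contain http or www.
# \S{15}                  the word has at least 15 non-space characters
pattern = r'(?<!\S)(?!\S*(?:http|www\.))\S{15}'

has_long_word = df["question_stemmed"].str.contains(pattern)
kept = df[~has_long_word]
print(kept)
```

With this sample, the two math-like rows are flagged and the other three rows, including the long URL-like word, are kept.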
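The error is caused by the \h escape; there is no such string escape sequence.
Could you clarify? So you want the first two checks to ignore URLs? Can you provide the expected output for the df above?
@WiktorStribiżew - I have added the expected output; I hope that clarifies things. Basically, I want to filter out all rows that contain words longer than 15 characters with numeric content in them (e.g. mathfracint1x53x5), while making sure that words containing http or www are not filtered out.
Are you really going to analyze the question_stemmed column using values like 56da6b6875d686b48fde?
That is just the qid. The main content to analyze is the "question_stemmed" column (please excuse my bad formatting :/). Let me try to change it to make it more readable :D
Try df = df[~df['question_stemmed'].str.contains(r'(?...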
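For comparison, the rule described in the comments can also be spelled out word by word without a regex. This is only a sketch under one reading of that rule (drop a row if any word that is not URL-like is longer than 15 characters and contains a digit); the helper name has_offending_word and the constant MAX_LEN are hypothetical, and the sample strings are shortened from the question's table.

```python
import pandas as pd

MAX_LEN = 15  # threshold taken from the question


def has_offending_word(text: str) -> bool:
    """Return True if any non-URL word exceeds MAX_LEN and contains a digit."""
    for word in text.split():
        if "http" in word or "www" in word:
            continue  # URL-like words are always allowed
        if len(word) > MAX_LEN and any(ch.isdigit() for ch in word):
            return True
    return False


df = pd.DataFrame({
    "question_stemmed": [
        "mathfracint1x53x5 tantanboxedint1x01x2 sumvarp",
        "mathoverbracesumvartheta8infty vecfracsumkappa",
        "httpgooglecom check out the content of the www",
        "mark had a great day he plans to go fishing with",
    ]
})

mask = df["question_stemmed"].apply(has_offending_word)
filtered = df[~mask]
print(filtered)
```

Two details of the original attempt are worth keeping in mind when comparing: .str.len() measures the whole string rather than individual words, and ~ binds more tightly than >, so ~s.str.len() > 15 is evaluated as (~s.str.len()) > 15, which is False for every row because the bitwise complement of a non-negative length is negative.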