Python: return rows in which no word exceeds a maximum length, while keeping and filtering out words that contain specific content

python regex pandas string

This is my dataframe.

Input

        qid                   question_stemmed                                   target  question_length  total_words
443216  56da6b6875d686b48fde  mathfracint1x53x5 tantanboxedint1x01x2 sumvarp...  1       589              40
163583  1ffca149bd0a19cd714c  mathoverbracesumvartheta8infty vecfracsumkappa...  1       498              31
522266  663c7523d48f5ee66a3e  httpgooglecom check out the content of the www..   0       449              66
522379  756678d3d48f5ee66a3e  mark had a great day he plans to go fishing with.  0       310              23

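I am using the logic below to return only those records from df whose question text column (question_stemmed) meets these conditions:

No word should be longer than 15 characters (note: word length, not string length) (using negation)
When the condition above holds, no word should contain a numeric value (using negation)
Words containing http or www must be kept (the two conditions above still apply)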

df = df[(~df['question_stemmed'].str.len() > 15) & (~df['question_stemmed'].str.contains(r'[0-9]')) & (df.question_stemmed.str.match('^[^\http]*$'))]

I get this error

error: bad escape \h at position 3

Expected output

        qid                   question_stemmed                                   target  question_length  total_words
522266  663c7523d48f5ee66a3e  httpgooglecom check out the content of the www..   0       449              66
522379  756678d3d48f5ee66a3e  mark had a great day he plans to go fishing with.  0       310              23

Also, I would like to know whether the logic above satisfies all three conditions. Thanks for your help.
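As a rough check of the first condition, the sketch below compares three ways of writing the length test. The sample strings and the variable names are hypothetical and not taken from the question's data. It suggests that `~... .str.len() > 15` is evaluated as "bitwise NOT of the length, then compare", and that `.str.len()` measures the whole string rather than its words.

```python
import pandas as pd

# Hypothetical sample strings, chosen only to exercise the three variants.
s = pd.Series([
    "short words only here in this sentence",  # long string, short words
    "averyveryverylongword1 ok",               # contains a 22-character word
    "tiny",                                    # short string
])

# `~` binds tighter than `>`: this compares the bitwise NOT of each
# length (a negative number) with 15, so it is False for every row.
always_false = ~s.str.len() > 15

# Parenthesised version: correct negation, but it still measures the
# length of the whole string, not of the individual words.
string_level = ~(s.str.len() > 15)

# Word-level version: length of the longest whitespace-separated word.
longest_word = s.str.split().apply(lambda words: max(len(w) for w in words))
word_level = longest_word <= 15

print(pd.DataFrame({
    "always_false": always_false,
    "string_level": string_level,
    "word_level": word_level,
}))
```

Only the word-level mask keeps the first row, whose words are all short even though the string itself is longer than 15 characters.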

I suggest using

df = df[~df['question_stemmed'].str.contains(r'(?<!\S)(?!\S*(?:http|www\.))\S{15}')]
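To see what this pattern does, here is a minimal sketch on a small frame. The texts are shortened versions of the rows shown in the question, and the fifth row is a hypothetical addition that puts `http` inside a long word. Note that, as quoted, `\S{15}` matches any word of 15 or more non-whitespace characters, and the pattern does not test for digits.

```python
import pandas as pd

# Shortened versions of the question's rows, plus one hypothetical row
# (the last one) with a long word that contains "http".
df = pd.DataFrame({
    "qid": [
        "56da6b6875d686b48fde",
        "1ffca149bd0a19cd714c",
        "663c7523d48f5ee66a3e",
        "756678d3d48f5ee66a3e",
        "hypothetical_url_row",
    ],
    "question_stemmed": [
        "mathfracint1x53x5 tantanboxedint1x01x2 sumvarp",
        "mathoverbracesumvartheta8infty vecfracsumkappa",
        "httpgooglecom check out the content of the www",
        "mark had a great day he plans to go fishing with",
        "httpwwwexamplecomlongpath is kept",
    ],
})

# (?<!\S)                 -> start of a whitespace-separated word
# (?!\S*(?:http|www\.))   -> the word must not contain "http" or "www."
# \S{15}                  -> the word has at least 15 characters
PATTERN = r'(?<!\S)(?!\S*(?:http|www\.))\S{15}'

has_long_word = df["question_stemmed"].str.contains(PATTERN)
filtered = df[~has_long_word]

print(filtered["qid"].tolist())
```

In this sketch the two rows with long math-like words are dropped, while the rows with only short words and the row whose long word contains `http` are kept.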

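The error is caused by the \h escape; there is no such escape sequence. Could you clarify: so you want to skip the first two checks for URLs? Could you provide the expected output for the df above?

@WiktorStribiżew - I have added the expected output. I hope that clarifies things. I basically want to filter out all rows that contain words longer than 15 characters where those words contain numeric content (e.g. mathfracint1x53x5), while making sure that words containing http or www are not filtered out.

Do you really want to analyze the question_stemmed column using values like 56da6b6875d686b48fde?

That is just the qid. The main content to analyze is the "question_stemmed" column (please excuse my bad formatting :/). Let me try to change it to make it more readable :D

Try df = df[~df['question_stemmed'].str.contains(r'(?...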
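The escape error mentioned in the comments can be reproduced outside pandas. The sketch below compiles the pattern from the question with Python's `re` module and then shows, on two hypothetical words, that a character class such as `[^http]` excludes the individual characters h, t and p rather than the substring "http". The alternation `http|www` at the end is one possible way to test for the substrings, not necessarily what the answerer intended.

```python
import re

# Same characters as the pattern in the question; written as a raw
# string here only to avoid Python's own invalid-escape warning.
bad_pattern = r'^[^\http]*$'

try:
    re.compile(bad_pattern)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# Without the backslash the pattern compiles, but [^http] is a character
# class: "any character except h, t or p", not "does not contain http".
no_htp_chars = re.compile(r'[^http]*')
rejects_path = no_htp_chars.fullmatch("path") is None      # has p, t, h
accepts_google = no_htp_chars.fullmatch("google") is not None

# Assumed alternative: search for the substrings with alternation.
contains_url_marker = re.compile(r'http|www')
is_url_like = contains_url_marker.search("httpgooglecom") is not None
is_plain_word = contains_url_marker.search("fishing") is None
```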