Python: return rows without words above a maximum length, while keeping or filtering out words with specific content

Tags: python, regex, pandas, string

This is my dataframe

Input

        qid                     question_stemmed    target  question_length total_words
443216  56da6b6875d686b48fde    mathfracint1x53x5 tantanboxedint1x01x2 sumvarp...   1   589 40
163583  1ffca149bd0a19cd714c    mathoverbracesumvartheta8infty vecfracsumkappa...   1   498 31
522266  663c7523d48f5ee66a3e    httpgooglecom check out the content of the www..    0   449 66
522379  756678d3d48f5ee66a3e    mark had a great day he plans to go fishing with.   0   310 23
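I am using the logic below to return only those records from df whose question_stemmed column meets these conditions:

  • No word should be longer than 15 characters (note: word length, not string length) (using negation)
  • While the condition above holds, no word should contain a numeric value (using negation)
  • Words containing http or www must be kept (the two conditions above still apply)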
df = df[(~df['question_stemmed'].str.len() > 15) & (~df['question_stemmed'].str.contains(r'[0-9]')) & (df.question_stemmed.str.match('^[^\http]*$'))]

I get this error:
error: bad escape \h at position 3
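
The error can be reproduced without pandas. A minimal sketch using only the standard `re` module, with the pattern written as a raw string so that the backslash reaches the regex engine unchanged:

```python
import re

# The pattern passed to str.match in the question.
pattern = r'^[^\http]*$'

try:
    re.compile(pattern)
    error_message = None
except re.error as exc:
    # The regex engine rejects unknown escapes made of ASCII letters, such as \h.
    error_message = str(exc)

print(error_message)
```

Note that removing the backslash would not give a "does not contain http" test either: `[^http]` is a character class that matches any single character other than `h`, `t` or `p`.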

Expected output

        qid                     question_stemmed     target    question_length  total_words
522266  663c7523d48f5ee66a3e    httpgooglecom check out the content of the www..    0   449 66
522379  756678d3d48f5ee66a3e    mark had a great day he plans to go fishing with.   0   310 23
    
Also, I would like to know whether the logic above satisfies all 3 conditions. Thanks for your help.
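
Two details of the posted expression seem worth checking against the three conditions: `~` binds tighter than `>`, so `~x > 15` is evaluated as `(~x) > 15`, and `.str.len()` measures the whole string rather than each word. The sketch below is only an illustration. `keep_row` is a hypothetical helper, and it assumes the reading given in the asker's own clarification in the comments: a row is dropped when some word is longer than 15 characters and contains a digit, unless that word contains http or www.

```python
def keep_row(text, max_len=15):
    """Hypothetical helper: True if the row passes the word-level checks."""
    for word in text.split():
        if "http" in word or "www" in word:
            continue  # URL-like words are always kept
        if len(word) > max_len and any(ch.isdigit() for ch in word):
            return False  # a long word with digits rejects the row
    return True

# Operator precedence: this is (~589) > 15, i.e. -590 > 15.
precedence_result = ~589 > 15

rows = [
    "mathfracint1x53x5 tantanboxedint1x01x2 sumvarp...",
    "httpgooglecom check out the content of the www..",
    "mark had a great day he plans to go fishing with.",
]
kept = [keep_row(text) for text in rows]
print(precedence_result, kept)
```

On a dataframe, the same helper could be applied with `df[df['question_stemmed'].apply(keep_row)]`.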

I suggest using

df = df[~df['question_stemmed'].str.contains(r'(?<!\S)(?!\S*(?:http|www\.))\S{15}')]
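
To see which rows this pattern would drop, here is a minimal sketch that runs it with `re.search` on plain strings, which is how `str.contains` applies a regex by default. The first three strings come from the question; the last one is a made-up example of a long word containing http.

```python
import re

# The pattern from the suggestion above.
pattern = re.compile(r'(?<!\S)(?!\S*(?:http|www\.))\S{15}')

samples = [
    "mathfracint1x53x5 tantanboxedint1x01x2 sumvarp...",   # from the question
    "httpgooglecom check out the content of the www..",    # from the question
    "mark had a great day he plans to go fishing with.",   # from the question
    "see httpwwwexamplecomlongpath now",                   # hypothetical
]

# True means the row would be dropped by df[~...str.contains(pattern)].
dropped = [bool(pattern.search(text)) for text in samples]
print(dropped)  # [True, False, False, False]
```

`\S{15}` matches any run of 15 or more non-whitespace characters, so a word of exactly 15 characters is also dropped; `\S{16}` may be closer to "longer than 15" if that distinction matters. The pattern does not test for digits.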

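The error is caused by the \h escape: there is no such escape sequence.

Could you clarify? So you want to skip the first two checks for URLs? Can you provide the expected output for the df above?

@WiktorStribiżew - I have added the expected output. I hope that clarifies it. I basically want to filter out all rows that contain words longer than 15 characters with numeric content in them (e.g. mathfracint1x53x5), while making sure that words containing http or www are not filtered out.

Do you really want to analyze the question_stemmed column using values like 56da6b6875d686b48fde?

That is just the qid. The main content to analyze is the "question_stemmed" column (please excuse my bad formatting :/). Let me try to change it to make it more readable :D

Try df=df[~df['question_stemmed'].str.contains(r'(?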