Python: why does the string replace function not respect the \b word-boundary regex?


python regex pandas


I have this code to remove every rt (or retweet) from a given data series. However, it does not work, because I still see rt everywhere:

def pre_process(text):

    newdataset['tidytext'] = newdataset['text'].str.lower()
    newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'\brt\b', "")
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'@\w+', '', regex=True)
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)

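But it removes every rt, turning deport into "depo" and part into "pa".

Many thanks.

I have a sample screenshot of this data:

Apologies for the delay in uploading the sample file:

As you can see in the file, I picked different patterns, for example: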

RT:
RT@
RT 
RT inside the sentence

I also made sure to include words such as deport, part, article, and others, so that my problem can be pictured properly.

Many thanks.
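A small sketch of the difference the question is about, using only Python's re module. The sample strings are invented to mirror the patterns listed above (rt followed by a colon, by @, by a space, and inside a sentence) plus words that merely contain "rt"; they are not taken from the asker's file, and they are already lowercased.

```python
import re

samples = [
    "rt: breaking news",
    "rt@someone hello",
    "rt this is part of the article",
    "we should not deport them rt",
]

# Plain pattern: matches "rt" anywhere, including inside longer words.
naive = [re.sub(r"rt", "", s) for s in samples]

# Word-boundary pattern: matches "rt" only when it stands alone as a token.
bounded = [re.sub(r"\brt\b", "", s) for s in samples]

for before, n, b in zip(samples, naive, bounded):
    print(repr(before), "->", repr(n), "|", repr(b))
```

The plain pattern damages "part", "article", and "deport", while the word-boundary pattern leaves them intact and still removes rt before ":" and "@", because those characters are not word characters.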

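If you are using a pandas version earlier than 0.23.0 (from 0.23.0 onward it depends on the regex argument of replace), the replace method is a str method specific to a pandas Series, and it does not map directly onto the native str.replace method in Python.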
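The dependence on the regex argument can be checked directly. The sketch below assumes pandas 0.23.0 or later, where Series.str.replace accepts a regex argument, and runs the same pattern once as a regular expression and once as a literal string. The sample Series is made up for illustration. As far as I know, the default for regex changed from True to False in pandas 2.0, so passing it explicitly avoids relying on the installed version.

```python
import pandas as pd

s = pd.Series(["RT @user this is part of it", "deport rt now"])
lowered = s.str.lower()

# regex=True: the pattern is interpreted as a regular expression,
# so \b is a word boundary and only the standalone token "rt" is removed.
cleaned = lowered.str.replace(r"\brt\b", "", regex=True)

# regex=False: the pattern is the literal text backslash-b, r, t, backslash-b,
# which does not occur in the data, so nothing is replaced.
literal = lowered.str.replace(r"\brt\b", "", regex=False)

print(cleaned.tolist())
print(literal.tolist())
```

If rt still shows up everywhere after the call, a pattern that is being treated as a literal string is one plausible cause worth ruling out.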

I suggest compiling the regex outside the function so that the compiled regex can be reused across calls:

import re

rt_regex = re.compile(r'\brt\b')

def pre_process(text):
    newdataset['tidytext'] = newdataset['text'].str.lower()
    newdataset['tidytext'] = newdataset['tidytext'].str.replace(rt_regex, "")
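The same idea of compiling once and reusing the pattern can be sketched without pandas, on plain Python strings. The names RT_TOKEN, MENTION, and clean_tweet are hypothetical and chosen only for this illustration, and the two sample tweets are invented.

```python
import re

# Compiled once at import time and reused by every call to clean_tweet.
RT_TOKEN = re.compile(r"\brt\b")
MENTION = re.compile(r"@\w+")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet, then drop standalone 'rt' tokens and @mentions."""
    lowered = text.lower()
    without_rt = RT_TOKEN.sub("", lowered)
    return MENTION.sub("", without_rt)


tweets = ["RT @user: part of a bigger story", "Do not deport them, RT"]
cleaned = [clean_tweet(t) for t in tweets]
print(cleaned)
```

Python's re module keeps its own cache of recently compiled patterns, so the speed gain from compiling up front is usually modest; the main benefit is that the pattern is defined in one place.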

Hint 1.) Rewrite the user function as follows:

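def pre_process(s):
    s = s.str.lower()
    # regex=True makes the regex interpretation explicit instead of relying on the pandas default
    s = s.str.replace(r'\brt\b', "", regex=True)
    s = s.replace(r'@\w+', '', regex=True)
    s = s.replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)

    return s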

Returns:

ap my troops arrest ro suspects 6 buddhists killed httpapnewszqzoyhz 
my troops arrest ro suspects six buddhists killed accused httpnewspaperstread111326479 august 05 2017 at 0652pm ussupportll
my govnt probe finds no campaign of abuse against ro httpowlymdqb50dhfmk
my rejects allegations of human rights abuses against ro httpreutrs2wwuepg httptwittercomreutersstatus894153592306884608
this is part of a bigger problem we don’t need to deport them
north of ny is a good place to move into
this article is very sensationalist
you cant just  all of my tweetssome are part of a bigger storye18
calls for aearly morning prayer please

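Justification: after looking at your SampleTweet, I think the problem lies in how the methods are called rather than in the regex.

In the user function pre_process(text), the inner method calls operate on the DataFrame series referenced from within the scope of the user function; that is, they work on newdataset directly and never use the text argument.

By user function I mean the code you shared: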

def pre_process(text):

    newdataset['tidytext'] = newdataset['text'].str.lower()
    newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'\brt\b', "")
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'@\w+', '', regex=True)
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)

Hope that helps.

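@anky My bad... I did not understand the dataseries in the question. Deleting my comment and voting to reopen.

newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'(?i)(rt)', '') seems to work for me. Not sure why the word boundary is needed; you should include an example that shows what does not work. You could also look into stemming, just guessing.

@anky Thanks, but this also removes every rt, not only the standalone word "rt". It removes the rt in part, the rt in deport, the rt in sort. I will post some examples.

Then maybe try .str.replace(r'(?i)\brt\b', ""). Posting some failing sample text (not a picture) might help.

See the documentation you mentioned: "pat : str or compiled regex. String can be a character sequence or regular expression."

That depends on the version; for any pandas version before 0.23.0, the user has to compile the regex explicitly first. From 0.23.0 onward, pandas compiles it as a regex as long as the regex argument is True.

That still does not explain why r'\brt\b', which is really the string \brt\b, still affects words such as deport. Clearly the OP's code cannot produce the described result, whether or not the regex is compiled. So your answer is unrelated to the question at hand.

Nothing in the question indicates that \brt\b affects deport. In that case the replacement pattern was just rt, which seems to indicate that it was not being parsed as a regex. If the OP can expand the question and add a text case, I will delete the answer if it answers the wrong question.

See "But it removes every rt, turning deport into "depo" and part into "pa"".

Wow. Thank you so much, James!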

The rewritten pre_process(s) from the answer is then applied to the text column:

newdataset['tidytext'] = pre_process(newdataset['text'])