Python: replace a word in a string if it is within a certain number of words of another word


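I have a text column called "DESCRIPTION" in a data frame. I need to find all instances where "tile" or "tiles" occurs within 6 words of the word "roof", and then change "tile/s" to "rooftile/s". I need to do the same for "floor" and "tiles" (changing "tiles" to "floortiles"). This will help us tell which building trade we are looking at when certain words are used together with others.

To show what I mean, here is a sample of the data and my latest incorrect attempt: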

import re
import pandas as pd

s1=pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2=pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3=pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df=pd.DataFrame([list(s1), list(s2),  list(s3)],  columns =  ["DESCRIPTION"])
df

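The solution I want should look like this (in data frame format):

Here I tried using a regex pattern to replace the word "tiles", but it is completely wrong... Is there a way to do what I want? I am new to Python.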

regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)"
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION'])

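Update: solution. Thanks for your help! I used Jan's code and, with a few additions and tweaks, got it working. The final working code is below (using the real file and data rather than the sample):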

I'll show you a quick and incomplete implementation. You can certainly make it more robust and useful. Suppose s is one of your descriptions:

s = "I dropped the saw and it fell on the roof and damaged roof " +\
    "and some of the tiles"

Let's first break it into words (tokenize; you can strip punctuation if needed):
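The tokenization code itself is not shown in this answer. A minimal sketch, assuming a plain whitespace split is good enough (the name tokens is taken from the later steps; the choice of str.split() is an assumption, and a real tokenizer may treat punctuation differently):

```python
s = ("I dropped the saw and it fell on the roof and damaged roof "
     "and some of the tiles")

# Assumption: split on whitespace only, so punctuation stays attached to words.
tokens = s.split()

# Each word keeps its position in the list, which the later steps rely on.
print(tokens.index("roof"), tokens.index("tiles"))
```

With this split, the two occurrences of "roof" sit at positions 9 and 12 and "tiles" at position 17.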

Now pick out the tokens of interest and sort them alphabetically, while remembering their original positions in the token list:

my_tokens = sorted((w.lower(), i) for i,w in enumerate(tokens)
                    if w.lower() in ("roof", "tiles"))
#[('roof', 9), ('roof', 12), ('tiles', 17)]

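Group identical tokens and build a dictionary whose keys are the tokens and whose values are the lists of their positions. Use a dictionary comprehension: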

import itertools

# groupby only merges adjacent items, so my_tokens must already be sorted
token_dict = {name: [p0 for _, p0 in pos]
              for name,pos
              in itertools.groupby(my_tokens, key=lambda a:a[0])}
#{'roof': [9, 12], 'tiles': [17]}

Go through the list of positions of tiles (if there are any) and check whether there is a roof nearby; if so, change the word:

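# .get() avoids a KeyError when a description has no "tiles" or no "roof"
for i in token_dict.get('tiles', []):
    for j in token_dict.get('roof', []):
        if abs(i-j) <= 6:  # a "roof" is at most six tokens away
            tokens[i] = 'rooftiles'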

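The main problem you have is the .* in front of tiles in the regex. It lets any number of characters match there. The \b's are unnecessary, because they already sit on boundaries between whitespace and non-whitespace. The grouping () was not being used either, so I removed it.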

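r"roof\s+([^\s]+\s+){0,6}tiles" will only match roof when it is followed by tiles with at most 6 "words" (runs of non-whitespace characters separated by whitespace) in between. To do the replacement, take everything but the last 5 characters of the matched string, append "rooftiles", and then replace the matched string with the updated string. Alternatively, you could group everything except tiles in the regex with () and then replace that group with itself plus "roof". A re.sub with a plain literal replacement is not enough for this, because it would replace the whole match from roof to tiles rather than just the word tiles.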
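The grouping idea can be sketched as follows. This is only an illustration under stated assumptions: the pattern is a variation of the one described above with the inner group made non-capturing, the names pattern and tag_rooftiles are made up for the example, and the replacement puts the captured text back through a backreference.

```python
import re

# Capture everything from "roof" up to (but not including) "tiles".
pattern = re.compile(r"(roof\s+(?:\S+\s+){0,6})tiles")

def tag_rooftiles(text):
    # \g<1> re-inserts the captured words, so only "tiles" itself changes.
    return pattern.sub(r"\g<1>rooftiles", text)

near = "After the storm the roof was damaged and some of the tiles are missing"
far = "the roof was leaking and when I checked I saw that some of the tiles were cracked"

print(tag_rooftiles(near))
print(tag_rooftiles(far))
```

In the first sentence six words separate roof and tiles, so the word is rewritten; in the second there are eleven words in between, so the text is left unchanged.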

I could generalize this to more substrings than just "roof" and "floor", but this seems like simpler code:

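# Assumes df has the default 0..n-1 index, so enumerate() positions match .loc labels.
for idx,r in enumerate(df.loc[:,'DESCRIPTION']):
    if "roof" in r and "tile" in r:
        # text that follows the first "roof"
        fill=r[r.find("roof")+4:]
        # cut at the 8th space; with fewer spaces find() returns -1,
        # which only drops the last character
        fill = fill[0:fill.replace(' ','_',7).find(' ')]
        # ignore the window if it crosses a full stop
        sixWords = fill if fill.find('.') == -1 else ''
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "rooftile"))
    elif "floor" in r and "tile" in r:
        # same steps for "floor"
        fill=r[r.find("floor")+5:]
        fill = fill[0:fill.replace(' ','_',7).find(' ')]
        sixWords = fill if fill.find('.') == -1 else ''
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "floortile"))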

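Note that this also includes a check for full stops ("."). You can remove it by deleting the sixWords variable and using fill in its place.

You could use a solution with a regular expression here: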

(                      # outer group
    \b(floor|roof)     # floor or roof
    (?:\W+\w+){1,6}\s* # one to six "words"
)
\b(tiles?)\b           # tile or tiles


Then simply put the captured parts back together with rx.sub() and apply it to every item of the DESCRIPTION column, which gives you the following code:

import re
import pandas as pd

s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])

df = pd.DataFrame([list(s1), list(s2),  list(s3)],  columns =  ["DESCRIPTION"])

rx = re.compile(r'''
            (                      # outer group (group 1)
                \b(floor|roof)     # floor or roof (group 2)
                (?:\W+\w+){1,6}\s* # one to six "words"
            )
            \b(tiles?)\b           # tile or tiles (group 3)
            ''', re.VERBOSE)

# apply it to every row of "DESCRIPTION":
# \1 restores the matched text, \2\3 glues floor/roof onto tile(s)
df["DESCRIPTION"] = df["DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
print(df["DESCRIPTION"])

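Note that, although your original question is not entirely clear on this, this solution will only find tile or tiles after roof, which means that a sentence like "Can you give me the tiles for the roof?" will not be matched (even though the word tiles is within six words of roof, that is).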
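If matches in both directions are needed, one option is to compare token positions instead of using a regex. The following is a minimal sketch, not part of any answer here: the function name tag_tiles and its parameters anchors and window are made up for illustration, and window counts the distance between token positions (so six words in between correspond to window=7).

```python
import re

def tag_tiles(text, anchors=("roof", "floor"), window=6):
    """Prefix tile/tiles with the closest anchor word found within
    `window` token positions on either side."""
    tokens = text.split()
    # Normalised copies for comparison only: lowercase, no edge punctuation.
    norm = [re.sub(r"^\W+|\W+$", "", t).lower() for t in tokens]
    anchor_pos = [(i, w) for i, w in enumerate(norm) if w in anchors]
    for i, w in enumerate(norm):
        if w in ("tile", "tiles"):
            near = [(abs(i - j), a) for j, a in anchor_pos
                    if abs(i - j) <= window]
            if near:
                _, anchor = min(near)  # closest anchor wins
                # Rewrite the original token so its punctuation is kept.
                tokens[i] = re.sub(r"(?i)tiles?",
                                   lambda m: anchor + m.group(0),
                                   tokens[i], count=1)
    return " ".join(tokens)

print(tag_tiles("Can you give me the tiles for the roof?"))
```

Because the text is rebuilt with " ".join(), runs of several spaces collapse to one; that is a side effect of this sketch rather than a requirement of the approach.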

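Can you post what you want as the output?

Thank you, DYZ! I got this to work on the test set, but I ran into a bit of trouble when I tried to run it on the csv file... I found Jan's solution easier to implement.

Thank you Jan! This works perfectly! I see what you mean about the regex not working in both directions... I found a way around it by simply running the code twice... not sure if that is the best approach, but it seems to work! I have posted the final code I used as an update. Thank you for your help!

However, I am getting an error: TypeError: argument of type 'float' is not iterable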
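That TypeError is what Python raises when the in operator is applied to a float, and pandas stores missing text cells as the float NaN, so empty rows in the csv file are a likely cause (an assumption, since the real data is not shown). A minimal sketch of a guard, using the regex from the answer above written on one line and a made-up helper name tag_row:

```python
import re

# Same pattern as the verbose regex above, written on one line.
rx = re.compile(r"(\b(floor|roof)(?:\W+\w+){1,6}\s*)\b(tiles?)\b")

def tag_row(value):
    # Leave non-string cells (such as NaN) untouched.
    if not isinstance(value, str):
        return value
    return rx.sub(r"\1\2\3", value)

rows = ["the floor has some cracked tiles", float("nan")]
cleaned = [tag_row(v) for v in rows]
print(cleaned[0])
```

On a data frame the same helper could be passed to df["DESCRIPTION"].apply(tag_row).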

In the token-based answer, the last step is to join the tokens back into a string:

' '.join(tokens)
#'I dropped the saw and it fell on the roof and damaged roof and some of the rooftiles'
