How to find & replace URI fragments with Regex in Python?


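Hello,

I am trying to find and replace URI fragments in a text file, but I have no idea how to do it.

Some resources begin with a URL (e.g. http://www.example.com/{fragment}), while others begin with a defined prefix (e.g. example:{fragment}). Both forms refer to the same object, so any change to one instance must be applied to every instance of both the prefixed form and the URL form, and vice versa.

Here is an example:

Every time http://www.example.com/Example_1 or example:Example_1 shows up, I want to replace all occurrences of the fragment Example_1 in the file with a UUID (e.g. 186e4707_afc8_4d0d_8c56_26e595eba8f0), so that all occurrences end up as http://www.example.com/186e4707_afc8_4d0d_8c56_26e595eba8f0 or example:186e4707_afc8_4d0d_8c56_26e595eba8f0.

This needs to be done for every unique fragment in the file, which means Example_2, Example_3 and so on each get a different UUID.

So far, I have come up with this line of regex: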

((?

You can use the re (Regex) module to replace matched patterns. Let's see:
import re
re.sub(pattern, repl, string, count=0, flags=0)

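You can pass a function to re.sub through the repl parameter, as described in the re documentation. That way you can handle every match with your own set of rules.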
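As a minimal sketch of the callable-repl idea, the function below keeps a dictionary so that each name gets one UUID and reuses it on later occurrences. The helper name replace_fragments and the sample string are illustrative assumptions, not part of the original post.

```python
import re
import uuid


def replace_fragments(text):
    """Replace each name after 'archive:' with a UUID, one UUID per name."""
    cache = {}  # fragment name -> UUID string (hypothetical helper state)

    def repl(match):
        name = match.group(1)
        # Create a UUID the first time a name is seen, reuse it afterwards.
        if name not in cache:
            cache[name] = str(uuid.uuid4())
        return "archive:" + cache[name]

    # re.sub calls repl once per match and inserts its return value.
    return re.sub(r"archive:(\S+)", repl, text), cache


# Sample input made up for illustration.
sample = (
    "archive:Abbreviation rdf:type owl:Class ;\n"
    "    rdfs:subClassOf archive:Word .\n"
    "archive:Abbreviation rdfs:subClassOf archive:Word ."
)
new_text, mapping = replace_fragments(sample)
print(new_text)
```

Because the whole text is scanned once, the cost does not grow with the number of distinct names.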
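EDIT
Edited according to the comments. archive:…
Find the matches first, then replace them one by one, so that identical matches at different places in the file get the same UUID.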
import uuid
import re


def main():
    text = """  ###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                    rdfs:subClassOf archive:Word .
###  http://archive.semantyk.com/Ability
archive:Ability rdf:type owl:Class ;
            rdfs:subClassOf archive:Quality .
                ###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                    rdfs:subClassOf archive:Word ."""

    # First, find what needs to be changed.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text)
    # Convert the list to a set to drop repeated matches.
    unique_matches = set(matches)

    # Replace each unique match with its own UUID. Identical matches get
    # the same UUID everywhere in the text.
    for match in unique_matches:
        # re.escape keeps regex metacharacters in the name literal; the
        # lookahead stops a name from matching the start of a longer one.
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, str(uuid.uuid4()), text)

    print(text)


if __name__ == "__main__":
    main()
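When a pattern is assembled from a matched name, as in the loop above, two details are worth checking: the name should be escaped, and the pattern should not match the beginning of a longer name. The snippet below is a small illustration of both; the names Word, WordList and C++ and the helper exact are made up for the demonstration.

```python
import re

# Made-up input: 'Word' is a prefix of 'WordList', and 'C++' contains
# characters that have a special meaning in a regular expression.
text = "archive:Word archive:WordList archive:C++"

# Plain concatenation: 'Word' also matches the start of 'WordList'.
naive = re.sub(r"(?<=archive:)(" + "Word" + ")", "X", text)


def exact(name):
    """Hypothetical helper: match 'name' only as a complete token."""
    # re.escape makes every character literal; (?!\S) requires that
    # whitespace or the end of the text follows the name.
    return r"(?<=archive:)" + re.escape(name) + r"(?!\S)"


safe = re.sub(exact("Word"), "X", text)
safe_cpp = re.sub(exact("C++"), "Y", text)

print(naive)     # archive:X archive:XList archive:C++
print(safe)      # archive:X archive:WordList archive:C++
print(safe_cpp)  # archive:Word archive:WordList archive:Y
```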

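Do you mean using re.sub?

That is not quite the case. The original file has more than 32,000 lines containing more than 4,000 different resources, each with a different fragment. Every resource or fragment occurs at least twice in the file, and most occurrences are not adjacent. There are about 20,000 occurrences in total (many belong to the same resource, but they are still separate instances), so I don't think this quite fits. Is there a workaround?

The regex needs to be changed to something like ((?

There will always be a counterpart, because every resource is declared in the file with the prefix "archive", right after a comment line that carries the full URI of the resource. As you can see in the first and second lines of the code in the edited main post, the resource "Abbreviation" appears both in the full URI form and in the prefixed form. The same holds for every other resource in the file.

In fact, if we just fix the "archive:" occurrences first, the program I am using will update the full URIs automatically.

So we don't need to change the URI part in the comments? That is a simple fix to the code I gave. I can edit the post if you like.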
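For a file of the size mentioned in the comments, one possible alternative is a single pass that covers both the full URI form and the prefixed form with one alternation. This is only a sketch under stated assumptions: the base URL and prefix are taken from the sample in the post, fragment names are assumed to consist of letters, digits, "_" and "-", and the names PATTERN and rename_all are hypothetical.

```python
import re
import uuid

# Assumption: fragments contain only word characters and hyphens.
# Group 1 is the URL base or the prefix, group 2 is the fragment name.
PATTERN = re.compile(r"(http://archive\.semantyk\.com/|archive:)([\w-]+)")


def rename_all(text):
    """Rename every fragment in both forms, one UUID per fragment name."""
    mapping = {}  # fragment name -> UUID with underscores

    def repl(m):
        head, name = m.group(1), m.group(2)
        if name not in mapping:
            mapping[name] = str(uuid.uuid4()).replace("-", "_")
        # Keep the URL base or prefix, swap only the fragment.
        return head + mapping[name]

    return PATTERN.sub(repl, text), mapping


# Sample lines shaped like the excerpt in the post.
sample = (
    "###  http://archive.semantyk.com/Abbreviation\n"
    "archive:Abbreviation rdf:type owl:Class ;\n"
    "    rdfs:subClassOf archive:Word ."
)
new_text, mapping = rename_all(sample)
print(new_text)
```

Running it a second time on its own output would rename the UUIDs again, so it is meant to be applied once to the original file.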
import re
import uuid

def generateUUID():
    identifier = uuid.uuid4().hex
    identifier = identifier[0:8] + '_' + identifier[8:12] + '_' + identifier[12:16] + '_' + identifier[16:20] + '_' + identifier[20:]
    print('Generated UUID: ' + identifier)
    return identifier

def main():
    with open('{path}', 'r') as file:
        text = file.read()
    # First, find what needs to be changed.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text)
    # Convert the list to a set to drop repeated matches.
    unique_matches = set(matches)

    # Replace each unique name with its own UUID. Identical names get
    # the same UUID everywhere in the file.
    for match in unique_matches:
        # re.escape keeps regex metacharacters in the name literal; the
        # lookahead stops a name from matching the start of a longer one.
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, generateUUID(), text)

    with open('{path}', 'w') as file:
        file.write(text)

main()
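The underscore-separated identifier built by slicing the hex string can probably be produced more briefly by replacing the hyphens in the standard string form. The sketch below checks that both routes give the same result for the UUID used as an example in the question; the function name underscore_uuid is an assumption for illustration.

```python
import uuid


def underscore_uuid():
    """Return a random UUID with '_' instead of '-' as separator."""
    return str(uuid.uuid4()).replace("-", "_")


# Compare both formatting routes on one fixed UUID.
u = uuid.UUID("186e4707-afc8-4d0d-8c56-26e595eba8f0")
h = u.hex
sliced = "_".join([h[0:8], h[8:12], h[12:16], h[16:20], h[20:]])
replaced = str(u).replace("-", "_")

print(sliced)    # 186e4707_afc8_4d0d_8c56_26e595eba8f0
print(replaced)  # 186e4707_afc8_4d0d_8c56_26e595eba8f0
```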
