Python: find all URLs that are not already wrapped in <a> tags


I'm trying to work out a regex that detects URLs in text, except those that are already wrapped in an <a> tag, and wraps them in <a> tags.

input: "http://google.sk this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"

input: "<a href="http://google.sk">http://google.sk</a> this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"


The following helped me a lot, but it does not account for URLs that are already wrapped:

import re

def fix_urls(text):
    # re.VERBOSE is passed as a flag: newer Python versions reject an inline
    # (?x) unless it is the very first thing in the pattern.
    pat_url = re.compile(r'''
        (                          # identify URLs within text
          (https|http|ftp|gopher)  # make sure we find a resource type
          ://                      # ...needs to be followed by colon-slash-slash
          (\w+[:.]?){2,}           # at least two domain groups, e.g. (gnosis.)(cx)
          (/?|                     # could be just the domain name (maybe w/ slash)
            [^ \n\r"]+             # or a run of non-space, non-newline, non-quote chars
            [\w/])                 # resource name ends in alphanumeric or slash
          (?=[\s\.,>)'"\]])        # assert: followed by whitespace or clause ending
        )                          # end of match group
        ''', re.VERBOSE)

    # url[0] is the outer group, i.e. the whole URL
    for url in re.findall(pat_url, text):
        text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url": url[0]})

    return text


If the text already contains any <a> tags, this function wraps those URLs again, which I don't want. Any idea how to do this?

Use a negative lookbehind to check that href=" does not directly precede your URL (see the second line of the pattern further down).
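Don't do this with a regex alone. Use an HTML parser to find the text nodes and edit only those; that avoids URLs that already sit in HTML attributes or between <a> tags.
Post the actual input text.
If you really must use a regex (you don't need to), you can use a simple pattern (?)?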
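The parser-based approach suggested in the comment could be sketched roughly as follows. This is only an illustration under several assumptions: the comment names no specific parser, so the standard library's html.parser is used here; the names Linkifier, linkify and URL_RE are hypothetical; the URL pattern is deliberately simplified; and the input is assumed to be a small HTML fragment without <script> or <style> content.

```python
import html
import re
from html.parser import HTMLParser

# Deliberately simple URL pattern, assumed for this sketch only.
URL_RE = re.compile(r'(?:https?|ftp|gopher)://[^\s<>"]+[\w/]')


class Linkifier(HTMLParser):
    """Re-emit the markup, wrapping URLs found in text nodes outside <a>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.a_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # self-closing tag such as <br/>

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth:
            self.a_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_data(self, data):
        if self.a_depth:
            # Link text of an existing <a>: re-escape it and leave it alone.
            self.out.append(html.escape(data, quote=False))
            return
        pos = 0
        for m in URL_RE.finditer(data):
            url = m.group(0)
            self.out.append(html.escape(data[pos:m.start()], quote=False))
            self.out.append('<a href="%s">%s</a>'
                            % (html.escape(url), html.escape(url, quote=False)))
            pos = m.end()
        self.out.append(html.escape(data[pos:], quote=False))


def linkify(markup):
    parser = Linkifier()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

With this sketch, linkify applied to the two inputs from the question gives the two expected results: the bare URL is wrapped, and the already linked text comes back unchanged, because neither the href attribute nor the link text is ever handed to the URL pattern.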
Pattern with the negative lookbehind:

(?x)                 # verbose mode
(?<!href=\")         # don't match URLs that are already inside an href
(https?|ftp|gopher)  # make sure we find a resource type
://                  # ...needs to be followed by colon-slash-slash
((?:\w+[:.]?){2,})   # at least two domain groups, e.g. (gnosis.)(cx); capture group fixed
(/?|                 # could be just the domain name (maybe w/ slash)
[^ \n\r\"]+          # or a run of characters other than space, newline, CR, or quote
[\w\/])              # resource name ends in alphanumeric or slash
(?=[\s\.,>)'\"\]])   # assert: followed by whitespace or clause ending
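To try the lookbehind pattern out, it can be compiled and applied with re.sub. The following is a minimal sketch, not the answerer's code: the names PAT_URL and wrap_bare_urls are hypothetical, the groups are made non-capturing so the whole match is the URL, and re.sub is swapped in for the findall/replace loop from the question.

```python
import re

# The lookbehind pattern, compiled with re.VERBOSE. Groups are non-capturing
# here so that group 0 (the whole match) is the URL.
PAT_URL = re.compile(r'''
    (?<!href=")             # do not match a URL that directly follows href="
    (?:https?|ftp|gopher)   # resource type
    ://                     # colon-slash-slash
    (?:\w+[:.]?){2,}        # at least two domain groups
    (?:/?|                  # just the domain name, maybe with a slash
       [^ \n\r"]+           # or a run of non-space, non-newline, non-quote chars
       [\w/])               # ending in alphanumeric or slash
    (?=[\s.,>)'"\]])        # followed by whitespace or clause ending
    ''', re.VERBOSE)


def wrap_bare_urls(text):
    # \g<0> is the whole match; sub() rewrites each match exactly once.
    return PAT_URL.sub(r'<a href="\g<0>">\g<0></a>', text)


if __name__ == "__main__":
    print(wrap_bare_urls("http://google.sk this is an url"))
    print(wrap_bare_urls('<a href="http://google.sk">Google</a> and http://example.com, too'))
```

The lookbehind only guards the href attribute. A URL that is used as the link text, as in the second input from the question, is not preceded by href=" and is still matched, so for that input a parser-based approach is the safer choice.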