
Python - Find all URLs not already surrounded by an <a> tag


Trying to figure out a regex which detects URLs in text, except those already surrounded by an <a> tag, and surrounds them with the tag:

input: "http://google.sk this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"

input: "<a href="http://google.sk">http://google.sk</a> this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"
This helped me a lot, but it does not take URLs that are already surrounded into account:

import re

def fix_urls(text):
    pat_url = re.compile(  r'''
                     (?x)( # verbose identify URLs within text
         (https|http|ftp|gopher) # make sure we find a resource type
                       :// # ...needs to be followed by colon-slash-slash
            (\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
                      (/?| # could be just the domain name (maybe w/ slash)
                [^ \n\r"]+ # or stuff then space, newline, tab, quote
                    [\w/]) # resource name ends in alphanumeric or slash
         (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                         ) # end of match group
                           ''')

    for url in re.findall(pat_url, text):
       text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url" : url[0]})

    return text
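For reference, a quick check of the function above on the question's two sample inputs (a hypothetical snippet, not part of the original post) makes the problem concrete:

print(fix_urls("http://google.sk this is an url"))
# <a href="http://google.sk">http://google.sk</a> this is an url   (the desired result)

print(fix_urls('<a href="http://google.sk">http://google.sk</a> this is an url'))
# the URL that is already linked gets wrapped a second time, producing nested <a> tags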

If there are any <a> tags already in the text, this function wraps those URLs again, which I don't want. Any idea how to do it?

Use a negative lookbehind to check that href=" is not right in front of your URL (second line):

(?x)         # verbose
(?<!href=\") # don't match URLs already inside hrefs
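Dropped into the question's function, that suggestion could look like the sketch below (an illustrative rewrite, not the answerer's exact code). re.VERBOSE is passed explicitly because newer Python versions reject an inline (?x) that is not at the very start of the pattern, and re.sub is used so only the matched occurrences get wrapped. Note the lookbehind only guards the copy inside href="..."; a URL repeated as the visible link text is still matched, which is what the parser-based suggestion below is about.

import re

pat_url = re.compile(r'''
    (?<!href=")             # don't match URLs already inside an href="..."
    (                       # capture the whole URL
      (?:https?|ftp|gopher) # resource type
      ://                   # ...followed by colon-slash-slash
      (?:\w+[:.]?){2,}      # at least two domain groups, e.g. (gnosis.)(cx)
      (?:/?|                # could be just the domain name (maybe w/ slash)
       [^ \n\r"]+           # or stuff, ended by space, newline, tab, quote
       [\w/])               # resource name ends in alphanumeric or slash
    )
    (?=[\s.,>)'"\]])        # assert: followed by whitespace or clause ending
''', re.VERBOSE)

def fix_urls(text):
    # wrap only what the pattern matched, instead of str.replace on every copy
    return pat_url.sub(r'<a href="\1">\1</a>', text)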

Don't do this with regex alone. Use an HTML parser to find the text nodes and edit only those; that way you avoid URLs that already sit in HTML attributes or between <a> tags (a parser-based sketch follows the full pattern below). Post the actual input text. If you really must use a regex (you don't), you can use the simple lookbehind pattern (?<!href=\") as in the full version below:
(?x) # verbose
(?<!href=\") #don't match already inside hrefs
(https?|ftp|gopher) # make sure we find a resource type
:// # ...needs to be followed by colon-slash-slash
((?:\w+[:.]?){2,}) # at least two domain groups, e.g. (gnosis.)(cx) fixed capture group*
(/?| # could be just the domain name (maybe w/ slash)
[^ \n\r\"]+ # or stuff then space, newline, tab, quote
[\w\/]) # resource name ends in alphanumeric or slash
(?=[\s\.,>)'\"\]]) # assert: followed by white or clause ending
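The parser-based route could look roughly like the sketch below. BeautifulSoup is an assumption (the answer only says to use an HTML parser), and the URL pattern is deliberately simplified; the point is that only text nodes sitting outside existing <a> tags ever get edited.

import re
from bs4 import BeautifulSoup

# simplified URL pattern, just for this sketch
URL_RE = re.compile(r'''(https?|ftp|gopher)://[^\s<>"']+[\w/]''')

def fix_urls_with_parser(html):
    soup = BeautifulSoup(html, "html.parser")
    # work on text nodes only, so markup and attributes are never touched
    for text_node in list(soup.find_all(string=True)):
        if text_node.find_parent("a"):
            continue                                  # already inside a link, skip it
        text = str(text_node)
        pieces, last = [], 0
        for m in URL_RE.finditer(text):
            if text[last:m.start()]:
                pieces.append(text[last:m.start()])   # plain text before the URL
            a = soup.new_tag("a", href=m.group(0))
            a.string = m.group(0)                     # wrap the URL itself
            pieces.append(a)
            last = m.end()
        if not pieces:
            continue                                  # no URLs in this text node
        if text[last:]:
            pieces.append(text[last:])                # trailing plain text
        for piece in pieces:
            text_node.insert_before(piece)
        text_node.extract()
    return str(soup)

print(fix_urls_with_parser("http://google.sk this is an url"))
# <a href="http://google.sk">http://google.sk</a> this is an url

print(fix_urls_with_parser('<a href="http://google.sk">http://google.sk</a> this is an url'))
# unchanged: the URL inside the existing <a> tag is left alone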