Python: add a link to every word, taking punctuation, contractions, and Unicode into account

python regex python-2.7

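I want to add a link to every word in a text.

Sample text: "He's <i>certain</i> in America's “West,” it could've been possible for gunfights to erupt at any time anywhere," he said.

Expected result: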

"<a href='xxx.com?word=he'>He</a>'s
 <i><a href='xxx.com?word=certain'>certain</a></i>
 <a href='xxx.com?word=in'>in</a>
 <a href='xxx.com?word=america'>America</a>'s
 “<a href='xxx.com?word=west'>West</a>,”
 <a href='xxx.com?word=it'>it</a>
 <a href='xxx.com?word=could'>could</a>'ve
.... etc

I split the output across several lines to make it easier to read. The actual output should be a single string, for example:

 "<a href='xxx.com?word=he'>He</a>'s <i><a href='xxx.com?word=certain'>certain</a></i> <a href='xxx.com?word=in'>in</a> <a href='xxx.com?word=america'>America</a>'s “<a href='xxx.com?word=west'>West</a>,” <a href='xxx.com?word=it'>it</a> <a href='xxx.com?word=could'>could</a>'ve ... etc

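Every word should get a link whose target is the word itself, without punctuation or contractions. The link target is lowercase. Punctuation and contraction endings should not be linked. The words and punctuation are UTF-8 and contain many Unicode characters. The only HTML elements the text will contain are <i> and </i>, so this is not HTML parsing, just a single tag pair. The link should go on the word inside the tags.

My code below works for simple test cases, but it breaks on longer, real text that contains repeated words and tags:

My questions: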

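How do I handle repeated words in a sentence, whether they are exact repeats, repeats that differ in punctuation and/or capitalization (He's, he), or partial words (gun in gunfights, any in anywhere)? It would be easier if the text split cleanly on whitespace, but after stripping the contractions and then splitting on punctuation, I don't know how to cleanly substitute the linked words back into the string.

My regex for stripping contractions works for single letters such as 'm and 'd, but not for 've and 're.

I don't know how to handle the tags, for example, how to make sure the link ends up on the word inside <i> and </i>.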
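Both problems described above can be reproduced in isolation. The following is only a minimal sketch: `naive_link` is a hypothetical helper that mimics the loop-and-replace approach, and `fixed` is one possible repair of the contraction pattern, not something taken from the original posts.

```python
# -*- coding: utf-8 -*-
import re


def naive_link(text, words):
    """Hypothetical helper mimicking a loop of str.replace(word, link, 1)."""
    for w in words:
        linked = "<a href='xxx.com?word=" + w.lower() + "'>" + w + "</a>"
        # replace() works on raw substrings, not whole words, and always
        # hits the first occurrence counted from the start of the string.
        text = text.replace(w, linked, 1)
    return text


# "gun" is found inside "gunfights" first, so the wrong occurrence is linked:
# <a href='xxx.com?word=gun'>gun</a>fights, then a gun
wrong = naive_link("gunfights, then a gun", ["gun"])

# Inside [...] the characters ( ) | are literals, so this class matches at
# most ONE character; two-letter endings such as 've can never match.
broken = re.compile(r"'[(s|d|l|m|(ve)|(re)]? ")

# One possible repair (an assumption of this sketch): a real alternation,
# closed with \b instead of a literal space.
fixed = re.compile(r"'(?:s|d|ll|m|ve|re)\b")
```

With these definitions, `broken.search("could've been")` returns `None`, while `fixed.search("could've been")` finds `'ve`.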

I am doing this in Python 2.7. There is a similar question for JavaScript that works with Unicode, but it does not take my issues, such as punctuation, into account.

Regular expressions can help you here.

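To match a word of arbitrary length, use \w+. To skip the lone <i> and </i> tags, add a negative lookahead: (?!i>). This covers both the opening and the closing tag. Finally, to skip the right-hand side of a contraction, add a negative lookbehind before the match: (?<!').

How does your code fail to handle case and repeated words? That is, what do you get now? At first glance, a simple substitution such as re.sub(r"(?!i>)\w+", r"…", s) should work fine.

Natural language is full of demons; don't reinvent the wheel. Look into using an existing library for this.

@RadLexus: My problem with repeated words came from using a loop to replace each word found, so it found "gun" inside "gunfights", which is not what I want. Thanks for your clever idea: it works nicely, but it also links things like 've and 'd; as in the example, I am trying not to put links on those. Also, the href must be the word in all lowercase, while the text itself stays unchanged everywhere.

Wow, very nice! Thanks for the clever code, regex wizard :). And thanks for the explanation, I learned a lot.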
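The pieces of the pattern described in the answer can be tried one at a time with re.findall. This is only an illustrative sketch on a shortened sample string; the final step, which adds \b, is an assumption of this sketch and covers a case the lookbehind alone lets through.

```python
import re

s = "I'm <i>certain</i> it could've"

# Step 1: plain \w+ also picks up contraction tails and the tag name "i".
step1 = re.findall(r"\w+", s)
# ['I', 'm', 'i', 'certain', 'i', 'it', 'could', 've']

# Step 2: (?!i>) refuses to start a match at the "i" of <i> or </i>.
step2 = re.findall(r"(?!i>)\w+", s)
# ['I', 'm', 'certain', 'it', 'could', 've']

# Step 3: (?<!') refuses to start a match right after an apostrophe.
# A two-letter tail can still match from its second letter ("e" of 've).
step3 = re.findall(r"(?<!')(?!i>)\w+", s)
# ['I', 'certain', 'it', 'could', 'e']

# Step 4: \b forces the match to begin at a word boundary, so the
# leftover "e" is no longer accepted.
step4 = re.findall(r"(?<!')(?!i>)\b\w+", s)
# ['I', 'certain', 'it', 'could']
```

Note that the lookbehind only checks for the straight apostrophe ('); text that uses the curly apostrophe (’) would need that character added as well.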

# -*- coding: utf-8 -*-
import re

def addLinks(s):
    #adds a link to dictionary for every word in text
    link = "xxx.com?word="

    #strip out 's, 'd, 'l, 'm, 've, 're
    #then split on punctuation
    words = filter(None, re.split("[, \-!?:_;\"“”‘’‹›«»]+",  re.sub("'[(s|d|l|m|(ve)|(re)]? ", " ", s)))
    for w in words:
        linkedWord = "<a href=#'" + link + w.lower() + "'>" + w + "</a>"
        s = s.replace(w,linkedWord,1)
    return s

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.
"""
print addLinks(s)

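# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # unicode text, so \w can match non-ASCII letters
import re

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights
to erupt at any time anywhere," he said holding a gun in his hand.
"""

def callback(pat):
    # href gets the lowercased word; the visible text keeps its original case
    word = pat.group(1)
    return '<a href="xxx.com?word=' + word.lower() + '">' + word + '</a>'

# (?<!['’]) skips the tail of a contraction (straight or curly apostrophe),
# (?!i>) skips the "i" of <i> and </i>, and \b stops a match from starting
# mid-word (without it, 've would still produce a link on "e").
result = re.sub(r"(?<!['’])(?!i>)\b(\w+)", callback, s, flags=re.UNICODE)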

"<a href="xxx.com?word=i">I</a>'m
 <i><a href="xxx.com?word=certain">certain</a></i>
 <a href="xxx.com?word=in">in</a>
 <a href="xxx.com?word=america">America</a>'s
 “<a href="xxx.com?word=west">West</a>,” ...
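Putting the question's requirements together (lowercase href, unchanged visible text, no links on contraction endings, Unicode words, single-quoted href as in the expected result), one possible shape is sketched below. The names `add_links` and `WORD_RE` are illustrative and do not come from the original posts; the two-character lookbehind and the handling of the curly apostrophe are assumptions of this sketch.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # keeps the sketch valid on 2.7 and 3
import re

# A word that (a) does not follow "<word character><apostrophe>", i.e. is not
# the tail of a contraction, and (b) is not the "i" of <i> or </i>.
# Requiring a word character before the apostrophe lets a word that merely
# follows an opening quote mark ('hello') still be linked.
WORD_RE = re.compile(r"(?<!\w['’])(?!i>)\b\w+", re.UNICODE)


def add_links(text, base="xxx.com?word="):
    """Wrap every word of `text` in a link to base + lowercased word."""
    def link(match):
        word = match.group(0)
        return "<a href='{0}{1}'>{2}</a>".format(base, word.lower(), word)
    # A single pass over the text: each word is replaced where it stands,
    # so repeated or nested words (gun, gunfights) cannot interfere.
    return WORD_RE.sub(link, text)


linked = add_links("He's <i>certain</i> it could’ve, ‘Émile’")
```

Because the substitution is done in one pass with a callback, there is no need to split the text and stitch it back together. Two limitations remain: \w also matches digits and underscores, and a name such as O'Brien would only get a link on "O".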