Removing multiple trailing hyphens with a PHP regex (not covered by the character class??)

php html regex

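I asked a question here earlier, but I decided I should break it up into several questions (this helped me debug further and understand more precisely what I need!).

Another user here provided a very nice regex for detecting and hyperlinking URLs, which is broken into the following parts: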

$rexProtocol = '(https?://)?';
$rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort     = '(:[0-9]{1,5})?';
$rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
$rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';

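This is a good way of breaking down a URL for me, though of course this comes from someone who is still trying to become more familiar with the world of regex engines. With the following condition, many good cases get caught: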

while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position)) {...

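However, one thing that frustrated me a bit is that this does not fully capture a link while ignoring trailing punctuation and other characters (it only works for a single punctuation mark at the end of a link, and so on). So I decided to work on the condition, and after some tweaking and research I found that the following works better, with \s replaced by a dot (.):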

    while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"\'-]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))

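This effectively covers most trailing non-alphanumeric characters at the end of a URL in a sentence. You would think this would include hyphens, but for some reason it does not: it removes only one hyphen from the end of the URL and not the rest, which prevents me from filtering a URL out of a sentence with multiple hyphens. Are there any suggestions for changing the regex or anything else in the code? Here is the rest of my modified code: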

function formatTextLinksVerbose($text) {
    $rexProtocol = '(https?://)?';
    $rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
    $rexPort     = '(:[0-9]{1,5})?';
    $rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
    $rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
    $rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';

    $validTlds = array_fill_keys(explode(" ", ".aero .asia .biz .cat .com .coop .edu .gov .info .int .jobs .mil .mobi .museum .name .net .org .pro .tel .travel .ac .ad .ae .af .ag .ai .al .am .an .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bv .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gb .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sj .sk .sl .sm .sn .so .sr .st .su .sv .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tp .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .yu .za .zm .zw .xn--0zwm56d .xn--11b5bs3a9aj6g .xn--80akhbyknj4f .xn--9t4b11yi5a .xn--deba0ad .xn--g6w251d .xn--hgbk6aj7f53bba .xn--hlcj6aya9esc7a .xn--jxalpdlp .xn--kgbechtv .xn--zckzah .arpa"), true);

    $position = 0;
    $returnText = "";
    while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($url, $urlPosition) = $match[0];

        // Append the text leading up to the URL in return value.
        $returnText .= htmlspecialchars(substr($text, $position, $urlPosition - $position));

        $domain = $match[2][0];
        $port   = $match[3][0];
        $path   = $match[4][0];

        // Check if the TLD is valid - or that $domain is an IP address.
        $tld = strtolower(strrchr($domain, '.'));
        if (preg_match('{\.[0-9]{1,3}}', $tld) || isset($validTlds[$tld]))
        {
            // Prepend http:// if no protocol specified
            $completeUrl = $match[1][0] ? $url : "http://$url";

            // Append the hyperlink.
            $returnText .= '<a href="' . htmlspecialchars($completeUrl) . '">' . htmlspecialchars("$domain$port$path") . '</a>';
        }
        else
        {
            // Not a valid URL.
            $returnText .= htmlspecialchars($url);
        }

        // Continue text parsing from after the URL.
        $position = $urlPosition + strlen($url);
    }

    // Append and return the remainder of the text.
    return($returnText . htmlspecialchars(substr($text, $position)));
}
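
One possible explanation for the hyphen behaviour, offered as a sketch rather than a confirmed diagnosis: the final label in $rexDomain uses the class [-a-zA-Z0-9], which itself contains the hyphen, so trailing hyphens are consumed as part of the domain instead of being left for the lookahead. The snippet below illustrates this on a bare domain. The names $rexDomainNoTrailingHyphen, $rexEnd and firstDomain() are hypothetical and not part of the original code.

```php
<?php
// Domain pattern copied from the question: the final label allows "-".
$rexDomain = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';

// Hypothetical variant: the final label must end in a letter or digit.
$rexDomainNoTrailingHyphen = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{1,62}[a-zA-Z0-9]|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';

// Hypothetical lookahead: any number of trailing punctuation marks
// (hyphens included), followed by whitespace or the end of the text.
$rexEnd = '(?=[?.!,;:"\'-]*(\s|$))';

// Hypothetical helper: returns the first captured domain, or null.
function firstDomain($domainPattern, $endPattern, $text) {
    return preg_match("~\\b$domainPattern$endPattern~", $text, $m) ? $m[1] : null;
}

$text = 'www.stackoverflow.com---';

// Original class: the hyphens are swallowed into the last label.
$swallowed = firstDomain($rexDomain, '', $text);   // "www.stackoverflow.com---"

// Variant class plus the "*" lookahead: the hyphens stay outside the match.
$trimmed = firstDomain($rexDomainNoTrailingHyphen, $rexEnd, $text);   // "www.stackoverflow.com"
```

The sketch deliberately keeps (\s|$) in the lookahead. With (.|$) the lookahead succeeds in front of almost any character, so the lazy path, query and fragment groups may stop matching earlier than intended.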

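(By the way, I realize htmlspecialchars is supposed to protect the form I submit to this page from user misbehavior, but is there a place in the function where I can stop worrying about this? Should I decode back to a non-HTML string outside the function? I see that the output contains double quotes as the "&quot;" character code.)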
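
On the htmlspecialchars point, a small illustration may help. The function above escapes each piece of text once as it builds the output, so seeing &quot; in the page source is the expected result of that escaping. Decoding would only be needed if the plain text were required again. The sample string below is made up for the example.

```php
<?php
// Made-up input containing characters that are special in HTML.
$raw = 'He said "visit www.example.com" & left';

// Escape for HTML output: " becomes &quot; and & becomes &amp;.
$encoded = htmlspecialchars($raw, ENT_QUOTES);

// Reverse the escaping, e.g. if the plain text is needed again.
$decoded = htmlspecialchars_decode($encoded, ENT_QUOTES);
```

A browser renders &quot; as a double quote, so the entity is only visible when viewing the HTML source.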

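Not an answer to your question, just a general observation.
You can factor out some of the regex parts and use named capture groups.
That way, when you change or modify the regex, you don't have to redo the body of the code.
The regex: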

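Are you referring to text that contains "" immediately after the URL? I can't see how you would determine whether that is part of the URL. I suggest you look for a whitespace character to mark the end of the URL.

I think this regex was laid out so that you can, in fact, filter out non-alphanumeric characters after the ending TLD. For example, the function works for "www.stackoverflow.com...", "www.stackoverflow.com!!!" and "www.stackoverflow.com_____;". Every punctuation case works except the hyphen. I don't know why. I noticed that when I explicitly specify it before (.|$), it filters multiple hyphens, but I can't use "+" or "*".

Using " in the URL, when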

$prot   = '(?<Protocol>https?://)?';
$domain = '(?<Domain>(?:(?&lt){1,63}\.)+(?&lt){2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$port   = '(?<Port>:[0-9]{1,5})?';
$other  = '(?<Path>/(?&txt)*?)?(?<Query>\?(?&txt)+?)?(?<Fragment>\#(?&txt)+?)?';
$def    = '(?(DEFINE)(?<lt>[-a-zA-Z0-9])(?<txt>[!$-/0-9:;=@_\'a-zA-Z\x7f-\xff]))';

$regex = "$prot$domain$port$other$def"; 

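while (preg_match("{\\b$regex(?=[?.!,;:\"]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))
{
    // Named groups are available as $match['Domain'][0], $match['Port'][0], etc.
    // Advance past this match so the loop terminates.
    $position = $match[0][1] + strlen($match[0][0]);
}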

while (
   preg_match(
   '~
        (?<Protocol> https?:// )?     # (1)
        (?<Domain>                    # (2)
             (?:
                  (?&lt){1,63} \.
             )+
             (?&lt){2,63} 
          |  (?: [0-9]{1,3} \. ){3}
             [0-9]{1,3} 
        )
        (?<Port> : [0-9]{1,5} )?      # (3)
        (?<Path>                      # (4)
             / (?&txt)*? 
        )?
        (?<Query>                     # (5)
             \? (?&txt)+? 
        )?
        (?<Fragment>                  # (6)
             \# (?&txt)+? 
        )?
        (?(DEFINE)
             (?<lt> [-a-zA-Z0-9] )         # (7)
             (?<txt>                       # (8)
                  [!$-/0-9:;=@_\'a-zA-Z\x7f-\xff] 
             )
        )
        (?=[?.!,;:"]?(.|$))
   ~x'
   , $text, $match, PREG_OFFSET_CAPTURE, $position))
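{
    // Named groups are available as $match['Domain'][0], $match['Port'][0], etc.
    // Advance past this match so the loop terminates.
    $position = $match[0][1] + strlen($match[0][0]);
}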
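
To make the benefit of the named groups concrete, here is a minimal sketch of a loop body that reads them by name. It uses a shortened pattern (no Query or Fragment group), a made-up sample text, and the whitespace-terminated lookahead from the original regex. The $found array is a hypothetical name used only for this illustration.

```php
<?php
// Shortened pattern in the style shown above, in extended (x) mode.
$pattern = '~
    \b
    (?<Protocol> https?:// )?
    (?<Domain>
         (?: (?&lt){1,63} \. )+ (?&lt){2,63}
      |  (?: [0-9]{1,3} \. ){3} [0-9]{1,3}
    )
    (?<Port> : [0-9]{1,5} )?
    (?<Path> / (?&txt)*? )?
    (?=[?.!,;:"]?(\s|$))
    (?(DEFINE)
         (?<lt>  [-a-zA-Z0-9] )
         (?<txt> [!$-/0-9:;=@_\'a-zA-Z\x7f-\xff] )
    )
~x';

// Made-up sample text.
$text = 'See http://example.com:8080/docs, then www.example.org.';

$position = 0;
$found = [];
while (preg_match($pattern, $text, $match, PREG_OFFSET_CAPTURE, $position)) {
    list($url, $urlPosition) = $match[0];

    // Groups are read by name, so renumbering the pattern does not break this code.
    $found[] = [
        'url'    => $url,
        'domain' => $match['Domain'][0],
        'path'   => $match['Path'][0] ?? '',
    ];

    // Continue after the URL so the loop terminates.
    $position = $urlPosition + strlen($url);
}
```

With PREG_OFFSET_CAPTURE each group is an array of the matched text and its offset, which is why the code reads index [0] of each named entry.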