How can I make this PHP URL parsing function nearly perfect?

php regex url parsing


This function is great, but its main flaw is that it doesn't handle domains ending in .co.uk or .com.au. How can it be modified to handle this?

function parseUrl($url) {
    $r  = "^(?:(?P<scheme>\w+)://)?";
    $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
    $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
    $r .= "(?::(?P<port>\d+))?";
    $r .= "(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?";
    $r .= "(?:\?(?P<arg>[\w=&]+))?";
    $r .= "(?:#(?P<anchor>\w+))?";
    $r = "!$r!";

    preg_match ( $r, $url, $out );

    return $out;
}

Result:

Array(
[scheme] => http
[host] => sub1.sub2.test.co.uk
)

What I want to extract is "test.co.uk" (sans subdomains), so running parse_url first is a pointless extra step in which the output is the same as the input.

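Replace this bit:

(?P<extension>\w+)

with this:

(?P<extension>\w+(?:\.\w+)?)

Since extensions don't contain digits or underscores and are usually only two or three letters (I think .museum is the longest, at 6, so 10 is probably a safe maximum), you could tighten it to:

(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)

If you do that, you may need to add the case-insensitive flag (or add A-Z to the character classes as well).

Based on your comment, you want the subdomain part of the match to be "lazy" (matching only as much as it has to), which lets the extension capture both parts.

To do that, add a ? to the end of the quantifier, changing:

(?P<subdomain>[-\w\.]+)

to:

(?P<subdomain>[-\w\.]+?)

(I can add comments to explain any of this if necessary.)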
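As a rough sketch of how the two suggested group changes might be dropped into the question's function: the name parseUrlLazy(), the trailing `$` anchor and the `i` flag are assumptions added here, not part of the answer. Without an end anchor, the lazy subdomain can stop early and leave part of the host unmatched, so this sketch anchors the end of the pattern.

```php
<?php
// Sketch: the question's parseUrl() with the two suggested group changes.
// Assumptions added here: the function name, the "$" end anchor, the "i" flag.
function parseUrlLazy(string $url): array
{
    $r  = '^(?:(?P<scheme>\w+)://)?';
    $r .= '(?:(?P<login>\w+):(?P<pass>\w+)@)?';
    // Lazy subdomain, plus an extension of one or two letter-only labels.
    $r .= '(?P<host>(?:(?P<subdomain>[-\w\.]+?)\.)?'
        . '(?P<domain>[-\w]+\.(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)))';
    $r .= '(?::(?P<port>\d+))?';
    $r .= '(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?';
    $r .= '(?:\?(?P<arg>[\w=&]+))?';
    $r .= '(?:#(?P<anchor>\w+))?';
    // Anchor the end so the lazy quantifier cannot stop before the host ends.
    $r = '!' . $r . '$!i';

    preg_match($r, $url, $out);

    return $out; // empty array when the URL does not match
}

$parts = parseUrlLazy('http://sub1.sub2.test.co.uk');
echo $parts['subdomain'], ' | ', $parts['domain'], ' | ', $parts['extension'], "\n";
// prints: sub1.sub2 | test.co.uk | co.uk
```

This is still a heuristic: any host whose second-to-last label consists only of letters is treated as having a two-part extension, so http://a.b.example.com yields the domain b.example.com.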

What's wrong with the built-in parse_url?

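This may or may not be of interest, but here is a regex I wrote that mostly conforms to RFC 3986 (it is actually slightly stricter, as it disallows some of the more unusual URI syntaxes). The full pattern is the long single-line expression beginning with ~^(?:(?:(?P<scheme> further down this page.

The code that generates it (along with variants selected by a few option flags) is the validateUri() function at the end of this page.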


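parse_url() cannot extract the subdomain or the domain-name extension. You have to come up with your own solution here.

I think a proper implementation would have to include a library of all domain-name extensions and keep it regularly updated.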
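A minimal sketch of that idea, assuming a tiny hard-coded suffix list: EXAMPLE_SUFFIXES and registrableDomain() are hypothetical names invented for this illustration, and a real implementation would load a maintained list of extensions rather than the six entries shown.

```php
<?php
// Hypothetical, deliberately tiny suffix list, for illustration only.
const EXAMPLE_SUFFIXES = ['co.uk', 'com.au', 'com', 'org', 'uk', 'au'];

// Returns the label directly left of the longest known suffix plus that
// suffix, or null when no suffix in the list matches the host.
function registrableDomain(string $host, array $suffixes = EXAMPLE_SUFFIXES): ?string
{
    $labels = explode('.', strtolower($host));

    // Walk from the longest candidate suffix to the shortest.
    for ($i = 1; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            return $labels[$i - 1] . '.' . $candidate;
        }
    }

    return null;
}

// parse_url() isolates the host; the suffix lookup does the rest.
$host = parse_url('http://sub1.sub2.test.co.uk/index.php', PHP_URL_HOST);
echo registrableDomain($host), "\n"; // prints: test.co.uk
```

Because the lookup is driven by the list rather than by label length, www.example.com resolves to example.com instead of being mistaken for a two-part extension.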

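parse_url is too lenient with malformed URLs, which the OP may not want.

Actually, the reason is that I also want to strip off all subdomains.

Hmm, since parse_url gives you the hostname, why not write a (simpler) expression to split off the subdomains and the extension?

@Fo Why not use parse_url for the initial parsing and then do further parsing on the hostname it returns?

I'm afraid this is what I get: Array ( [0] => [scheme] => http [1] => http [login] => [2] => [pass] => [3] => [host] => test.co.uk [4] => test.co.uk [subdomain] => test [5] => test [domain] => co.uk [6] => co.uk [extension] => uk [7] => uk ) It should also work with something like subdomain.subdomain2.test.co.uk

I think you can fix that by making the subdomain part lazy.

Hmm.. I still haven't gotten a clean domain out of it.

Have fun maintaining that mess.

Why not just use parse_url()?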

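Extension group, permissive replacement:
(?P<extension>\w+(?:\.\w+)?)

Extension group, stricter replacement:
(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)

Subdomain group, original (greedy):
(?P<subdomain>[-\w\.]+)

Subdomain group, lazy:
(?P<subdomain>[-\w\.]+?)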

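A simpler expression for splitting a bare hostname into subdomains, domain and extension (one-line form, then expanded):

^(?P<subdomains>(?:[\w-]+\.)*?)(?P<domain>[\w-]+(?P<extension>(?:\.[a-z]{2,10}){1,2}))$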

^
(?P<subdomains>
  (?:[\w-]+\.)*?
)
(?P<domain>
  [\w-]+
  (?P<extension>
     (?:\.[a-z]{2,10}){1,2}
   )
)$
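
One way the comments' suggestion might be combined with this expression is to let parse_url() isolate the host and then split only that host. In this sketch, splitHost() is a hypothetical helper name and the `i` flag is an added assumption; the pattern itself is the host-only expression above.

```php
<?php
// Sketch: parse_url() extracts the host, the anchored host-only expression
// splits it. splitHost() is a hypothetical helper, not from the answers.
function splitHost(string $url): ?array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host component, or a seriously malformed URL
    }

    $re = '~^(?P<subdomains>(?:[\w-]+\.)*?)'
        . '(?P<domain>[\w-]+(?P<extension>(?:\.[a-z]{2,10}){1,2}))$~i';

    if (!preg_match($re, $host, $m)) {
        return null;
    }

    return [
        'subdomains' => rtrim($m['subdomains'], '.'), // drop the trailing dot
        'domain'     => $m['domain'],
        'extension'  => ltrim($m['extension'], '.'),  // drop the leading dot
    ];
}

print_r(splitHost('http://sub1.sub2.test.co.uk/path'));
// subdomains => sub1.sub2, domain => test.co.uk, extension => co.uk
```

The same caveat applies as with any length-based guess: for http://www.example.com/ the lazy subdomain group matches nothing and the whole host www.example.com is reported as the domain, because "example.com" also fits the two-label extension pattern.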

The full pattern generated by validateUri() (a single line):
~^(?:(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)?(?P<authority>(?:(?P<userinfo>(?P<username>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?:(?P<password>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?|(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:)*?)@)?(?P<host>(?P<domain>(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?\.)+(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?))|(?P<ip>(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d).(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d).(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d).(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)))(?::(?P<port>\d+))?(?=/|$)))?(?P<path>/?(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/)*(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/?)?)(?:\?(?P<query>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*?))?(?:#(?P<fragment>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*))?$~i

Named components:
scheme
authority
  userinfo
    username
    password
  domain
  ip
path
query
fragment

The function that generates the pattern:

public static function validateUri($uri, &$components = false, $flags = 0)
{
    if (func_num_args() > 3)
    {
        $flags = array_slice(func_get_args(), 2);
    }

    if (is_array($flags))
    {
        $flagsArray = $flags;
        $flags = 0; // start from an integer so the flags can be OR-ed together
        foreach ($flagsArray as $flag)
        {
            if (is_int($flag))
            {
                $flags |= $flag;
            }
        }
    }

    // Set options.
    $requireScheme = !($flags & self::URI_ALLOW_NO_SCHEME);
    $requireAuthority = !($flags & self::URI_ALLOW_NO_AUTHORITY);
    $isRelative = (bool)($flags & self::URI_IS_RELATIVE);
    $requireMultiPartDomain = (bool)($flags & self::URI_REQUIRE_MULTI_PART_DOMAIN);

    // And we're away…

    // Some character types (taken from RFC 3986: http://tools.ietf.org/html/rfc3986).
    $hex = '[\da-f]'; // Hexadecimal digit.
    $pct = "(?:%$hex{2})"; // "Percent-encoded" value.
    $gen = '[\[\]:/?#@]'; // Generic delimiters.
    $sub = '[!$&\'()*+,;=]'; // Sub-delimiters.
    $reserved = "(?:$gen|$sub)"; // Reserved characters.
    $unreserved = '[\w.\~-]'; // Unreserved characters.
    $pChar = "(?:$unreserved|$pct|$sub|:|@)"; // Path characters.
    $qfChar = "(?:$pChar|/|\?)"; // Query/fragment characters.

    // Other entities.
    $octet = '(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)';
    $label = '[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?';

    $scheme = '(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)';

    // Authority components.
    $userInfo = "(?:(?P<userinfo>(?P<username>(?:$unreserved|$pct|$sub)*)?:(?P<password>(?:$unreserved|$pct|$sub)*)?|(?:$unreserved|$pct|$sub|:)*?)@)?";
    // Dots are escaped so they match a literal "." (the pattern as posted left them unescaped).
    $ip = "(?P<ip>$octet\.$octet\.$octet\.$octet)";
    if ($requireMultiPartDomain)
    {
        $domain = "(?P<domain>(?:$label\.)+(?:$label))";
    }
    else
    {
        $domain = "(?P<domain>(?:$label\.)*(?:$label))";
    }
    $host = "(?P<host>$domain|$ip)";
    $port = '(?::(?P<port>\d+))?';

    // Primary hierarchical URI components.
    $authority = "(?P<authority>$userInfo$host$port(?=/|$))";
    $path = "(?P<path>/?(?:$pChar+/)*(?:$pChar+/?)?)";

    // Final bits.
    $query = "(?:\?(?P<query>$qfChar*?))?";
    $fragment = "(?:#(?P<fragment>$qfChar*))?";

    // Construct the final pattern.
    $pattern = '~^';

    // Only include scheme and authority if the path is not relative.
    if (!$isRelative)
    {
        if ($requireScheme)
        {
            // If the scheme is required, then the authority must also be there.
            $pattern .= $scheme . $authority;
        }
        else if ($requireAuthority)
        {
            $pattern .= "$scheme?$authority";
        }
        else
        {
            $pattern .= "(?:$scheme?$authority)?";
        }
    }
    else
    {
        // Disallow that optional slash we put in $path.
        $pattern .= '(?!/)';
    }

    // Now add standard elements and terminate the pattern.
    $pattern .= $path . $query . $fragment . '$~i';

    // Finally, validate that sucker!
    $components = array();
    $result = (bool)preg_match($pattern, $uri, $matches);
    if ($result)
    {
        // Filter out all of the useless numerical matches.
        foreach ($matches as $key => $value)
        {
            if (!is_int($key))
            {
                $components[$key] = $value;
            }
        }

        return true;
    }
    else
    {
        return false;
    }
}
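
The loop that drops the numeric keys from $matches could also be written with array_filter(). The following is only a sketch of that alternative: namedGroups() is a hypothetical helper, and the short pattern is a stand-in used to keep the example small, not the full pattern built above.

```php
<?php
// Keep only the named (string-keyed) entries of a preg_match() result.
// ARRAY_FILTER_USE_KEY passes each key, not each value, to the callback.
function namedGroups(array $matches): array
{
    return array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
}

// Stand-in pattern for illustration; far less strict than validateUri().
preg_match(
    '~^(?P<scheme>[a-z][0-9a-z.+-]*)://(?P<host>[^/:]+)~i',
    'http://test.co.uk/',
    $m
);

print_r(namedGroups($m));
// scheme => http, host => test.co.uk
```

preg_match() stores every named group twice, once under its name and once under its numeric index, which is why the filtering step is needed at all.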