Why are some Persian characters removed when cleaning a URL? (PHP, regex, PCRE)

Here is the function I use to clean URLs:

function make_clean_url($url){

    $url_word_separator = "-";

    // To replace new lines with space
    $url = preg_replace('/\n+/', " ", $url);

    // To replace spaces with -
    $url = preg_replace('/\s+/', "-", $url);

    // To replace dot(s) with -
    $url = preg_replace('/\.+/', "-", $url);

    // To remove HTML entities, e.g. &laquo;
    $url = preg_replace("/&#?[a-z0-9]+;/i","",$url);

    // To remove everything except numbers, dash, number sign, space and alphabet characters
    $url = preg_replace('/[^\x{600}-\x{6FF}a-zA-Z0-9 #\-]/u', '', $url); // <-- the issue is on this line

    // To trim surrounding spaces and dashes
    $url = trim($url, " $url_word_separator");

    return $url;
}

This regex works for most URLs, but there is one exception:

echo make_clean_url("اﺻﻠﯽ ﺗﺮﯾﻦ ﻓﺮق اﺳﺘﻌﺎره ﻣﺼﺮﺣﻪ و ﻣﮑﻨﯿﻪ ﭼﯿﺴﺖ؟");
//=> ا--ق-اره--و--؟

See? It removes most of the letters. Why? These characters are Persian, which \x{600}-\x{6FF} allows. So why are they removed?

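The characters being removed are not in the \u0600-\u06FF range, so this behavior is expected. They are contextual presentation forms of the Persian letters (encoded in the Arabic Presentation Forms blocks, U+FB50-U+FDFF and U+FE70-U+FEFF), which are separate code points from the standard letters. For example, ﭼ is not the same character as چ (U+0686).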
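To check this on your own input, it can help to print each character's code point and see whether it falls inside U+0600-U+06FF. The snippet below is only a sketch: describe_code_points() is a hypothetical helper name, not part of the original post, and it assumes the mbstring extension (for mb_ord) on PHP 7.2 or later.

```php
<?php
// Hypothetical helper (not from the original post): report the code point of
// every character in a UTF-8 string and whether it lies in U+0600-U+06FF.
function describe_code_points(string $text): array
{
    $rows = [];
    // Split the UTF-8 string into individual characters.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $cp = mb_ord($char, 'UTF-8');
        $rows[] = [
            'char'     => $char,
            'hex'      => sprintf('U+%04X', $cp),
            'in_range' => $cp >= 0x0600 && $cp <= 0x06FF,
        ];
    }
    return $rows;
}

// U+FB7C is the initial presentation form of tcheh; U+0686 is the standard letter.
foreach (describe_code_points("\u{FB7C}\u{0686}") as $row) {
    printf("%s %s\n", $row['hex'], $row['in_range'] ? 'kept' : 'removed');
}
// prints:
// U+FB7C removed
// U+0686 kept
```

Running the same helper over the failing URL shows which letters are presentation forms and would therefore be stripped by the original character class.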

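You may want to use \p{Arabic} instead of \x{0600}-\x{06ff} so that the whole Arabic script, presentation forms included, is allowed. Here is the whole function: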

function make_clean_url($url) {
    $url_word_separator = '-';
    $url = preg_replace('/\R+/', ' ', $url);
    $url = preg_replace('/[\s.]+/', '-', $url);
    $url = preg_replace('/&#?[a-z0-9]+;|[^\p{Arabic}a-z0-9#-]+/ui', '', $url);
    $url = trim($url, " $url_word_separator");
    return $url;
}
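A different approach, not taken from the answer above, is to normalize the input to Unicode form NFKC before filtering, so that presentation forms are folded into the standard letters and the original \x{600}-\x{6FF} range keeps working. This is a minimal sketch that assumes the intl extension (the Normalizer class) is installed; make_clean_url_nfkc() is a hypothetical name.

```php
<?php
// Sketch only: assumes the intl extension (Normalizer class) is available.
// make_clean_url_nfkc() is a hypothetical name, not part of the original answer.
function make_clean_url_nfkc(string $url): string
{
    $separator = '-';

    // Fold compatibility characters (including Arabic presentation forms)
    // into their standard letters, e.g. U+FB7C becomes U+0686.
    $normalized = Normalizer::normalize($url, Normalizer::FORM_KC);
    if ($normalized !== false) {
        $url = $normalized;
    }

    // Line breaks, other whitespace and dots all become the separator.
    $url = preg_replace('/[\s.]+/u', $separator, $url);
    // Drop HTML entities, then anything outside the allowed set.
    $url = preg_replace('/&#?[a-z0-9]+;/i', '', $url);
    $url = preg_replace('/[^\x{600}-\x{6FF}a-zA-Z0-9#\-]+/u', '', $url);

    return trim($url, $separator);
}

echo make_clean_url_nfkc("\u{FB7C} hello.world"), "\n";
```

The trade-off is that the resulting slug contains only standard letters, so the same word typed with or without presentation forms yields the same URL, whereas \p{Arabic} keeps whichever code points were in the input.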

In /[^\x{600}-\x{6FF}a-zA-Z0-9\-]/u, a-zA-Z only matches ASCII letters. Try /[^\x{600}-\x{6FF}\p{L}0-9-]+/u instead.
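As a quick illustration of what the \p{L} variant keeps, here is a small sketch. The function name keep_letters_digits_dashes() and the sample string are made up for the example.

```php
<?php
// Hypothetical wrapper around the pattern suggested above.
function keep_letters_digits_dashes(string $text): string
{
    // \p{L} matches a letter from any script, so presentation forms and
    // accented Latin letters survive; punctuation such as "!" does not.
    return preg_replace('/[^\x{600}-\x{6FF}\p{L}0-9-]+/u', '', $text);
}

// U+FB7C (presentation form), U+0686 (standard letter), ASCII, and U+00E9.
$sample = "\u{FB7C}\u{0686}-abc-\u{00E9}!";
echo keep_letters_digits_dashes($sample), "\n";
```

Note that \p{L} admits letters from every script, not just Arabic and Latin, and it leaves presentation forms in the output as they were typed.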