How to replace Microsoft-encoded quotes in PHP
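Due to an encoding problem in my application, I need to replace the Microsoft Word versions of single and double quotation marks (“ ” ‘ ’) with regular quotes (' and "). I do not need them as HTML entities, and I cannot change the database schema.
I have two options: use a regular expression or an associative array.
Is there a better way?
Considering that you only want to replace a few specific, well-identified characters, I would go with an array: you clearly don't need the heavy artillery of regular expressions ;-) If you run into other special characters (damn Microsoft Word copy-paste...), you can add them to that array as you identify them.
As for your comment, the best answer I can give is probably a link to a page with the relevant code (there is no Microsoft Word on this computer, so I could not test it myself). I don't remember exactly what we used at work (I was not the one who had to deal with this kind of input), but it was the same kind of thing.
Your Microsoft-encoded quotes are probably the simplest case. If you know the encoding of the string in which you want to replace them, you can simply replace them with str_replace.
For UTF-8, the same can be done with a single mapping array.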
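A minimal sketch of the mapping-array idea for UTF-8 input. It assumes `strtr` as the lookup function, and the helper name `replace_smart_quotes_utf8` and the set of mapped characters are illustrative choices, not taken from the answer above.

```php
<?php
// Hypothetical helper: replaces UTF-8 encoded typographic quotes with ASCII quotes.
function replace_smart_quotes_utf8(string $text): string
{
    // Keys are the UTF-8 byte sequences of the Unicode code points (PHP 7+ escape syntax).
    $map = [
        "\u{2018}" => "'", // LEFT SINGLE QUOTATION MARK
        "\u{2019}" => "'", // RIGHT SINGLE QUOTATION MARK
        "\u{201A}" => "'", // SINGLE LOW-9 QUOTATION MARK
        "\u{201C}" => '"', // LEFT DOUBLE QUOTATION MARK
        "\u{201D}" => '"', // RIGHT DOUBLE QUOTATION MARK
        "\u{201E}" => '"', // DOUBLE LOW-9 QUOTATION MARK
    ];

    // strtr() with an array replaces every key with its value in one pass.
    return strtr($text, $map);
}
```

Because the keys are complete multi-byte sequences, other UTF-8 characters in the string are left alone. For example, `replace_smart_quotes_utf8("\u{201C}Hello\u{201D}")` returns `"Hello"` with plain double quotes.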
If you need another encoding, you can convert the keys to it.
We used the following code. It handles a few special characters:
// Assumes $text is a single-byte Windows-1252 string; in UTF-8 input these byte
// values occur inside multi-byte characters and replacing them corrupts the text.
$text = str_replace(chr(130), ',', $text); // Baseline single quote
$text = str_replace(chr(132), '"', $text); // Baseline double quote
$text = str_replace(chr(133), '...', $text); // Ellipsis
$text = str_replace(chr(145), "'", $text); // Left single quote
$text = str_replace(chr(146), "'", $text); // Right single quote
$text = str_replace(chr(147), '"', $text); // Left double quote
$text = str_replace(chr(148), '"', $text); // Right double quote
// Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');
I found an answer to this problem. With the iconv() function in PHP it takes only one line of code:
// replace Microsoft Word version of single and double quotations marks (“ ” ‘ ’) with regular quotes (' and ")
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
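iconv() returns false when the conversion fails, and what //TRANSLIT produces depends on the iconv implementation PHP was built against. A small sketch of guarding the call follows. The wrapper name `to_ascii_or_original` and the choice to fall back to the unmodified input are assumptions made for illustration.

```php
<?php
// Hypothetical wrapper: transliterate to ASCII, but keep the input if iconv() fails.
function to_ascii_or_original(string $input): string
{
    // "@" silences the notice iconv() raises on illegal input; the return value is checked instead.
    $output = @iconv('UTF-8', 'ASCII//TRANSLIT', $input);

    return $output === false ? $input : $output;
}
```

Falling back to the original string avoids silently storing an empty or truncated value. Whether that is the right policy depends on what the caller does with unconverted text.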
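If, like me, you arrived here with a pile of ASCII/Microsoft Word characters that do strange things to your CMS or RTE, and iconv did not work, this crazy function may be just what you need. When saving this function to a file, make sure the file encoding is UTF-8.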
<?php
/**
* fixMSWord
*
* Replace ASCII chars with UTF-8. Note there are ASCII characters that don't
* correctly map and will be replaced by spaces.
*
* @author Robin Cafolla
* @date 2013-03-22
*/
function fixMSWord($string) {
$map = Array(
'33' => '!', '34' => '"', '35' => '#', '36' => '$', '37' => '%', '38' => '&', '39' => "'", '40' => '(', '41' => ')', '42' => '*',
'43' => '+', '44' => ',', '45' => '-', '46' => '.', '47' => '/', '48' => '0', '49' => '1', '50' => '2', '51' => '3', '52' => '4',
'53' => '5', '54' => '6', '55' => '7', '56' => '8', '57' => '9', '58' => ':', '59' => ';', '60' => '<', '61' => '=', '62' => '>',
'63' => '?', '64' => '@', '65' => 'A', '66' => 'B', '67' => 'C', '68' => 'D', '69' => 'E', '70' => 'F', '71' => 'G', '72' => 'H',
'73' => 'I', '74' => 'J', '75' => 'K', '76' => 'L', '77' => 'M', '78' => 'N', '79' => 'O', '80' => 'P', '81' => 'Q', '82' => 'R',
'83' => 'S', '84' => 'T', '85' => 'U', '86' => 'V', '87' => 'W', '88' => 'X', '89' => 'Y', '90' => 'Z', '91' => '[', '92' => '\\',
'93' => ']', '94' => '^', '95' => '_', '96' => '`', '97' => 'a', '98' => 'b', '99' => 'c', '100'=> 'd', '101'=> 'e', '102'=> 'f',
'103'=> 'g', '104'=> 'h', '105'=> 'i', '106'=> 'j', '107'=> 'k', '108'=> 'l', '109'=> 'm', '110'=> 'n', '111'=> 'o', '112'=> 'p',
'113'=> 'q', '114'=> 'r', '115'=> 's', '116'=> 't', '117'=> 'u', '118'=> 'v', '119'=> 'w', '120'=> 'x', '121'=> 'y', '122'=> 'z',
'123'=> '{', '124'=> '|', '125'=> '}', '126'=> '~', '127'=> ' ', '128'=> '€', '129'=> ' ', '130'=> ',', '131'=> ' ', '132'=> '"',
'133'=> '.', '134'=> ' ', '135'=> ' ', '136'=> '^', '137'=> ' ', '138'=> ' ', '139'=> '<', '140'=> ' ', '141'=> ' ', '142'=> ' ',
'143'=> ' ', '144'=> ' ', '145'=> "'", '146'=> "'", '147'=> '"', '148'=> '"', '149'=> '.', '150'=> '-', '151'=> '-', '152'=> '~',
'153'=> ' ', '154'=> ' ', '155'=> '>', '156'=> ' ', '157'=> ' ', '158'=> ' ', '159'=> ' ', '160'=> ' ', '161'=> '¡', '162'=> '¢',
'163'=> '£', '164'=> '¤', '165'=> '¥', '166'=> '¦', '167'=> '§', '168'=> '¨', '169'=> '©', '170'=> 'ª', '171'=> '«', '172'=> '¬',
'173'=> '', '174'=> '®', '175'=> '¯', '176'=> '°', '177'=> '±', '178'=> '²', '179'=> '³', '180'=> '´', '181'=> 'µ', '182'=> '¶',
'183'=> '·', '184'=> '¸', '185'=> '¹', '186'=> 'º', '187'=> '»', '188'=> '¼', '189'=> '½', '190'=> '¾', '191'=> '¿', '192'=> 'À',
'193'=> 'Á', '194'=> 'Â', '195'=> 'Ã', '196'=> 'Ä', '197'=> 'Å', '198'=> 'Æ', '199'=> 'Ç', '200'=> 'È', '201'=> 'É', '202'=> 'Ê',
'203'=> 'Ë', '204'=> 'Ì', '205'=> 'Í', '206'=> 'Î', '207'=> 'Ï', '208'=> 'Ð', '209'=> 'Ñ', '210'=> 'Ò', '211'=> 'Ó', '212'=> 'Ô',
'213'=> 'Õ', '214'=> 'Ö', '215'=> '×', '216'=> 'Ø', '217'=> 'Ù', '218'=> 'Ú', '219'=> 'Û', '220'=> 'Ü', '221'=> 'Ý', '222'=> 'Þ',
'223'=> 'ß', '224'=> 'à', '225'=> 'á', '226'=> 'â', '227'=> 'ã', '228'=> 'ä', '229'=> 'å', '230'=> 'æ', '231'=> 'ç', '232'=> 'è',
'233'=> 'é', '234'=> 'ê', '235'=> 'ë', '236'=> 'ì', '237'=> 'í', '238'=> 'î', '239'=> 'ï', '240'=> 'ð', '241'=> 'ñ', '242'=> 'ò',
'243'=> 'ó', '244'=> 'ô', '245'=> 'õ', '246'=> 'ö', '247'=> '÷', '248'=> 'ø', '249'=> 'ù', '250'=> 'ú', '251'=> 'û', '252'=> 'ü',
'253'=> 'ý', '254'=> 'þ', '255'=> 'ÿ'
);
    // Build parallel search/replace lists: each numeric key becomes the single byte with that code.
    $search = Array();
    $replace = Array();
    foreach ($map as $s => $r) {
        $search[] = chr((int)$s);
        $replace[] = $r;
    }
    return str_replace($search, $replace, $string);
}
Every previous answer (with one exception) corrupts Unicode strings:
echo convert_smart_quotes("This is Yi: ꑑ. Point ⒒ this breaks Yi. Yi broke–why? I need a longer––point. This makes Han 嗗 mad.");
Result:
This is Yi: ?''. Point ?'' this breaks Yi. Yi broke?"why? I need a longer?"?"point. This makes Han ?-- mad.
iconv:
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
Result:
This is Yi: ?''. Point ?'' this breaks Yi. Yi broke?"why? I need a longer?"?"point. This makes Han ?-- mad.
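PHP Notice: iconv(): Detected an illegal character in input string in php shell code on line 1
You can change it to //IGNORE, which drops the characters but does not transliterate them.
This is the best way to replace CP1252-encoded Microsoft quotes. If they are in Unicode and you need to replace them, use Gumbo's answer: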
function convert_cp1252_to_ascii($input, $default = '') {
    if ($input === null || $input == '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Use the search/replace arrays if a character needs to be replaced with
         * something other than its Unicode equivalent.
         */
        $replace = array(
            128 => "E", // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
            129 => "", // UNDEFINED
            130 => ",", // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
            131 => "f", // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
            132 => ",,", // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
            133 => "...", // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
            134 => "t", // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
            135 => "T", // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
            136 => "^", // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
            137 => "%", // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
            138 => "S", // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
            139 => "<", // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
            140 => "OE", // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
            141 => "", // UNDEFINED
            142 => "Z", // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
            143 => "", // UNDEFINED
            144 => "", // UNDEFINED
            145 => "'", // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
            146 => "'", // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
            147 => "\"", // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
            148 => "\"", // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
            149 => "*", // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
            150 => "-", // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
            151 => "--", // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
            152 => "~", // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
            153 => "TM", // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
            154 => "s", // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
            155 => ">", // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
            156 => "oe", // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
            157 => "", // UNDEFINED
            158 => "z", // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
            159 => "Y", // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
        );

        $find = array();
        foreach (array_keys($replace) as $key) {
            $find[] = chr($key);
        }
        $input = str_replace($find, array_values($replace), $input);

        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
    }
    return $input;
}
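The corruption shown earlier comes from applying single-byte replacements to text that is already UTF-8. One way to sidestep it, sketched here under assumptions, is to test whether the input is valid UTF-8 first, convert from Windows-1252 only when it is not, and then replace the quotes at the code-point level. The function name `normalize_word_quotes` and the fallback to Windows-1252 are illustrative; a byte string that happens to be valid in both encodings will be treated as UTF-8.

```php
<?php
// Hypothetical helper: returns UTF-8 text with typographic quotes replaced by ASCII quotes.
function normalize_word_quotes(string $input): string
{
    // Anything that is not valid UTF-8 is assumed (not detected) to be Windows-1252.
    if (!mb_check_encoding($input, 'UTF-8')) {
        $input = mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
    }

    // Replace whole code points, so unrelated multi-byte characters stay intact.
    return strtr($input, [
        "\u{2018}" => "'",
        "\u{2019}" => "'",
        "\u{201C}" => '"',
        "\u{201D}" => '"',
    ]);
}
```

With this ordering, a CP1252 string such as `"\x93hi\x94"` and its UTF-8 equivalent both come out as `"hi"` with plain double quotes, while characters such as U+A451 pass through unchanged.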