How to find Chinese or Japanese characters in a string in Python?
For example:
str = 'sdf344asfasf天地方益3権sdfsdf'
Add parentheses () around the Chinese and Japanese characters:
strAfterConvert = 'sdfasfasf(天地方益)3(権)sdfsdf'
First, you can check whether a character falls in one of the following Unicode blocks:
- U+4E00 to U+9FFF
- U+3400 to U+4DBF
- U+20000 to U+2A6DF
- U+2A700 to U+2B73F
- U+2B740 to U+2B81F
After that, you just need to iterate over the string, check whether each character is Chinese, Japanese, or Korean (CJK), and append it accordingly:
# -*- coding: utf-8 -*-
ranges = [
    {"from": ord(u"\u3300"), "to": ord(u"\u33ff")},          # CJK Compatibility
    {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")},          # CJK Compatibility Forms
    {"from": ord(u"\uf900"), "to": ord(u"\ufaff")},          # CJK Compatibility Ideographs
    {"from": ord(u"\U0002F800"), "to": ord(u"\U0002fa1f")},  # CJK Compatibility Ideographs Supplement
    {"from": ord(u"\u3040"), "to": ord(u"\u309f")},          # Japanese Hiragana
    {"from": ord(u"\u30a0"), "to": ord(u"\u30ff")},          # Japanese Katakana
    {"from": ord(u"\u2e80"), "to": ord(u"\u2eff")},          # CJK Radicals Supplement
    {"from": ord(u"\u4e00"), "to": ord(u"\u9fff")},          # CJK Unified Ideographs
    {"from": ord(u"\u3400"), "to": ord(u"\u4dbf")},          # CJK Unified Ideographs Extension A
    {"from": ord(u"\U00020000"), "to": ord(u"\U0002a6df")},  # Extension B
    {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")},  # Extension C
    {"from": ord(u"\U0002b740"), "to": ord(u"\U0002b81f")},  # Extension D
    {"from": ord(u"\U0002b820"), "to": ord(u"\U0002ceaf")}   # Extension E, included as of Unicode 8.0
]

def is_cjk(char):
    """Return True if char falls inside any of the CJK ranges."""
    return any(r["from"] <= ord(char) <= r["to"] for r in ranges)

def cjk_substrings(string):
    i = 0
    while i < len(string):
        if is_cjk(string[i]):
            start = i
            # Stop at the end of the string or at the first non-CJK character.
            while i < len(string) and is_cjk(string[i]):
                i += 1
            yield string[start:i]
        i += 1

string = "sdf344asfasf天地方益3権sdfsdf".decode("utf-8")
for sub in cjk_substrings(string):
    string = string.replace(sub, "(" + sub + ")")
print string
To be on the safe side for the future, you may want to keep an eye out for CJK Unified Ideographs Extension E, which is included as of Unicode 8.0. I have added it to the ranges above, but you shouldn't include it until Unicode 8.0 is released.
[Edit] Added the Hiragana, Katakana, CJK Radicals Supplement, and compatibility-ideograph ranges. You can also do this with the regex module, which supports checking the Unicode "Script" property of each character and is a drop-in replacement for the re package:
import regex as re
pattern = re.compile(r'([\p{IsHan}\p{IsBopo}\p{IsHira}\p{IsKatakana}]+)', re.UNICODE)
input = u'sdf344asfasf天地方益3権sdfsdf'
output = pattern.sub(r'(\1)', input)
print output # Prints: sdf344asfasf(天地方益)3(権)sdfsdf
You should adjust the \p{Is...} sequence to the character scripts/blocks that you consider "Chinese or Japanese". If you can't use the regex module, which provides access to the IsKatakana and IsHan properties shown above, you can use the following character ranges with the stdlib re module:
>>> import re
>>> print(re.sub(u"([\u3300-\u33ff\ufe30-\ufe4f\uf900-\ufaff\U0002f800-\U0002fa1f\u30a0-\u30ff\u2e80-\u2eff\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002b73f\U0002b740-\U0002b81f\U0002b820-\U0002ceaf]+)", r"(\1)", u'sdf344asfasf天地方益3権sdfsdf'))
sdf344asfasf(天地方益)3(権)sdfsdf
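As an aside, the regex module can also tell you which script a character belongs to, which comes up in the comments below. Here is a minimal sketch; the script_of helper and the particular set of scripts probed are assumptions for illustration, not part of the answers above:

import regex

def script_of(char):
    # Hypothetical helper: probe a few candidate Unicode scripts using
    # the regex module's \p{Script} properties and return the first match.
    # The scripts checked here are an assumption; extend the tuple as needed.
    for script in (u'Han', u'Hiragana', u'Katakana', u'Hangul'):
        if regex.match(u'\\p{%s}' % script, char):
            return script
    return None

print(script_of(u'天'))  # Han
print(script_of(u'テ'))  # Katakana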
You can also check the Unicode category:
>>> import unicodedata
>>> unicodedata.category(u'天')
'Lo'
>>> unicodedata.category(u's')
'Ll'
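Building on that, here is a minimal sketch that wraps runs of category "Lo" (Letter, other) characters in parentheses. The wrap_lo_runs helper is illustrative only; note that "Lo" also covers many non-CJK scripts, so this is a rougher heuristic than the explicit ranges or script properties above:

import unicodedata

def wrap_lo_runs(text):
    # Hypothetical helper: treat "Lo" (Letter, other) as a CJK stand-in.
    # Caveat: "Lo" also matches other scripts (e.g. Thai, Hebrew), so this
    # is only an approximation.
    out, run = [], []
    for ch in text:
        if unicodedata.category(ch) == 'Lo':
            run.append(ch)
        else:
            if run:
                out.append(u'(' + u''.join(run) + u')')
                run = []
            out.append(ch)
    if run:
        out.append(u'(' + u''.join(run) + u')')
    return u''.join(out)

print(wrap_lo_runs(u'sdf344asfasf天地方益3権sdfsdf'))
# sdf344asfasf(天地方益)3(権)sdfsdf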
Inspired by this answer: combining is_cjk() from nltk.tokenize.util with @EvenLisle's substring answer:
>>> from nltk.tokenize.util import is_cjk
>>> text = u'sdf344asfasf天地方益3権sdfsdf'
>>> [1 if is_cjk(ch) else 0 for ch in text]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
>>> def cjk_substrings(string):
...     i = 0
...     while i < len(string):
...         if is_cjk(string[i]):
...             start = i
...             while i < len(string) and is_cjk(string[i]): i += 1
...             yield string[start:i]
...         i += 1
...
>>> string = "sdf344asfasf天地方益3権sdfsdf".decode("utf-8")
>>> for sub in cjk_substrings(string):
... string = string.replace(sub, "(" + sub + ")")
...
>>> string
u'sdf344asfasf(\u5929\u5730\u65b9\u76ca)3(\u6a29)sdfsdf'
>>> print string
sdf344asfasf(天地方益)3(権)sdfsdf
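The transcript above is Python 2 (note the .decode("utf-8") call and the u'...' repr). Under Python 3, where str is already Unicode, a minimal equivalent sketch looks like this (assuming nltk is installed):

from nltk.tokenize.util import is_cjk

def cjk_substrings(string):
    i = 0
    while i < len(string):
        if is_cjk(string[i]):
            start = i
            while i < len(string) and is_cjk(string[i]):
                i += 1
            yield string[start:i]
        i += 1

text = 'sdf344asfasf天地方益3権sdfsdf'  # str is Unicode in Python 3
for sub in cjk_substrings(text):
    text = text.replace(sub, '(' + sub + ')')
print(text)  # sdf344asfasf(天地方益)3(権)sdfsdf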
Comments:
- Is this Python 2 or Python 3? — Python 2.
- This is rather broad, and I don't want to look up every range: you can decode from UTF-8 to get a unicode object, then use regular expressions to detect specific ranges of Unicode code points. Which ranges apply to Chinese and Japanese is left as an exercise in studying the Unicode standard.
- Related: from the link posted above, you can iterate over the characters and test the value of ord() against the different CJK ranges, though those do not cover all of the relevant ranges.
- @EdChum: I have updated my answer to include the available Unicode ranges.
- Those ranges miss the Japanese kana characters and a series of CJK symbols, strokes, radicals, compatibility characters, and phonetic extensions. Checking the Unicode "Script" property is easier and more reliable.
- I get a "TypeError: ord() expected a character, but string of length 2 found" for {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")} and all the other lines containing "\U". Could you take a look? Thanks. (This happens on "narrow" Python 2 builds, where characters outside the Basic Multilingual Plane are stored as surrogate pairs; see the sketch after the code below.)
- The Hiragana range is missing; please add {'from': ord(u'\u3040'), 'to': ord(u'\u309f')}.
- Can regex tell you which script a character belongs to? Thx! It works where @EvenLisle's answer fails: テンポラリ
- Minor quibble: the docstring types character as char, but it is actually str (there is no char type in Python):
def is_cjk(character):
    """
    Checks whether character is CJK.

        >>> is_cjk(u'\u33fe')
        True
        >>> is_cjk(u'\uFE5F')
        False

    :param character: The character that needs to be checked.
    :type character: char
    :return: bool
    """
    return any([start <= ord(character) <= end for start, end in
                [(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215),
                 (63744, 64255), (65072, 65103), (65381, 65500),
                 (131072, 196607)]
                ])
class CJKChars(object):
    """
    An object that enumerates the code points of the CJK characters as listed on
    http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

    This is a Python port of the CJK code point enumerations of Moses tokenizer:
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl#L309
    """
    # Hangul Jamo (1100–11FF)
    Hangul_Jamo = (4352, 4607)  # (ord(u"\u1100"), ord(u"\u11ff"))

    # CJK Radicals Supplement (2E80–2EFF)
    # Kangxi Radicals (2F00–2FDF)
    # Ideographic Description Characters (2FF0–2FFF)
    # CJK Symbols and Punctuation (3000–303F)
    # Hiragana (3040–309F)
    # Katakana (30A0–30FF)
    # Bopomofo (3100–312F)
    # Hangul Compatibility Jamo (3130–318F)
    # Kanbun (3190–319F)
    # Bopomofo Extended (31A0–31BF)
    # CJK Strokes (31C0–31EF)
    # Katakana Phonetic Extensions (31F0–31FF)
    # Enclosed CJK Letters and Months (3200–32FF)
    # CJK Compatibility (3300–33FF)
    # CJK Unified Ideographs Extension A (3400–4DBF)
    # Yijing Hexagram Symbols (4DC0–4DFF)
    # CJK Unified Ideographs (4E00–9FFF)
    # Yi Syllables (A000–A48F)
    # Yi Radicals (A490–A4CF)
    CJK_Radicals = (11904, 42191)  # (ord(u"\u2e80"), ord(u"\ua4cf"))

    # Phags-pa (A840–A87F)
    Phags_Pa = (43072, 43135)  # (ord(u"\ua840"), ord(u"\ua87f"))

    # Hangul Syllables (AC00–D7AF)
    Hangul_Syllables = (44032, 55215)  # (ord(u"\uAC00"), ord(u"\uD7AF"))

    # CJK Compatibility Ideographs (F900–FAFF)
    CJK_Compatibility_Ideographs = (63744, 64255)  # (ord(u"\uF900"), ord(u"\uFAFF"))

    # CJK Compatibility Forms (FE30–FE4F)
    CJK_Compatibility_Forms = (65072, 65103)  # (ord(u"\uFE30"), ord(u"\uFE4F"))

    # Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul characters
    Katakana_Hangul_Halfwidth = (65381, 65500)  # (ord(u"\uFF65"), ord(u"\uFFDC"))

    # Supplementary Ideographic Plane 20000–2FFFF
    Supplementary_Ideographic_Plane = (131072, 196607)  # (ord(u"\U00020000"), ord(u"\U0002FFFF"))

    ranges = [Hangul_Jamo, CJK_Radicals, Phags_Pa, Hangul_Syllables,
              CJK_Compatibility_Ideographs, CJK_Compatibility_Forms,
              Katakana_Hangul_Halfwidth, Supplementary_Ideographic_Plane]
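Regarding the ord() TypeError reported in the comments above: on narrow Python 2 builds, non-BMP literals such as u"\U0002a700" are stored as two-character surrogate pairs, so ord() rejects them. A minimal workaround sketch, assuming you only need the range table, is to write the code points as plain integers (as the NLTK port above does):

# Workaround sketch for narrow Python 2 builds: avoid calling ord() on
# non-BMP literals by writing the code points as plain integers.
ranges = [
    {"from": 0x4E00, "to": 0x9FFF},    # CJK Unified Ideographs (BMP)
    {"from": 0x2A700, "to": 0x2B73F},  # Extension C (non-BMP)
]

def is_cjk(char):
    return any(r["from"] <= ord(char) <= r["to"] for r in ranges)

# Caveat: on a narrow build, non-BMP characters inside a string are also
# surrogate pairs, so a per-character scan like this still cannot match
# them; a wide build or Python 3 handles them correctly.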