Python API
On the JVM side you need a transformer similar to this:
package net.zero323.spark.ml.feature

import java.text.Normalizer

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param._
import org.apache.spark.ml.util._
import org.apache.spark.sql.types.{DataType, StringType}

class UnicodeNormalizer (override val uid: String)
  extends UnaryTransformer[String, String, UnicodeNormalizer] {

  def this() = this(Identifiable.randomUID("unicode_normalizer"))

  private val forms = Map(
    "NFC" -> Normalizer.Form.NFC, "NFD" -> Normalizer.Form.NFD,
    "NFKC" -> Normalizer.Form.NFKC, "NFKD" -> Normalizer.Form.NFKD
  )

  val form: Param[String] = new Param(this, "form",
    "unicode form (one of NFC, NFD, NFKC, NFKD)",
    ParamValidators.inArray(forms.keys.toArray))

  def setForm(value: String): this.type = set(form, value)

  def getForm: String = $(form)

  setDefault(form -> "NFKD")

  override protected def createTransformFunc: String => String = {
    val normalizerForm = forms($(form))
    (s: String) => Normalizer.normalize(s, normalizerForm)
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType, s"Input type must be string type but got $inputType.")
  }

  override protected def outputDataType: DataType = StringType
}
The corresponding build definition (adjust Spark and Scala versions to match your Spark deployment):

name := "unicode-normalization"

version := "1.0"

crossScalaVersions := Seq("2.11.12", "2.12.8")

organization := "net.zero323"

val sparkVersion = "2.4.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
On the Python side you need a wrapper similar to this:

from pyspark.ml.param.shared import *
# from pyspark.ml.util import keyword_only  # in Spark < 2.0
from pyspark import keyword_only
from pyspark.ml.wrapper import JavaTransformer

class UnicodeNormalizer(JavaTransformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, form="NFKD", inputCol=None, outputCol=None):
        super(UnicodeNormalizer, self).__init__()
        self._java_obj = self._new_java_obj(
            "net.zero323.spark.ml.feature.UnicodeNormalizer", self.uid)
        self.form = Param(self, "form",
            "unicode form (one of NFC, NFD, NFKC, NFKD)")
        # kwargs = self.__init__._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, form="NFKD", inputCol=None, outputCol=None):
        # kwargs = self.setParams._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setForm(self, value):
        return self._set(form=value)

    def getForm(self):
        return self.getOrDefault(self.form)
Build the Scala package:

sbt +package

and include it when you start the shell or submit the application. For example, for a Spark build with Scala 2.11:

bin/pyspark --jars path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
 --driver-class-path path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar
You should be ready to go. All that is left is a little bit of regexp magic:

from pyspark.sql.functions import regexp_replace

normalizer = UnicodeNormalizer(form="NFKD",
    inputCol="text", outputCol="text_normalized")

df = sc.parallelize([
    (1, "Maracaibó"), (2, "New York"),
    (3, "   São Paulo   "), (4, "~Madrid")
]).toDF(["id", "text"])

(normalizer
    .transform(df)
    .select(regexp_replace("text_normalized", "\p{M}", ""))
    .show())
## +--------------------------------------+
## |regexp_replace(text_normalized,\p{M},)|
## +--------------------------------------+
## |                             Maracaibo|
## |                              New York|
## |                          Sao Paulo   |
## |                               ~Madrid|
## +--------------------------------------+
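To sanity-check the expected output locally, the same NFKD-decompose-then-drop-combining-marks step can be sketched in plain Python with the standard unicodedata module, no Spark required. `strip_accents` is a hypothetical helper name, not part of the pipeline above:

```python
import unicodedata

def strip_accents(s):
    # NFKD-decompose, then drop combining marks -- the same character
    # class the \p{M} regexp removes on the Spark side.
    return "".join(
        ch for ch in unicodedata.normalize("NFKD", s)
        if not unicodedata.combining(ch)
    )

print(strip_accents("Maracaibó"))   # Maracaibo
print(strip_accents("São Paulo"))   # Sao Paulo
```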
Please note that it follows the same conventions as the built-in text transformers and is not null safe. You can easily correct for that by checking for null in createTransformFunc. One possible improvement is to build a custom Transformer which handles both Unicode normalization and the corresponding Python wrapper. It should reduce the overall overhead of passing data between the JVM and Python, and it doesn't require any modifications to Spark itself or access to private APIs.
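A minimal sketch of such a null check, shown in plain Python for brevity (the actual fix would go inside the Scala createTransformFunc; `normalize_nfkd` is a hypothetical name):

```python
import unicodedata

def normalize_nfkd(s):
    # Pass None through unchanged instead of raising -- this mirrors
    # the null check suggested for createTransformFunc.
    return None if s is None else unicodedata.normalize("NFKD", s)

print(normalize_nfkd(None))  # None
print(normalize_nfkd("é"))   # 'e' followed by a combining acute accent
```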
An alternative is to do this in Python only, using unicodedata to build a translation table from accented characters to their base forms:
import unicodedata
import sys
from pyspark.sql.functions import translate, regexp_replace
def make_trans():
matching_string = ""
replace_string = ""
for i in range(ord(" "), sys.maxunicode):
name = unicodedata.name(chr(i), "")
if "WITH" in name:
try:
base = unicodedata.lookup(name.split(" WITH")[0])
matching_string += chr(i)
replace_string += base
except KeyError:
pass
return matching_string, replace_string
def clean_text(c):
matching_string, replace_string = make_trans()
return translate(
regexp_replace(c, "\p{M}", ""),
matching_string, replace_string
).alias(c)
df = sc.parallelize([
(1, "Maracaibó"), (2, "New York"),
(3, " São Paulo "), (4, "~Madrid"),
(5, "São Paulo"), (6, "Maracaibó")
]).toDF(["id", "text"])
df.select(clean_text("text")).show()
## +---------------+
## |           text|
## +---------------+
## |      Maracaibo|
## |       New York|
## |     Sao Paulo |
## |        ~Madrid|
## |      Sao Paulo|
## |      Maracaibo|
## +---------------+
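The make_trans/translate pair can also be exercised without Spark, since Python's str.translate accepts the same character mapping. This sketch repeats the same loop locally and shows it covers letters such as Ł, which NFKD decomposition alone leaves untouched because it has no combining-mark decomposition:

```python
import sys
import unicodedata

def make_trans():
    # Map every codepoint whose name reads "X WITH Y" to its base "X",
    # exactly as in the Spark version above.
    matching_string = ""
    replace_string = ""
    for i in range(ord(" "), sys.maxunicode):
        name = unicodedata.name(chr(i), "")
        if "WITH" in name:
            try:
                base = unicodedata.lookup(name.split(" WITH")[0])
                matching_string += chr(i)
                replace_string += base
            except KeyError:
                pass
    return matching_string, replace_string

matching_string, replace_string = make_trans()
table = str.maketrans(matching_string, replace_string)
print("Łódź".translate(table))  # Ł is mapped via its "WITH STROKE" name
```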
Another approach cleans the distinct column values with re and unidecode, then maps them back into the DataFrame with na.replace:

import re
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructType, StructField
from unidecode import unidecode
spark = SparkSession.builder.getOrCreate()
data = [(1, " \\ / \\ {____} aŠdá_ \t = \n () asd ____aa 2134_ 23_"), (1, "N"), (2, "false"), (2, "1"), (3, "NULL"),
(3, None)]
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df = spark.createDataFrame(data, schema)
df.show()
for col_name in ["txt"]:
tmp_dict = {}
for col_value in [row[0] for row in df.select(col_name).distinct().toLocalIterator()
if row[0] is not None]:
new_col_value = re.sub("[ ,;{}()\\n\\t=\\\/]", "_", col_value)
new_col_value = re.sub('_+', '_', new_col_value)
if new_col_value.startswith("_"):
new_col_value = new_col_value[1:]
if new_col_value.endswith("_"):
new_col_value = new_col_value[:-1]
new_col_value = unidecode(new_col_value)
tmp_dict[col_value] = new_col_value.lower()
df = df.na.replace(to_replace=tmp_dict, subset=[col_name])
df.show()
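The per-value cleanup chain (replace separators with underscores, collapse repeats, trim, transliterate, lowercase) can be checked on its own, independent of Spark. unidecode is a third-party package, so a unicodedata-based stripping step is substituted here to keep the sketch self-contained; `clean_value` is a hypothetical name:

```python
import re
import unicodedata

def clean_value(value):
    # Replace separator characters with underscores, as in the loop above.
    value = re.sub(r"[ ,;{}()\n\t=\\/]", "_", value)
    # Collapse runs of underscores, then trim leading/trailing ones.
    value = re.sub(r"_+", "_", value).strip("_")
    # Strip diacritics; this stands in for unidecode so the sketch has no
    # third-party dependency (unidecode handles many more scripts).
    value = "".join(ch for ch in unicodedata.normalize("NFKD", value)
                    if not unicodedata.combining(ch))
    return value.lower()

print(clean_value(" \\ / \\ {____} aŠdá_ \t = \n () asd ____aa 2134_ 23_"))
# asda_asd_aa_2134_23
```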
If unidecode is not available, an explicit mapping with str.maketrans can be used instead (note the lowercase ß in the first string, mapping to "s", while the capital ẞ maps to "S"):

new_col_value = new_col_value.translate(str.maketrans(
    "ä,ö,ü,ß,á,ä,č,ď,é,ě,í,ĺ,ľ,ň,ó,ô,ŕ,š,ť,ú,ů,ý,ž,Ä,Ö,Ü,ẞ,Á,Ä,Č,Ď,É,Ě,Í,Ĺ,Ľ,Ň,Ó,Ô,Ŕ,Š,Ť,Ú,Ů,Ý,Ž",
    "a,o,u,s,a,a,c,d,e,e,i,l,l,n,o,o,r,s,t,u,u,y,z,A,O,U,S,A,A,C,D,E,E,I,L,L,N,O,O,R,S,T,U,U,Y,Z"))