On the JVM side you need a transformer similar to this one:

package net.zero323.spark.ml.feature

import java.text.Normalizer
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param._
import org.apache.spark.ml.util._
import org.apache.spark.sql.types.{DataType, StringType}

class UnicodeNormalizer(override val uid: String)
  extends UnaryTransformer[String, String, UnicodeNormalizer] {

  def this() = this(Identifiable.randomUID("unicode_normalizer"))

  private val forms = Map(
    "NFC" -> Normalizer.Form.NFC, "NFD" -> Normalizer.Form.NFD,
    "NFKC" -> Normalizer.Form.NFKC, "NFKD" -> Normalizer.Form.NFKD
  )

  val form: Param[String] = new Param(this, "form", "unicode form (one of NFC, NFD, NFKC, NFKD)",
    ParamValidators.inArray(forms.keys.toArray))

  def setN(value: String): this.type = set(form, value)

  def getForm: String = $(form)

  setDefault(form -> "NFKD")

  override protected def createTransformFunc: String => String = {
    val normalizerForm = forms($(form))
    (s: String) => Normalizer.normalize(s, normalizerForm)
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType, s"Input type must be string type but got $inputType.")
  }

  override protected def outputDataType: DataType = StringType
}
The corresponding build definition (adjust Spark and Scala versions to match your Spark deployment):

name := "unicode-normalization"

version := "1.0"

crossScalaVersions := Seq("2.11.12", "2.12.8")

organization := "net.zero323"

val sparkVersion = "2.4.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
On the Python side you need a wrapper similar to this one:

from pyspark.ml.param.shared import *
# from pyspark.ml.util import keyword_only  # in Spark < 2.0
from pyspark import keyword_only
from pyspark.ml.wrapper import JavaTransformer


class UnicodeNormalizer(JavaTransformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, form="NFKD", inputCol=None, outputCol=None):
        super(UnicodeNormalizer, self).__init__()
        self._java_obj = self._new_java_obj(
            "net.zero323.spark.ml.feature.UnicodeNormalizer", self.uid)
        self.form = Param(self, "form",
            "unicode form (one of NFC, NFD, NFKC, NFKD)")
        # kwargs = self.__init__._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, form="NFKD", inputCol=None, outputCol=None):
        # kwargs = self.setParams._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setForm(self, value):
        return self._set(form=value)

    def getForm(self):
        return self.getOrDefault(self.form)
Build the Scala package:

sbt +package
Include it when you start the shell or when you submit an application. For example, for a Spark build with Scala 2.11:

bin/pyspark --jars path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
 --driver-class-path path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar
and you should be ready to go. All that is left is a little bit of regexp magic:

from pyspark.sql.functions import regexp_replace

normalizer = UnicodeNormalizer(form="NFKD",
    inputCol="text", outputCol="text_normalized")

df = sc.parallelize([
    (1, "Maracaibó"), (2, "New York"),
    (3, "São Paulo"), (4, "~Madrid")
]).toDF(["id", "text"])

(normalizer
    .transform(df)
    .select(regexp_replace("text_normalized", r"\p{M}", ""))
    .show())

## +--------------------------------------+
## |regexp_replace(text_normalized,\p{M},)|
## +--------------------------------------+
## |                             Maracaibo|
## |                              New York|
## |                             Sao Paulo|
## |                               ~Madrid|
## +--------------------------------------+

Note that this follows the same conventions as the built-in text transformers and is not null safe. You can easily correct for that by checking for null in createTransformFunc. A custom Transformer with a Python wrapper like the one above keeps the overall overhead of passing data between the JVM and Python low, and it doesn't require any modifications to Spark itself or access to private APIs.
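A minimal sketch of such a null check, assuming you simply want null inputs passed through unchanged, could look like this:

override protected def createTransformFunc: String => String = {
  val normalizerForm = forms($(form))
  // Pass nulls through instead of letting Normalizer.normalize throw a NullPointerException.
  (s: String) => if (s == null) null else Normalizer.normalize(s, normalizerForm)
}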

An alternative that stays entirely in Python builds a translation table from Unicode character names and combines it with regexp_replace to strip the remaining combining marks:

import unicodedata
import sys

from pyspark.sql.functions import translate, regexp_replace

def make_trans():
    matching_string = ""
    replace_string = ""

    for i in range(ord(" "), sys.maxunicode):
        name = unicodedata.name(chr(i), "")
        if "WITH" in name:
            try:
                base = unicodedata.lookup(name.split(" WITH")[0])
                matching_string += chr(i)
                replace_string += base
            except KeyError:
                pass

    return matching_string, replace_string

def clean_text(c):
    matching_string, replace_string = make_trans()
    return translate(
        regexp_replace(c, r"\p{M}", ""),
        matching_string, replace_string
    ).alias(c)

df = sc.parallelize([
    (1, "Maracaibó"), (2, "New York"),
    (3, "   São Paulo   "), (4, "~Madrid"),
    (5, "São Paulo"), (6, "Maracaibó")
]).toDF(["id", "text"])

df.select(clean_text("text")).show()
## +---------------+
## |           text|
## +---------------+
## |      Maracaibo|
## |       New York|
## |   Sao Paulo   |
## |        ~Madrid|
## |      Sao Paulo|
## |      Maracaibo|
## +---------------+

Yet another option is to collect the distinct values to the driver, clean them up with plain Python (re plus the unidecode package) and write the replacements back with df.na.replace:

import re

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructType, StructField
from unidecode import unidecode

spark = SparkSession.builder.getOrCreate()
data = [(1, "  \\ / \\ {____} aŠdá_ \t =  \n () asd ____aa 2134_ 23_"), (1, "N"), (2, "false"), (2, "1"), (3, "NULL"),
        (3, None)]
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df = spark.createDataFrame(data, schema)
df.show()

for col_name in ["txt"]:
    tmp_dict = {}
    for col_value in [row[0] for row in df.select(col_name).distinct().toLocalIterator()
                      if row[0] is not None]:
        new_col_value = re.sub(r"[ ,;{}()\n\t=\\/]", "_", col_value)
        new_col_value = re.sub('_+', '_', new_col_value)
        if new_col_value.startswith("_"):
            new_col_value = new_col_value[1:]
        if new_col_value.endswith("_"):
            new_col_value = new_col_value[:-1]
        new_col_value = unidecode(new_col_value)
        tmp_dict[col_value] = new_col_value.lower()
    df = df.na.replace(to_replace=tmp_dict, subset=[col_name])
df.show()
If you want to avoid the extra unidecode dependency, the unidecode(new_col_value) call above can be replaced with an explicit translation table:

new_col_value = new_col_value.translate(str.maketrans(
    "ä,ö,ü,ẞ,á,ä,č,ď,é,ě,í,ĺ,ľ,ň,ó,ô,ŕ,š,ť,ú,ů,ý,ž,Ä,Ö,Ü,ẞ,Á,Ä,Č,Ď,É,Ě,Í,Ĺ,Ľ,Ň,Ó,Ô,Ŕ,Š,Ť,Ú,Ů,Ý,Ž",
    "a,o,u,s,a,a,c,d,e,e,i,l,l,n,o,o,r,s,t,u,u,y,z,A,O,U,S,A,A,C,D,E,E,I,L,L,N,O,O,R,S,T,U,U,Y,Z"))