Python PySpark: how to conditionally apply a function to Spark DataFrame columns and fill null values

python apache-spark pyspark

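I have a Spark DataFrame with a column named location_string, which I basically just want to decompose into 3 columns called country, region, and city. I then want to merge those columns with the country, region, and city columns that already exist, to make sure the null values get filled. In other words, I want to apply my function to the rows where city, region, or country is null and try to fill those values using location_string.

Example dataset: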

+--------------------+-----------------+------+-------+
|     location_string|             city|region|country|
+--------------------+-----------------+------+-------+
|Jonesboro, AR, US...|             NULL|    AR|   NULL|
|Lake Village, AR,...|     Lake Village|    AR|    USA|
|Little Rock, AR, ...|      Little Rock|    AR|    USA|
|Little Rock, AR, ...|      Little Rock|    AR|    USA|
|Malvern, AR, US, ...|          Malvern|  NULL|    USA|
|Malvern, AR, US, ...|          Malvern|    AR|    USA|
|Morrilton, AR, US...|        Morrilton|    AR|    USA|
|Morrilton, AR, US...|        Morrilton|    AR|    USA|
|N. Little Rock, A...|North Little Rock|    AR|    USA|
|N. Little Rock, A...|North Little Rock|    AR|    USA|
|Ozark, AR, US, 72949|            Ozark|    AR|    USA|
|Ozark, AR, US, 72949|            Ozark|    AR|    USA|
|Palestine, AR, US...|             NULL|    AR|    USA|
|Pine Bluff, AR, U...|       Pine Bluff|    AR|   NULL|
|Pine Bluff, AR, U...|       Pine Bluff|    AR|    USA|
|Prescott, AR, US,...|         Prescott|    AR|    USA|
|Prescott, AR, US,...|         Prescott|    AR|    USA|
|Searcy, AR, US, 7...|           Searcy|    AR|    USA|
|Searcy, AR, US, 7...|           Searcy|    AR|    USA|
|West Memphis, AR,...|     West Memphis|  NULL|    USA|
+--------------------+-----------------+------+-------+

Example function for decomposing the location string:

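import geocoder  # third-party package (pip install geocoder)

def geocoder_decompose_location(location_string):
    # Nothing to look up for a null or empty string.
    if not location_string:
        return {'country': None, 'state': None, 'city': None}
    GOOGLE_GEOCODE_API_KEY = "<API KEY HERE>"
    # Each call sends one request to the Google geocoding service.
    result = geocoder.google(location_string, key=GOOGLE_GEOCODE_API_KEY)
    return {'country': result.country, 'state': result.state, 'city': result.city}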

Scala pseudocode

First, we need to drop all duplicates from the df (this reduces the number of API calls to the Google service).

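We can also run orderBy(desc("city"), desc("country"), desc("region")) before dropping duplicates, so that when duplicates exist, the rows with null values are the ones that get dropped.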
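The "keep the most complete duplicate" idea can also be made explicit instead of relying on sort order. The following is a minimal sketch, not the answerer's code: `pick_most_complete` and `dedupe_prefer_complete` are hypothetical names, and the column names are taken from the sample dataset in the question. The plain-Python function shows the rule on a list of dicts; the second function is an assumed PySpark equivalent that uses a window with `row_number`.

```python
COLS = ("city", "region", "country")


def pick_most_complete(rows, key="location_string", cols=COLS):
    """Plain-Python version: keep, per key, the row with the fewest None fields."""
    best = {}
    for row in rows:
        missing = sum(row[c] is None for c in cols)
        k = row[key]
        if k not in best or missing < best[k][0]:
            best[k] = (missing, row)
    return [row for _, row in best.values()]


def dedupe_prefer_complete(df, key="location_string", cols=COLS):
    """Assumed PySpark equivalent: one row per key, fewest nulls first."""
    # Imported lazily so the plain-Python helper works without Spark installed.
    from pyspark.sql import Window, functions as F

    # Number of null fields in each row; 0 means fully populated.
    missing = sum(F.col(c).isNull().cast("int") for c in cols)
    w = Window.partitionBy(key).orderBy(missing.asc())
    return (
        df.withColumn("_rank", F.row_number().over(w))
        .where(F.col("_rank") == 1)
        .drop("_rank")
    )
```

Ranking inside a window states which duplicate survives, whereas a sort followed by dropDuplicates leaves that choice to Spark.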

Thanks - you make a good point about dropping duplicates so that I can reduce API calls.
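If every original row has to be kept, the number of API calls can still be reduced by looking up each distinct location_string once and joining the results back. This is a rough sketch under several assumptions: `decompose_unique` and `fill_via_lookup` are hypothetical names, `decompose` is any function that returns a dict with 'country', 'state' and 'city' keys (like geocoder_decompose_location in the question), and the distinct strings that need a lookup are assumed to fit in driver memory.

```python
def decompose_unique(location_strings, decompose):
    """Call decompose once per distinct non-empty string; return {string: result}."""
    lookup = {}
    for s in location_strings:
        if s and s not in lookup:
            lookup[s] = decompose(s)
    return lookup


def fill_via_lookup(spark, df, decompose):
    """Fill null city/region/country values from a driver-side lookup table."""
    # Imported lazily so decompose_unique works without Spark installed.
    from pyspark.sql import functions as F

    needs_fill = (
        F.col("city").isNull() | F.col("region").isNull() | F.col("country").isNull()
    )
    # Assumption: the distinct strings needing a lookup are few enough to collect.
    strings = [
        r["location_string"]
        for r in df.where(needs_fill).select("location_string").distinct().collect()
    ]
    lookup = decompose_unique(strings, decompose)
    rows = [(s, d.get("city"), d.get("state"), d.get("country")) for s, d in lookup.items()]
    schema = "location_string string, geo_city string, geo_region string, geo_country string"
    geo = spark.createDataFrame(rows, schema)
    return (
        df.join(geo, on="location_string", how="left")
        # coalesce keeps the existing value and only falls back to the lookup.
        .withColumn("city", F.coalesce(F.col("city"), F.col("geo_city")))
        .withColumn("region", F.coalesce(F.col("region"), F.col("geo_region")))
        .withColumn("country", F.coalesce(F.col("country"), F.col("geo_country")))
        .drop("geo_city", "geo_region", "geo_country")
    )
```

The API calls run on the driver here, which keeps the key out of the executors but gives up parallelism.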

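import spark.implicits._

// Field order matters when constructing rows: (location_string, city, region, country).
case class Data(location_string: String, city: String, region: String, country: String)

// Hypothetical placeholders: the geocoder package used in the question is a Python
// library, so substitute whatever geocoding client you actually use from Scala.
case class Geocoded(city: String, state: String, country: String)
val GOOGLE_GEOCODE_API_KEY = "<API KEY HERE>"
def geocode(location: String, key: String): Geocoded = ???

// Returns None (after printing the error) when the lookup fails.
val cleaner = (location_string: String) =>
  try {
    Some(geocode(location_string, GOOGLE_GEOCODE_API_KEY))
  } catch {
    case error: Exception => println(error); None
  }

output.as[Data].dropDuplicates("location_string").map { x =>
  // can also add a blank check with StringUtils.isBlank
  val toCheck = x.city == null || x.country == null || x.region == null
  if (toCheck) {
    cleaner(x.location_string) match {
      case Some(result) =>
        // Use the geocoded value when it is non-null, otherwise keep the existing one.
        val city = Option(result.city).getOrElse(x.city)
        val country = Option(result.country).getOrElse(x.country)
        val region = Option(result.state).getOrElse(x.region)
        Data(x.location_string, city, region, country)
      case None => x
    }
  } else x
}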
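Since the question is about PySpark, here is a rough Python counterpart of the Scala pseudocode above. It is a sketch under assumptions rather than a tested solution: `fill_missing` and `fill_missing_columns` are hypothetical names, and `decompose` stands for a function such as geocoder_decompose_location from the question, returning a dict with 'country', 'state' and 'city' keys. Unlike the pseudocode, it keeps a value that already exists and only uses the geocoded value where the column is null.

```python
def fill_missing(city, region, country, location_string, decompose):
    """Return (city, region, country); call decompose only if a value is missing."""
    if city is not None and region is not None and country is not None:
        return (city, region, country)
    parts = decompose(location_string) or {}
    return (
        city if city is not None else parts.get("city"),
        region if region is not None else parts.get("state"),
        country if country is not None else parts.get("country"),
    )


def fill_missing_columns(df, decompose):
    """Apply fill_missing to a DataFrame with the question's four columns."""
    # Imported lazily so fill_missing stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, StructField, StructType

    names = ("city", "region", "country")
    schema = StructType([StructField(n, StringType()) for n in names])
    fill_udf = F.udf(lambda c, r, n, s: fill_missing(c, r, n, s, decompose), schema)
    # Assumption: marking the UDF nondeterministic should stop the optimizer from
    # evaluating it once per extracted field, which would multiply API calls.
    fill_udf = fill_udf.asNondeterministic()
    filled = df.withColumn(
        "_filled",
        fill_udf(F.col("city"), F.col("region"), F.col("country"), F.col("location_string")),
    )
    for n in names:
        filled = filled.withColumn(n, F.col("_filled." + n))
    return filled.drop("_filled")
```

The condition lives inside the Python helper, so rows that are already complete never reach the geocoding call.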