Apache Spark: casting a string column to int returns null

I have a Spark DataFrame, results, with two string columns that I want to convert to numeric values:

>>> results.show()
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             "43"|                    "20"|
|"BAYLOR MEDICAL C...|             "32"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"MASONIC HOME AND...|  "Not Available"|         "Not Available"|
|"ST HELENA HOSPITAL"|             "41"|                    "20"|
|   "TOURO INFIRMARY"|             "15"|                    "18"|
|"WAHIAWA GENERAL ...|             "17"|                    "10"|
|"ANNA JAQUES HOSP...|             "27"|                    "18"|
|    "CMC-BLUE RIDGE"|             "31"|                    "18"|
|"EVANSTON REGIONA...|             "15"|                    "15"|
|"OKLAHOMA SPINE H...|             "79"|                    "20"|
|"PICKENS COUNTY M...|  "Not Available"|         "Not Available"|
|"PORTNEUF MEDICAL...|             "11"|                    "17"|
|"PRESENCE SAINT J...|             "20"|                    "17"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"SOUTH GEORGIA ME...|    "3 out of 10"|                    "24"|
|"TAMPA GENERAL HO...|             "23"|                    "16"|
+--------------------+-----------------+------------------------+
Trying to do this gives me a table of nulls:

>>> from pyspark.sql.types import IntegerType
>>> results2 = results.select( results["Hospital Name"], results["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score") )
>>> results2.show()
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             null|                    null|
|"BAYLOR MEDICAL C...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"MASONIC HOME AND...|             null|                    null|
|"ST HELENA HOSPITAL"|             null|                    null|
|   "TOURO INFIRMARY"|             null|                    null|
|"WAHIAWA GENERAL ...|             null|                    null|
|"ANNA JAQUES HOSP...|             null|                    null|
|    "CMC-BLUE RIDGE"|             null|                    null|
|"EVANSTON REGIONA...|             null|                    null|
|"OKLAHOMA SPINE H...|             null|                    null|
|"PICKENS COUNTY M...|             null|                    null|
|"PORTNEUF MEDICAL...|             null|                    null|
|"PRESENCE SAINT J...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"SOUTH GEORGIA ME...|             null|                    null|
|"TAMPA GENERAL HO...|             null|                    null|
+--------------------+-----------------+------------------------+

only showing top 20 rows

Is it not possible to cast a string column to an integer in PySpark?

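The values contain literal double-quote characters, so a string such as "43" (quotes included) is not a parseable integer and the cast returns null. Strip the double quotes first, then cast to an integer type. You can do that with the following UDF: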

>>> def stripDQ(string):
...  return string.replace('"', "")
... 
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType, IntegerType
>>> udf_stripDQ = udf(stripDQ, StringType())
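
One caveat with the helper above: Spark hands null cells to a Python UDF as None, and None has no replace method, so stripDQ would raise on a null cell. Below is a minimal null-safe sketch. The names strip_dq_safe and make_strip_dq_udf are hypothetical and not part of the original answer; pyspark is imported inside the factory only so that the plain function can be tried without a Spark installation.

```python
def strip_dq_safe(value):
    """Remove double-quote characters; pass None (a null cell) through unchanged."""
    if value is None:
        return None
    return value.replace('"', "")


def make_strip_dq_udf():
    """Wrap strip_dq_safe as a Spark UDF that returns strings.

    pyspark is imported here rather than at module level (an assumption made
    for this sketch) so the function above stays usable without Spark.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(strip_dq_safe, StringType())
```

The result of make_strip_dq_udf() could be used wherever udf_stripDQ is used in this answer.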
We will apply udf_stripDQ to your DataFrame:

>>> results.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|             "43"|                    "20"|
|"BAYLOR MEDICAL C"|             "32"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"MASONIC HOME AND"|  "Not Available"|         "Not Available"|
+------------------+-----------------+------------------------+
Now we use the UDF to strip the double quotes from both columns:

>>> results1 = results.withColumn("HCAHPS Base Score", udf_stripDQ(results["HCAHPS Base Score"]) ).withColumn("HCAHPS Consistency Score", udf_stripDQ(results["HCAHPS Consistency Score"]) )
>>> results1.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|               43|                      20|
|"BAYLOR MEDICAL C"|               32|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"MASONIC HOME AND"|    Not Available|           Not Available|
+------------------+-----------------+------------------------+
Now cast to integer:

>>> results2 = results1.select( results1["Hospital Name"], results1["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results1["HCAHPS Consistency Score"].cast(IntegerType()).alias("HPS Consistency Score") )
>>> results2.show()
+------------------+-----------------+---------------------+
|     Hospital Name|HCAHPS Base Score|HPS Consistency Score|
+------------------+-----------------+---------------------+
|"ADIRONDACK MEDIC"|               43|                   20|
|"BAYLOR MEDICAL C"|               32|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"MASONIC HOME AND"|             null|                 null|
+------------------+-----------------+---------------------+
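
As an alternative to a Python UDF, the same two steps (strip the quotes, then cast) can probably be expressed with Spark's built-in regexp_replace, which avoids passing every row through Python. This is a sketch under assumptions, not the original answer's method: strip_quotes_and_cast is a hypothetical helper name, and the import is placed inside the function only so the definition loads without Spark.

```python
def strip_quotes_and_cast(df, columns):
    """Return df with each named column stripped of '"' and cast to int.

    Assumes df is a pyspark.sql.DataFrame and columns is a list of names
    of string columns in it.
    """
    # Imported lazily; assumes pyspark is installed where this runs.
    from pyspark.sql import functions as F

    for name in columns:
        # Remove every double-quote character from the string value.
        cleaned = F.regexp_replace(df[name], '"', "")
        # Replace the column in place with its integer cast.
        df = df.withColumn(name, cleaned.cast("int"))
    return df
```

Usage would look like results2 = strip_quotes_and_cast(results, ["HCAHPS Base Score", "HCAHPS Consistency Score"]). As in the output above, values such as Not Available still come back as null after the cast. If ANSI mode (spark.sql.ansi.enabled) is turned on, an invalid cast raises an error instead of returning null, so those values would need to be filtered or replaced first.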