Python: How do I combine all the values into one column in PySpark?

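I am currently trying to combine all the gender variants into a single column named "Gender". I did this successfully with pandas, but now I want to do it with PySpark, which works a little differently from pandas. I cannot call .apply in PySpark.

This is the version I completed with pandas: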

df['Gender'] = df['Gender'].str.lower()

male = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr","cis man", "cis male"]
female = ["cis female", "f", "female", "woman",  "femake", "female ","cis-female/femme", "female (cis)", "femail", "trans-female",  "trans woman", "female (trans)"]
other = ["non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "neuter", "queer", "ostensibly male, unsure what that really means", "queer/she/they", "something kinda male?", "a little about you", "p"]

new_df['Gender'] = new_df['Gender'].apply(lambda x:"Male" if x in male else x)
new_df['Gender'] = new_df['Gender'].apply(lambda x:"Female" if x in female else x)
new_df['Gender'] = new_df['Gender'].apply(lambda x:"Other" if x in other else x)
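This is the version I tried to replicate with PySpark, but I am having trouble putting all the converted values back into the "Gender" column:

This is another version of my attempted solution, but it gives me an error, "Cannot convert column into bool":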

from pyspark.sql.functions import lower, col, udf

na_df = na_df.withColumn('Gender', lower(col('Gender')))

genders = {
    'Male': ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr","cis man", "cis male"],
    'Female': ["cis female", "f", "female", "woman",  "femake", "female ","cis-female/femme", "female (cis)", "femail", "trans-female",  "trans woman", "female (trans)"],
    'Other': ["non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "neuter", "queer", "ostensibly male, unsure what that really means", "queer/she/they", "something kinda male?", "a little about you", "p"]
}

na_df.withColumn('Gender', (lambda x: [g for g in genders if x in genders[g]][0])(col('Gender'))).show()

The result I get is that the "Gender" column has not been updated, so please advise on what I can do to fix the problem. Thanks in advance.

You can do this by chaining calls to the when function:

import pyspark.sql.functions as f

# Input data; male, female and other are the lists defined in the question
df.show()
+---+----------+
| id|    gender|
+---+----------+
|  1|      male|
|  1|         m|
|  1|  male-ish|
|  1|     maile|
|  1|       mal|
|  1|male (cis)|
|  1|      make|
|  1|     male |
|  1|       man|
|  1|      msle|
|  1|      mail|
|  1|      malr|
|  1|   cis man|
|  1|  cis male|
|  1|cis female|
|  1|         f|
|  1|    female|
|  1|     woman|
|  1|    femake|
|  1|   female |
+---+----------+

df = df.withColumn(
    'gender',
    f.when(f.col('gender').isin(male), f.lit('Male'))
     .when(f.col('gender').isin(other), f.lit('Other'))
     .when(f.col('gender').isin(female), f.lit('Female'))
     .otherwise(f.col('gender')),
)


df.show()
+---+------+
| id|gender|
+---+------+
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|Female|
|  1|Female|
|  1|Female|
|  1|Female|
|  1|Female|
|  1|Female|
+---+------+
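
The same mapping can also be written as a SQL CASE WHEN expression, which is the selectExpr route mentioned in the comments. The sketch below only builds the expression text in plain Python. The helper name build_case_expr and the shortened lists are assumptions made for illustration, and the helper assumes that none of the values contain a single quote or a backslash.

```python
def build_case_expr(column, groups):
    """Return a Spark SQL CASE expression that maps raw values to group labels.

    Hypothetical helper for illustration, not part of PySpark.
    """
    branches = []
    for label, values in groups.items():
        for text in [label, *values]:
            # Assumption: no quoting/escaping is attempted, so reject risky input.
            if "'" in text or "\\" in text:
                raise ValueError(f"unsupported character in {text!r}")
        quoted = ", ".join(f"'{v}'" for v in values)
        branches.append(f"WHEN {column} IN ({quoted}) THEN '{label}'")
    # Values that match no list fall through unchanged, like otherwise(col).
    return "CASE " + " ".join(branches) + f" ELSE {column} END"


# Shortened lists for illustration; the full lists from the question work the same way.
groups = {
    "Male": ["male", "m", "man"],
    "Female": ["female", "f", "woman"],
    "Other": ["non-binary", "enby"],
}

case_sql = build_case_expr("gender", groups)
print(case_sql)

# With a live SparkSession and a DataFrame df (not created here), either of
# these should apply the expression:
#   df = df.withColumn("gender", f.expr(case_sql))
#   df = df.selectExpr("id", f"{case_sql} AS gender")
```

Keeping the lists in one dictionary means a new spelling only has to be added in one place, at the cost of building SQL text by hand.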

Your pandas code has a better alternative, by the way (unrelated to this question, just saying: do not use apply for cases like this); take a look at how np.select works. With PySpark, you can try when and otherwise as shown in the answer below, or a CASE WHEN statement with selectExpr.

The function does not work, and I am not sure why. I have updated my code above for review.

What error are you getting? I hope you have imported pyspark.sql.functions as f.

There is no error, it is just that the result stays the same and nothing happens. It does not group the values so that only "Male", "Female" and "Other" are shown. I have updated my code above for review.

Can you update the question with the data in your DataFrame?

@Shubham Jain Never mind, it is solved, thanks. It was my mistake: I had changed the values so that male became capitalized Male, female became capitalized Female, and so on. Thanks for your help, this is a learning journey.
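
For the pandas side, the np.select suggestion from the first comment might look like the sketch below. The sample data and the shortened lists are made up for illustration. The three apply calls from the question are replaced here by one vectorized call.

```python
import numpy as np
import pandas as pd

# Shortened lists for illustration; the full lists from the question work the same way.
male = ["male", "m", "man"]
female = ["female", "f", "woman"]
other = ["non-binary", "enby"]

# Made-up sample data.
df = pd.DataFrame({"Gender": ["Male", "F", "enby", "Woman", "unknown"]})
gender = df["Gender"].str.lower()

# One boolean mask per group, in the same order as the labels.
conditions = [gender.isin(male), gender.isin(female), gender.isin(other)]
labels = ["Male", "Female", "Other"]

# Rows matching no mask keep their lower-cased original value.
df["Gender"] = np.select(conditions, labels, default=gender)
print(df["Gender"].tolist())
```

np.select picks the first matching condition for each row, so the order of the masks only matters if a value appears in more than one list.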