Python: Why don't the values I assign in PySpark take effect?

Tags: python, pandas, pyspark, apache-spark-sql

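I am working with a DataFrame in PySpark in a Jupyter notebook and ran into a problem: I assign values to the columns I want, but when I call .show() the DataFrame still shows the original values. I don't know what I'm doing wrong.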

What I am trying to do is replicate the LabelEncoder() that I used with Pandas. I already have a working solution using LabelEncoder() with Pandas.

Now I want to do the same thing with PySpark, but PySpark does not support LabelEncoder(), so I assign the values to each column myself. Here is the code I tried:

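# f is assumed to be the usual alias for pyspark.sql.functions
from pyspark.sql import functions as f

new_result = result.withColumn('Gender',f.when(f.col('Gender')== 'Male',f.lit(0)).\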
when(f.col('Gender')== 'Female',f.lit(1)).\
when(f.col('Gender')== 'Other',f.lit(2)).\
otherwise(f.col('Gender')))


new_result = result.withColumn('Country',f.when(f.col('Country')== 'Russia',f.lit(0)).\
when(f.col('Country')== 'Bahamas The',f.lit(1)).\
when(f.col('Country')== 'Sweden',f.lit(2)).\
when(f.col('Country')== 'Philippines',f.lit(3)).\
when(f.col('Country')== 'Singapore',f.lit(4)).\
when(f.col('Country')== 'Germany',f.lit(5)).\
when(f.col('Country')== 'France',f.lit(6)).\
when(f.col('Country')== 'Greece',f.lit(7)).\
when(f.col('Country')== 'Belgium',f.lit(8)).\
when(f.col('Country')== 'Finland',f.lit(9)).\
when(f.col('Country')== 'United States',f.lit(10)).\
when(f.col('Country')== 'India',f.lit(11)).\
when(f.col('Country')== 'China',f.lit(12)).\
when(f.col('Country')== 'Croatia',f.lit(13)).\
when(f.col('Country')== 'Nigeria',f.lit(14)).\
when(f.col('Country')== 'Italy',f.lit(15)).\
when(f.col('Country')== 'Norway',f.lit(16)).\
when(f.col('Country')== 'Spain',f.lit(17)).\
when(f.col('Country')== 'Denmark',f.lit(18)).\
when(f.col('Country')== 'Ireland',f.lit(19)).\
when(f.col('Country')== 'Thailand',f.lit(20)).\
when(f.col('Country')== 'Israel',f.lit(21)).\
when(f.col('Country')== 'Uruguay',f.lit(22)).\
when(f.col('Country')== 'Mexico',f.lit(23)).\
when(f.col('Country')== 'Georgia',f.lit(24)).\
when(f.col('Country')== 'Switzerland',f.lit(25)).\
when(f.col('Country')== 'Latvia',f.lit(26)).\
when(f.col('Country')== 'Canada',f.lit(27)).\
when(f.col('Country')== 'Czech Republic',f.lit(28)).\
when(f.col('Country')== 'Brazil',f.lit(29)).\
when(f.col('Country')== 'Slovenia',f.lit(30)).\
when(f.col('Country')== 'Japan',f.lit(31)).\
when(f.col('Country')== 'New Zealand',f.lit(32)).\
when(f.col('Country')== 'Bosnia and Herzegovina',f.lit(33)).\
when(f.col('Country')== 'Poland',f.lit(34)).\
when(f.col('Country')== 'Portugal',f.lit(35)).\
when(f.col('Country')== 'Australia',f.lit(36)).\
when(f.col('Country')== 'Romania',f.lit(37)).\
when(f.col('Country')== 'Bulgaria',f.lit(38)).\
when(f.col('Country')== 'Austria',f.lit(39)).\
when(f.col('Country')== 'Costa Rica',f.lit(40)).\
when(f.col('Country')== 'South Africa',f.lit(41)).\
when(f.col('Country')== 'Colombia',f.lit(42)).\
when(f.col('Country')== 'Hungary',f.lit(43)).\
when(f.col('Country')== 'United Kingdom',f.lit(44)).\
when(f.col('Country')== 'Moldova',f.lit(45)).\
when(f.col('Country')== 'Netherlands',f.lit(46)).\
otherwise(f.col('Country')))


new_result = result.withColumn('self_employed',f.when(f.col('self_employed')== 'NA',f.lit(0)).\
when(f.col('self_employed')== 'No',f.lit(1)).\
when(f.col('self_employed')== 'Yes',f.lit(2)).\
otherwise(f.col('self_employed')))


new_result = result.withColumn('family_history',f.when(f.col('family_history')== 'No',f.lit(0)).\
when(f.col('family_history')== 'Yes',f.lit(1)).\
otherwise(f.col('family_history')))


new_result = result.withColumn('treatment',f.when(f.col('treatment')== 'No',f.lit(0)).\
when(f.col('treatment')== 'Yes',f.lit(1)).\
otherwise(f.col('treatment')))


new_result = result.withColumn('work_interfere',f.when(f.col('work_interfere')== 'Sometimes',f.lit(2)).\
when(f.col('work_interfere')== 'Rarely',f.lit(1)).\
when(f.col('work_interfere')== 'Often',f.lit(3)).\
when(f.col('work_interfere')== 'Never',f.lit(0)).\
otherwise(f.col('work_interfere')))


new_result = result.withColumn('remote_work',f.when(f.col('remote_work')== 'No',f.lit(0)).\
when(f.col('remote_work')== 'Yes',f.lit(1)).\
otherwise(f.col('remote_work')))


new_result = result.withColumn('tech_company',f.when(f.col('tech_company')== 'No',f.lit(0)).\
when(f.col('tech_company')== 'Yes',f.lit(1)).\
otherwise(f.col('tech_company')))


new_result = result.withColumn('benefits',f.when(f.col('benefits')== 'No',f.lit(0)).\
when(f.col('benefits')== 'Yes',f.lit(1)).\
when(f.col('benefits')== "Don't know",f.lit(2)).\
otherwise(f.col('benefits')))


new_result = result.withColumn('care_options',f.when(f.col('care_options')== 'No',f.lit(0)).\
when(f.col('care_options')== 'Yes',f.lit(1)).\
when(f.col('care_options')== "Not sure",f.lit(2)).\
otherwise(f.col('care_options')))


new_result = result.withColumn('wellness_program',f.when(f.col('wellness_program')== 'No',f.lit(0)).\
when(f.col('wellness_program')== 'Yes',f.lit(1)).\
when(f.col('wellness_program')== "Don't know",f.lit(2)).\
otherwise(f.col('wellness_program')))


new_result = result.withColumn('seek_help',f.when(f.col('seek_help')== 'No',f.lit(0)).\
when(f.col('seek_help')== 'Yes',f.lit(1)).\
when(f.col('seek_help')== "Don't know",f.lit(2)).\
otherwise(f.col('seek_help')))


new_result = result.withColumn('anonymity',f.when(f.col('anonymity')== 'No',f.lit(0)).\
when(f.col('anonymity')== 'Yes',f.lit(1)).\
when(f.col('anonymity')== "Don't know",f.lit(2)).\
otherwise(f.col('anonymity')))


new_result = result.withColumn('leave',f.when(f.col('leave')== 'Somewhat difficult',f.lit(0)).\
when(f.col('leave')== 'Somewhat easy',f.lit(1)).\
when(f.col('leave')== "Don't know",f.lit(2)).\
when(f.col('leave')== "Very difficult",f.lit(3)).\
when(f.col('leave')== "Very easy",f.lit(4)).\
otherwise(f.col('leave')))


new_result = result.withColumn('mental_health_consequence',f.when(f.col('mental_health_consequence')== 'No',f.lit(0)).\
when(f.col('mental_health_consequence')== 'Yes',f.lit(1)).\
when(f.col('mental_health_consequence')== "Maybe",f.lit(2)).\
otherwise(f.col('mental_health_consequence')))


new_result = result.withColumn('phys_health_consequence',f.when(f.col('phys_health_consequence')== 'No',f.lit(0)).\
when(f.col('phys_health_consequence')== 'Yes',f.lit(1)).\
when(f.col('phys_health_consequence')== "Maybe",f.lit(2)).\
otherwise(f.col('phys_health_consequence')))


new_result = result.withColumn('coworkers',f.when(f.col('coworkers')== 'No',f.lit(0)).\
when(f.col('coworkers')== 'Yes',f.lit(1)).\
when(f.col('coworkers')== "Some of them",f.lit(2)).\
otherwise(f.col('coworkers')))


new_result = result.withColumn('supervisor',f.when(f.col('supervisor')== 'No',f.lit(0)).\
when(f.col('supervisor')== 'Yes',f.lit(1)).\
when(f.col('supervisor')== "Some of them",f.lit(2)).\
otherwise(f.col('supervisor')))


new_result = result.withColumn('mental_vs_physical',f.when(f.col('mental_vs_physical')== 'No',f.lit(0)).\
when(f.col('mental_vs_physical')== 'Yes',f.lit(1)).\
when(f.col('mental_vs_physical')== "Don't know",f.lit(2)).\
otherwise(f.col('mental_vs_physical')))


new_result = result.withColumn('obs_consequence',f.when(f.col('obs_consequence')== 'No',f.lit(0)).\
when(f.col('obs_consequence')== 'Yes',f.lit(1)).\
otherwise(f.col('obs_consequence')))


new_result = result.withColumn('mental_issue_in_tech',f.when(f.col('mental_issue_in_tech')== False, 0).otherwise(1))
new_result.show()

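Every time you encode a column, you call withColumn on the original result and overwrite new_result, so only the last encoding survives: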

# new_result assigned
new_result = result.withColumn('Gender',f.when(f.col('Gender')== 'Male',f.lit(0)).\
when(f.col('Gender')== 'Female',f.lit(1)).\
when(f.col('Gender')== 'Other',f.lit(2)).\
otherwise(f.col('Gender')))

# Previously assigned new_result overwritten!!
new_result = result.withColumn('Country',f.when(f.col('Country')== 'Russia',f.lit(0)).\
when(f.col('Country')== 'Bahamas The',f.lit(1)).\
when(f.col('Country')== 'Sweden',f.lit(2)).\
when(f.col('Country')== 'Philippines',f.lit(3)).\
...
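
The effect can be reproduced without Spark. The following is a minimal sketch, assuming a hypothetical FrozenTable class (not part of PySpark) that, like a Spark DataFrame, returns a new object from each with_column call instead of modifying itself:

```python
class FrozenTable:
    """Hypothetical stand-in for an immutable DataFrame (not a PySpark class)."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def with_column(self, name, value):
        # Return a modified copy; the receiver itself is left untouched.
        updated = dict(self._columns)
        updated[name] = value
        return FrozenTable(updated)

    def as_dict(self):
        return dict(self._columns)


result = FrozenTable({"Gender": "Male", "treatment": "Yes"})

# Pattern from the question: every call starts from the unchanged `result`.
overwritten = result.with_column("Gender", 0)
overwritten = result.with_column("treatment", 1)  # the Gender encoding is discarded

# Pattern that keeps every step: each call starts from the previous output.
chained = result
chained = chained.with_column("Gender", 0)
chained = chained.with_column("treatment", 1)

print(overwritten.as_dict())  # {'Gender': 'Male', 'treatment': 1}
print(chained.as_dict())      # {'Gender': 0, 'treatment': 1}
```

Only the last call is visible in the first pattern, which matches what .show() reports in the question.
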
Do this instead:

# Make new_result point to result
new_result = result

# Now you can reassign to the same df each time
new_result = new_result.withColumn('Gender',f.when(f.col('Gender')== 'Male',f.lit(0)).\
when(f.col('Gender')== 'Female',f.lit(1)).\
when(f.col('Gender')== 'Other',f.lit(2)).\
otherwise(f.col('Gender')))

# Reassigning again...
new_result = new_result.withColumn('Country',f.when(f.col('Country')== 'Russia',f.lit(0)).\
when(f.col('Country')== 'Bahamas The',f.lit(1)).\
when(f.col('Country')== 'Sweden',f.lit(2)).\
when(f.col('Country')== 'Philippines',f.lit(3)).\
...