Displaying Markdown with PySpark
My DataFrame has two columns with many unique values (ethnicity, status). I want to find the least and most frequent value in each column and display them neatly. Basically it should look like: LeastFreq ethnicity (occurrences), MostFreq ethnicity (occurrences), LeastFreq status (occurrences), MostFreq status (occurrences). Here is my code, but it fails with: TypeError: %d format: a number is required, not str
# Assumed imports (not shown in the original post)
from IPython.display import Markdown, display
from pyspark.sql.functions import col, count, lit

print("Most and least frequent occurrences for ethnicity and status columns:")
# Count the rows for each distinct value of the two columns
ethnicDF = datingDF.groupBy("ethnicity").agg(count(lit(1)).alias("Total"))
statusDF = datingDF.groupBy("status").agg(count(lit(1)).alias("Total"))
# The first row of each ordering is the least / most frequent value
leastFreqEthnicity = ethnicDF.orderBy(col("Total").asc()).first()
mostFreqEthnicity = ethnicDF.orderBy(col("Total").desc()).first()
leastFreqStatus = statusDF.orderBy(col("Total").asc()).first()
mostFreqStatus = statusDF.orderBy(col("Total").desc()).first()
# Render a one-row Markdown table: labels on top, "value (count)" cells below
display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("leastFreqEthnicity", "mostFreqEthnicity", "leastFreqStatus", "mostFreqStatus",
" (%d occurrences)" % (leastFreqEthnicity["ethnicity"], leastFreqEthnicity["Total"]),
" (%d occurrences)" % (mostFreqEthnicity["ethnicity"], mostFreqEthnicity["Total"]),
" (%d occurrences)" % (leastFreqStatus["status"], leastFreqStatus["Total"]),
" (%d occurrences)" % (mostFreqStatus["status"], mostFreqStatus["Total"]))))
If you added a schema, you may need to cast the "Total" value to IntegerType.
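Separately from the cast, note that in the code as posted each inner format string contains a single %d but receives two values, so Python tries to format the string value with %d, which raises exactly this TypeError. Below is a minimal sketch of the formatting step in plain Python, with a %s added for the value and int() applied to the count in case it arrives as a string. The helper names format_cell and build_table and the sample rows are hypothetical, made-up placeholders standing in for the Row objects returned by .first(); they are not part of the original post.

```python
def format_cell(value, total):
    """Return 'value (N occurrences)' for one table cell."""
    # %s takes the column value, %d takes the count.
    # int() guards against a count that arrives as a string such as "12".
    return "%s (%d occurrences)" % (value, int(total))


def build_table(headers, cells):
    """Build a one-row Markdown table from header labels and cell texts."""
    header_row = "| " + " | ".join(headers) + " |"
    divider = "|" + "|".join(["----"] * len(headers)) + "|"
    body_row = "| " + " | ".join(cells) + " |"
    return "\n".join([header_row, divider, body_row])


# Made-up placeholder rows; a Spark Row supports the same row["name"] lookup.
least_freq_ethnicity = {"ethnicity": "ethnicity_a", "Total": 3}
most_freq_ethnicity = {"ethnicity": "ethnicity_b", "Total": 120}
least_freq_status = {"status": "status_a", "Total": "5"}  # count as a string
most_freq_status = {"status": "status_b", "Total": 98}

table = build_table(
    ["leastFreqEthnicity", "mostFreqEthnicity", "leastFreqStatus", "mostFreqStatus"],
    [
        format_cell(least_freq_ethnicity["ethnicity"], least_freq_ethnicity["Total"]),
        format_cell(most_freq_ethnicity["ethnicity"], most_freq_ethnicity["Total"]),
        format_cell(least_freq_status["status"], least_freq_status["Total"]),
        format_cell(most_freq_status["status"], most_freq_status["Total"]),
    ],
)
print(table)
```

In a notebook the resulting string could be passed to display(Markdown(table)) as in the question. If the count column really is stored as a string, casting it in Spark with col("Total").cast("int") before ordering is another option, since ordering a string column sorts it lexicographically rather than numerically.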