
describe vs printSchema methods on a DataFrame

dataframe apache-spark pyspark


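I am running this code in PySpark, and the difference between the output of describe and printSchema is confusing. Please see the code below.

describe() reports the score column as string, whereas describe without parentheses and printSchema() both report it as int, which is what it actually is.

Here is my DataFrame: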

>>> df.show()
+-------+------+-----+
|   name|course|score|
+-------+------+-----+
| fsdhfu|     a|   56|
| sdjjfd|     a|   57|
|kljsjlk|     b|   23|
|  udjkx|     b|   89|
|    ias|     c|   36|
| jksdkj|     c|   37|
|  usdkj|     d|   48|
+-------+------+-----+

Using describe:

>>> df2.describe()
DataFrame[summary: string, name: string, course: string, score: string]
>>> df2.describe
<bound method DataFrame.describe of DataFrame[name: string, course: string, score: int]>

>>> df2.printSchema()
root
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- score: integer (nullable = true)

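The difference between them is that the schema gives information about the columns, such as their names and data types, whereas describe gives statistical information about the dataset. The following excerpt from the Spark docs describes describe: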

/**
   * Computes basic statistics for numeric and string columns, including count, mean, stddev, min,
   * and max. If no columns are given, this function computes statistics for all numerical or
   * string columns.
   *
   * This function is meant for exploratory data analysis, as we make no guarantee about the
   * backward compatibility of the schema of the resulting Dataset. If you want to
   * programmatically compute summary statistics, use the `agg` function instead.
   *
   * {{{
   *   ds.describe("age", "height").show()
   *
   *   // output:
   *   // summary age   height
   *   // count   10.0  10.0
   *   // mean    53.3  178.05
   *   // stddev  11.6  15.7
   *   // min     18.0  163.0
   *   // max     92.0  192.0
   * }}}
   *
   * Use [[summary]] for expanded statistics and control over which statistics to compute.
   *

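In Python, foo.bar and foo.bar() are both valid expressions, where foo is an object and bar is a method defined in the class of foo. In the former case you access the method bound to the foo object, but you do not call it.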
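The distinction can be seen without Spark at all. The following is a minimal sketch in plain Python; the class `Greeter` and its method `greet` are hypothetical names made up for illustration.

```python
class Greeter:
    """Hypothetical class used only to illustrate bound methods."""

    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"hello, {self.name}"


foo = Greeter("spark")

# Attribute access without parentheses: a bound method object, nothing is executed.
method = foo.greet

# Attribute access followed by a call: the method body runs and returns a value.
result = foo.greet()

print(callable(method))    # the bound method is a callable object
print(method() == result)  # calling it later gives the same value as foo.greet()
```

Printing `method` in the REPL shows a `<bound method ...>` representation, which is the same kind of output that df2.describe produces in the question.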

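Now to PySpark. The expression df2.describe tells us that a method named describe was found on the df2 DataFrame, which is correct; the schema shown in that output is the schema of df2 itself. Calling the method with df2.describe() returns a new DataFrame. You have to call show on that result to see the statistics computed from the original DataFrame. I suggest running the following three commands in order: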

df2.describe
df2.describe()
df2.describe().show()

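describe() is not the schema of the dataset; it contains statistics about the dataset (max, min, mean, stddev, and so on). It even has an additional column named summary. Spark SQL stores these values as strings in a new Spark SQL DataFrame, which is why every column of the result has type string. printSchema() reports the types of the columns of the DataFrame it is called on. So the two are not the same thing at all.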

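Hey, thanks. Actually, I had not asked the real question, which is: "describe() reports the score column as string, whereas describe without parentheses and printSchema() report it as int, which is what it actually is." I have updated the question now.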