Using LIMIT with DESCRIBE in Spark SQL

I use the DESCRIBE keyword to get column information about a temporary view. This is handy, but I have a table where I only want to describe a subset of the columns. I tried to combine LIMIT with DESCRIBE to do this, but couldn't work it out.

Here is a toy dataset created with pyspark:

# make some test data
columns = ['id', 'dogs', 'cats', 'horses', 'people']
vals = [
     (1, 2, 0, 4, 3),
     (2, 0, 1, 2, 4)
]

# create DataFrame
df = spark.createDataFrame(vals, columns)
df.createOrReplaceTempView('df')
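
For reference, this is what the frame contains (output derived from the values above):

df.show()

+---+----+----+------+------+
| id|dogs|cats|horses|people|
+---+----+----+------+------+
|  1|   2|   0|     4|     3|
|  2|   0|   1|     2|     4|
+---+----+----+------+------+
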
Now describe it with SQL:

%%sql

DESCRIBE df
Output:

col_name    data_type
id          bigint
dogs        bigint
cats        bigint
horses      bigint
people      bigint
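
(If you're not in a notebook that provides the %%sql magic, the equivalent call through the session object would be, as a minimal sketch:

spark.sql('DESCRIBE df').show()

which returns the same rows.)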
In reality I have many more columns than this, and what I want is to limit the output of this query. Here are a couple of things I tried:

Attempt 1:

DESCRIBE df
LIMIT 3
Error:

An error was encountered:
"\nextraneous input '3' expecting {<EOF>, '.'}(line 3, pos 6)\n\n== SQL ==\n\nDESCRIBE df\nLIMIT 3 \n------^^^\n"
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 603, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
    raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: "\nextraneous input '3' expecting {<EOF>, '.'}(line 3, pos 6)\n\n== SQL ==\n\nDESCRIBE df\nLIMIT 3 \n------^^^\n"
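Attempt 2 (the query text isn't preserved here; judging from the error below, which flags DESCRIBE at line 4 pos 4 of the statement, it nested DESCRIBE inside a subquery along these lines):

SELECT *
FROM (
    DESCRIBE df
)
LIMIT 3

Error:
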
An error was encountered:
'Table or view not found: DESCRIBE; line 4 pos 4'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 603, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Table or view not found: DESCRIBE; line 4 pos 4'

Does anyone know whether the output of DESCRIBE can be limited?

Here is a way to limit the output of DESCRIBE using pyspark.sql.DataFrame.limit. Run the describe query through SparkSession.sql (or SQLContext.sql on older versions); this returns the result as a DataFrame, on which you can call limit:

df.registerTempTable('df')
spark.sql('describe df').limit(3).show()

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|      id|   bigint|   null|
|    dogs|   bigint|   null|
|    cats|   bigint|   null|
+--------+---------+-------+
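
Because the describe query comes back as an ordinary DataFrame, you can also filter its rows by column name rather than by position. A minimal sketch, assuming the same df view as above (the chosen column names are just for illustration):

# keep only the DESCRIBE rows for the columns of interest
spark.sql('describe df').where("col_name in ('dogs', 'cats')").show()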

However, if you are just after the columns' data types, you can use the DataFrame's dtypes attribute, which returns a list of (name, type) tuples:

df.dtypes

[('id', 'bigint'),
 ('dogs', 'bigint'),
 ('cats', 'bigint'),
 ('horses', 'bigint'),
 ('people', 'bigint')]
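
Since dtypes is a plain Python list of tuples, it can also be turned into a dict for lookups by name (a small sketch using the toy columns above):

# map column name -> data type for quick lookups
dtype_map = dict(df.dtypes)
print(dtype_map['dogs'])  # 'bigint'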

You can slice this list as needed:

df.dtypes[0:3]

[('id', 'bigint'), ('dogs', 'bigint'), ('cats', 'bigint')]

There is also a describe method on DataFrames, which returns summary statistics:

df.describe().show()

+-------+------------------+------------------+------------------+------------------+------------------+
|summary|                id|              dogs|              cats|            horses|            people|
+-------+------------------+------------------+------------------+------------------+------------------+
|  count|                 2|                 2|                 2|                 2|                 2|
|   mean|               1.5|               1.0|               0.5|               3.0|               3.5|
| stddev|0.7071067811865476|1.4142135623730951|0.7071067811865476|1.4142135623730951|0.7071067811865476|
|    min|                 1|                 0|                 0|                 2|                 3|
|    max|                 2|                 2|                 1|                 4|                 4|
+-------+------------------+------------------+------------------+------------------+------------------+

If you want to limit the columns, you can use select with a slice of df.columns:

df.select(df.columns[0:3]).describe().show()

+-------+------------------+------------------+------------------+
|summary|                id|              dogs|              cats|
+-------+------------------+------------------+------------------+
|  count|                 2|                 2|                 2|
|   mean|               1.5|               1.0|               0.5|
| stddev|0.7071067811865476|1.4142135623730951|0.7071067811865476|
|    min|                 1|                 0|                 0|
|    max|                 2|                 2|                 1|
+-------+------------------+------------------+------------------+
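
describe also accepts column names directly, so the same summary can be produced without the intermediate select (a brief alternative sketch; the column names are from the toy data above):

# summary statistics for a handful of named columns
df.describe('id', 'dogs', 'cats').show()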