Python PySpark - Show counts of column data types in a dataframe

How can I view the count of each data type in a Spark dataframe, the way I can with a pandas dataframe?

For example, suppose df is a pandas dataframe:

>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
int_col      5 non-null int64
text_col     5 non-null object
float_col    5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
Here we can see the count of each data type very clearly. How can I do something similar with a Spark dataframe? That is, how can I see how many columns are float, how many are int, and how many are object?


Thanks!

Either use printSchema:

import datetime

df = spark.createDataFrame([("", 1.0, 1, True, datetime.datetime.now())])
df.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: double (nullable = true)
 |-- _3: long (nullable = true)
 |-- _4: boolean (nullable = true)
 |-- _5: timestamp (nullable = true)
or check dtypes:

df.dtypes

[('_1', 'string'),
 ('_2', 'double'),
 ('_3', 'bigint'),
 ('_4', 'boolean'),
 ('_5', 'timestamp')]
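
Since dtypes is just a list of (column, type) pairs, you can tally it directly even for a very wide dataframe. A minimal sketch using the standard library's Counter on the example df above:

from collections import Counter

# tally how many columns share each Spark SQL type string
print(Counter(dtype for _, dtype in df.dtypes))
# Counter({'string': 1, 'double': 1, 'bigint': 1, 'boolean': 1, 'timestamp': 1})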

The code below should get you the result you want:

# create data frame 
df = sqlContext.createDataFrame(
[(1,'Y','Y',0,0,0,2,'Y','N','Y','Y'),
 (2,'N','Y',2,1,2,3,'N','Y','Y','N'),
 (3,'Y','N',3,1,0,0,'N','N','N','N'),
 (4,'N','Y',5,0,1,0,'N','N','N','Y'),
 (5,'Y','N',2,2,0,1,'Y','N','N','Y'),
 (6,'Y','Y',0,0,3,6,'Y','N','Y','N'),
 (7,'N','N',1,1,3,4,'N','Y','N','Y'),
 (8,'Y','Y',1,1,2,0,'Y','Y','N','N')
],
('id', 'compatible', 'product', 'ios', 'pc', 'other', 'devices', 'customer', 'subscriber', 'circle', 'smb')
)

# Find data types of data frame
datatypes_List = df.dtypes

# Querying datatypes_List gives you each column and its data type as a tuple
datatypes_List
[('id', 'bigint'), ('compatible', 'string'), ('product', 'string'), ('ios', 'bigint'), ('pc', 'bigint'), ('other', 'bigint'), ('devices', 'bigint'), ('customer', 'string'), ('subscriber', 'string'), ('circle', 'string'), ('smb', 'string')]

# create an empty dictionary to store output values
dict_count = {}

# loop to count how many times each data type appears in the data frame
for x, y in datatypes_List:
    dict_count[y] = dict_count.get(y, 0) + 1


# query dict_count to find how many times each data type is present in the data frame
dict_count
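
Given the dtypes list above (five bigint columns and six string columns), querying dict_count should print:

{'bigint': 5, 'string': 6}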

I think the easiest way is to use collections.Counter:

df = spark.createDataFrame(
    [(1, 1.2, 'foo'), (2, 2.3, 'bar'), (None, 3.4, 'baz')],
    ["int_col", "float_col", "string_col"]
)

from collections import Counter
print(Counter((x[1] for x in df.dtypes)))
# Counter({'double': 1, 'bigint': 1, 'string': 1})
There is also the pyspark.sql.DataFrame.describe() method:

df.describe().show()
+-------+------------------+--------------+----------+
|summary|           int_col|     float_col|string_col|
+-------+------------------+--------------+----------+
|  count|                 2|             3|         3|
|   mean|               1.5|           2.3|      null|
| stddev|0.7071067811865476|1.099999999999|      null|
|    min|                 1|           1.2|       bar|
|    max|                 2|           3.4|       foo|
+-------+------------------+--------------+----------+

Note that int_col has a count of 2 because, in this example, one of its values is null.
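
If you want buckets closer to pandas' float/int/object split rather than raw Spark type strings, one possible approach (a sketch, not from the original answer) is to group columns by the class of their Spark data type via df.schema:

from collections import Counter

# each field in df.schema carries a DataType instance; grouping by the
# class name buckets columns as e.g. LongType vs. DoubleType vs. StringType
print(Counter(type(field.dataType).__name__ for field in df.schema.fields))
# for the example df above: Counter({'LongType': 1, 'DoubleType': 1, 'StringType': 1})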

Hi, thanks for the suggestion! Unfortunately, I'm not trying to count the missing values in each column. Instead, I just want to see how many columns are float, how many are int, and how many are object.

Hi, thanks for the suggestion! Unfortunately, my dataframe contains thousands of columns, so I can't look through them one by one. Is there any way to summarize the data types?