Python 3.x data profiling: how do I count Null, NaN and empty string values?
I am new to pyspark, and I have a sample dataset:
   Ticker_Modelo  Ticker        Type                           Period  Product           Geography  Source     Unit  Test
0  Model1_Index   Model1 Index  NWE Forties Hydrocraking       Daily   Refinery Margins  NWE        Bloomberg  None  3
1  Model2_Index   Model2 Index  NWE Bonny Light Hydrocraking   Daily   Refinery Margins  NWE        Bloomberg  None  5
2  Model3_Index   Model3 Index  USGC LLS FCC                   Daily   Refinery Margins  USGC       Bloomberg  None  12
3  Model4_Index   Model4 Index  USGC Maya Coking               Daily   Refinery Margins  USGC       Bloomberg  None  67
4  Model6_Index   Model6 Index  USMC WTI FCC                   Daily   Refinery Margins  USMC       Bloomberg  None  45
5  Model5_Index   Model5 Index  USMC WCSS Coking               Daily   Refinery Margins  USMC       Bloomberg  None  22
6  Model7_Index   Model7 Index  USEC Hibernia FCC              Daily   Refinery Margins  USEC       Bloomberg  None
7  Model8_Index   Model8 Index  Singapore Dubai Hydrocracking  Daily   Refinery Margins  Singapore  Bloomberg  None  Null
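I need to profile this data and store the result in a database.
I tried Optimus() and panda_profiler(), but they run the whole profiling and produce an HTML report. I only need a few specific values, and they cannot compute those.
I need to count how many null/NaN/empty-string values each column contains and build a new table from the counts.
I am using pandas and PySpark.
I found an answer that looked helpful, but when I tried to apply it to a single column as a test: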
data_df.filter((data_df["Ticker_Modelo"] == "") | data_df["Ticker_Modelo"].isNull() | isnan(data_df["Ticker_Modelo"])).count()
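it gave me an error: AttributeError: 'Series' object has no attribute 'isNull'
Beyond that, I don't know how to apply it to all columns and transpose the output to get a result like this: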
Count_nulls
Ticker_Modelo 0
Ticker 0
Type 0
Period 0
Product 0
Geography 0
Source 0
Unit 0
Test 2
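You can do the following. The error occurs because data_df is a pandas DataFrame, and isNull() is a method of PySpark columns; the pandas equivalent is isnull(). First convert all Null/None values to pandas NaN, then count the missing values per column: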
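import numpy as np
df = df.replace(['None', 'Null', ''], np.nan)  # replace() returns a copy, so assign it back; '' added to cover empty strings
df.isnull().sum(axis=0).to_frame().rename(columns={0: 'Count_Nulls'})  # missing values per column, one row per column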
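To make the approach above easier to try, here is a minimal sketch on a cut-down copy of the question's data (rows 0, 1, 6 and 7, three columns only). The helper name count_missing, the placeholder list, the table name null_counts and the in-memory SQLite database are all assumptions made for illustration; they are not part of the original answer. Adjust the placeholder list to whatever strings your data uses for missing values; a literal 'None' is left out here so that a column like Unit would not be counted as missing.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical cut-down sample: rows 0, 1, 6 and 7 of the question's data.
df = pd.DataFrame(
    {
        "Ticker_Modelo": ["Model1_Index", "Model2_Index", "Model7_Index", "Model8_Index"],
        "Source": ["Bloomberg", "Bloomberg", "Bloomberg", "Bloomberg"],
        "Test": ["3", "5", "", "Null"],
    }
)


def count_missing(frame, placeholders=("Null", "")):
    """Count missing values per column, treating placeholder strings as missing."""
    # Turn the placeholder strings into real NaN so isnull() can see them.
    cleaned = frame.replace(list(placeholders), np.nan)
    # One row per original column, one column holding the count.
    return cleaned.isnull().sum(axis=0).to_frame(name="Count_Nulls")


counts = count_missing(df)
print(counts)

# Store the counts as a new table (in-memory SQLite, purely for illustration).
conn = sqlite3.connect(":memory:")
counts.to_sql("null_counts", conn, index_label="column_name")
stored = pd.read_sql("SELECT * FROM null_counts", conn)
conn.close()
print(stored)
```

For a real database you would pass your own connection to to_sql instead of the in-memory one. Real None and NaN values are also counted, since isnull() already treats them as missing.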