Exception while reading Excel sheets in Python

python pandas pyspark

I am reading sheets from an Excel file, and I need to store the data as JSON in HDFS. For some sheets I am getting an exception.

import pandas as pd

# sparkSession and hdfsPath are assumed to be defined elsewhere
excel_file = pd.ExcelFile("export_n_moreExportData10846.xls")
for sheet_name in excel_file.sheet_names:
    df = pd.read_excel(excel_file, header=None, squeeze=True, sheet_name=sheet_name)
    if sheet_name == 'Passed':
        print('**************' + sheet_name + '******************')
        # Use the first row as the header and the remaining rows as the data
        for i, row in df.iterrows():
            data = df.iloc[(i+1):].reset_index(drop=True)
            data.columns = pd.Series(list(df.iloc[i])).str.replace(' ', '_')
            break

        for c in data.columns:
            data[c] = pd.to_numeric(data[c], errors='ignore')
        print(data)  # I'm able to print the data

        result1 = sparkSession.createDataFrame(data)  # Facing the exception here
        print("inserting data into HDFS...")
        result1.write.mode("append").json(hdfsPath)
        print("inserted data into hdfs")

I am getting the following exception:

raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

The data is shown in the image.

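This is probably because some columns hold different data types within the same column. pandas can handle that (as the 'object' dtype), but a Spark DataFrame cannot.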
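To check whether this is what is happening, one option is to list the Python types found in each column. The snippet below is only a sketch: `mixed_type_columns` is a hypothetical helper and the sample frame is made up, loosely modelled on the "TR Defect" column discussed in the comments.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return {column: set of Python type names} for columns holding more than one type."""
    mixed = {}
    for col in df.columns:
        # Inspect the Python type of every cell; a missing value (NaN) shows up as float
        kinds = set(type(v).__name__ for v in df[col])
        if len(kinds) > 1:
            mixed[col] = kinds
    return mixed

# Made-up sample: the column is empty except for a string in the last row
sample = pd.DataFrame({
    "Test_Case": ["tc1", "tc2", "tc3"],
    "TR_Defect": [float("nan"), float("nan"), "TR-123"],
})
report = mixed_type_columns(sample)
print(report)
```

Any column that appears in the report mixes types (here floats from NaN and a string), which is the kind of column Spark's schema inference is likely to reject.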

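There are two ways to handle this:

You can skip the Spark DataFrame stage: convert the pandas DataFrame to a list of dicts (df.to_dict(orient='records')), then load that into an RDD and save it (consider using json loads and dumps to turn it into proper JSON).

Cast the object columns to strings (df[col] = df[col].astype(str)).

Which one to use depends on what exactly you want.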
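Both options can be sketched roughly as follows. The sample frame is made up, and the Spark step is left as a comment because it needs a running session; `sparkSession` and `hdfsPath` are the names from the question.

```python
import json
import pandas as pd

# Made-up sample with a mixed-type column (None, a float and a string)
data = pd.DataFrame({
    "Test_Case": ["tc1", "tc2", "tc3"],
    "TR_Defect": [None, 12.5, "TR-123"],
})

# Option 1: skip the Spark DataFrame and build one JSON document per row.
# default=str is a fallback for values json cannot serialize on its own.
records = data.to_dict(orient="records")
json_lines = [json.dumps(r, default=str) for r in records]
# With Spark available, the lines could then be written as text, for example:
#   sparkSession.sparkContext.parallelize(json_lines).saveAsTextFile(hdfsPath)
# (saveAsTextFile expects that the output path does not exist yet)

# Option 2: cast every object column to str so each column has a single type
casted = data.copy()
for col in casted.columns:
    if casted[col].dtype == object:
        casted[col] = casted[col].astype(str)

print(json_lines)
print(casted.dtypes)
```

Option 1 keeps the original value types in the JSON output, while option 2 keeps the Spark DataFrame workflow at the cost of turning numbers in those columns into strings.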

For this data, data.fillna('0', inplace=True) worked, because the column contained empty records.
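A small sketch of why the result has to be kept, using a made-up frame shaped like the "TR Defect" column. Note that it fills with the string '0' rather than the number 0, so the column ends up containing only strings.

```python
import pandas as pd

# Made-up sample: missing values everywhere except the last row
data = pd.DataFrame({"TR_Defect": [float("nan"), float("nan"), "TR-123"]})

# fillna returns a new DataFrame; without reassignment `data` stays unchanged
data.fillna("0")
still_missing = int(data["TR_Defect"].isna().sum())

# Reassign the result (or pass inplace=True) so the change is kept
data = data.fillna("0")
remaining = int(data["TR_Defect"].isna().sum())

# After filling with a string, every cell in the column is a str
kinds = set(type(v).__name__ for v in data["TR_Defect"])
print(still_missing, remaining, kinds)
```

With a single type per column, sparkSession.createDataFrame(data) should no longer have conflicting types to merge for that column.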

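Can you post the pandas DataFrame that you print (print data)?

I added the data.

I need sample data. The data you added is in a form that is hard to use (it needs OCR :). But looking at it, your problem does not seem to be what I thought. The "TR defect" column is not a string problem, it is a null problem. Try df.fillna(0) and let us know what happens.

I used data.fillna(0). When I print the data, the values in the "TR Defect" column are NaN, except for the last one.

The result was probably not saved. Try data.fillna('0', inplace=True) or data = data.fillna('0').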