Java: exception when reading an Excel file with duplicate column names using the spark-excel library. How can I overcome this problem?


I am using the spark-excel (com.crealytics.spark.excel) library to read Excel files. The library works fine as long as the Excel file contains no duplicate column names; if any column name appears more than once, it throws the exception below.

How can I overcome this error? Is there a workaround?

Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `net territory`;
    at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)

I am getting the exception when using the spark-excel API:
    StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("CGISAI", DataTypes.StringType, true), DataTypes.createStructField("SALES TERRITORY", DataTypes.StringType, true)});
    SQLContext sqlcxt = new SQLContext(jsc);
    Dataset<Row> df = sqlcxt.read()
            .format("com.crealytics.spark.excel")
            .option("path", "file:///"+siteinfofile)
            .option("useHeader", "true")
            .option("spark.read.simpleMode", "true")
            .option("treatEmptyValuesAsNulls", "true")
            .option("inferSchema", "false")
            .option("addColorColumns", "False")
            .option("sheetName", "sheet1")
            .option("startColumn", 22)
            .option("endColumn", 23)
            //.schema(schema)
            .load();
    return df;


This is the code I am using. I am using the spark-excel library from com.crealytics.spark.excel.

I want a way to detect whether the Excel file has duplicate columns and, if it does, to rename or eliminate the duplicates.
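One possible approach (a sketch, not from the original post) is to read the sheet without treating the first row as a header, extract the header row yourself, and assign deduplicated names via `df.toDF(...)`. The renaming step is plain Java; `DedupHeaders` is a hypothetical helper name:

```java
import java.util.HashMap;
import java.util.Map;

public class DedupHeaders {
    // Append a numeric suffix to repeated column names so the schema becomes unique,
    // e.g. ["CGISAI", "net territory", "net territory"]
    //   -> ["CGISAI", "net territory", "net territory_2"].
    public static String[] dedup(String[] names) {
        Map<String, Integer> seen = new HashMap<>();
        String[] out = new String[names.length];
        for (int i = 0; i < names.length; i++) {
            int n = seen.merge(names[i], 1, Integer::sum);  // count occurrences so far
            out[i] = (n == 1) ? names[i] : names[i] + "_" + n;
        }
        return out;
    }
}
```

The resulting array could then be applied to the raw DataFrame with `raw.toDF(uniqueNames)` (dropping the header row afterwards), avoiding the duplicate-column schema check entirely.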




The workaround is as follows:
convert the .xlsx file into .csv, then read it with Spark's default CSV API, which can handle duplicate column names by renaming them automatically.

Below is the code to convert from xlsx to csv file.


package com.huawei.java.tools;

/**
 *
 * @author Nanaji Jonnadula
 */

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

import static org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
public class ExcelXlsx2Csv {


    private static class SheetToCSV implements SheetContentsHandler {
        private boolean firstCellOfRow = false;
        private int currentRow = -1;
        private int currentCol = -1;

        private StringBuffer lineBuffer = new StringBuffer();

        /** Destination for data */
        private FileOutputStream outputStream;

        public SheetToCSV(FileOutputStream outputStream) {
            this.outputStream = outputStream;
        }

        @Override
        public void startRow(int rowNum) {
            /** * If there were gaps, output the missing rows: * outputMissingRows(rowNum - currentRow - 1); */
            // Prepare for this row
            firstCellOfRow = true;
            currentRow = rowNum;
            currentCol = -1;

            lineBuffer.delete(0, lineBuffer.length());  //clear lineBuffer
        }

        @Override
        public void endRow(int rowNum) {
            lineBuffer.append('\n');
            try {
                outputStream.write(lineBuffer.substring(0).getBytes());
            } catch (IOException e) {
                System.out.println("error saving data to file at row number: " + rowNum);
                throw new RuntimeException("error saving data to file at row number: " + rowNum, e);
            }
        }

        @Override
        public void cell(String cellReference, String formattedValue, XSSFComment comment) {
            if (firstCellOfRow) {
                firstCellOfRow = false;
            } else {
                lineBuffer.append(',');
            }

            // gracefully handle missing CellRef here in a similar way as XSSFCell does
            if (cellReference == null) {
                cellReference = new CellAddress(currentRow, currentCol).formatAsString();
            }

            int thisCol = (new CellReference(cellReference)).getCol();
            int missedCols = thisCol - currentCol - 1;
            for (int i = 0; i < missedCols; i++) {  // emit one empty field per skipped column
                lineBuffer.append(',');
            }
            currentCol = thisCol;
            if (formattedValue.contains("\n")) {    // strip embedded newlines so each record stays on one line
                formattedValue = formattedValue.replace("\n", "");
            }
            // escape embedded quotes and wrap the field so commas inside values stay intact
            formattedValue = "\"" + formattedValue.replace("\"", "\"\"") + "\"";
            lineBuffer.append(formattedValue);
        }

        @Override
        public void headerFooter(String text, boolean isHeader, String tagName) {
            // Skip, no headers or footers in CSV
        }
    }



    private static void processSheet(StylesTable styles, ReadOnlySharedStringsTable strings,
                                    SheetContentsHandler sheetHandler, InputStream sheetInputStream) throws Exception {
        DataFormatter formatter = new DataFormatter();
        InputSource sheetSource = new InputSource(sheetInputStream);
        try {
            XMLReader sheetParser = SAXHelper.newXMLReader();
            ContentHandler handler = new XSSFSheetXMLHandler(
                    styles, null, strings, sheetHandler, formatter, false);
            sheetParser.setContentHandler(handler);
            sheetParser.parse(sheetSource);
        } catch (ParserConfigurationException e) {
            throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
        }
    }


    public static void process(String srcFile, String destFile,String sheetname_) throws Exception {
        File xlsxFile = new File(srcFile);
        OPCPackage xlsxPackage = OPCPackage.open(xlsxFile.getPath(), PackageAccess.READ);
        ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(xlsxPackage);
        XSSFReader xssfReader = new XSSFReader(xlsxPackage);
        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        int index = 0;
        while (iter.hasNext()) {
            InputStream stream = iter.next();
            String sheetName = iter.getSheetName();
            System.out.println(sheetName + " [index=" + index + "]");
            if (sheetName.equals(sheetname_)) {
                FileOutputStream fileOutputStream = new FileOutputStream(destFile);
                processSheet(styles, strings, new SheetToCSV(fileOutputStream), stream);
                fileOutputStream.flush();
                fileOutputStream.close();
            }
            stream.close();

            ++index;
        }
        xlsxPackage.close();
    }

    public static void main(String[] args) throws Exception {
        ExcelXlsx2Csv.process("D:\\data\\latest.xlsx", "D:\\data\\latest.csv","sheet1"); //source , destination, sheetname
    }
}
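Once the converter above has produced the .csv, it can be read back with Spark's built-in CSV source, as the workaround describes. A minimal sketch, assuming the same `jsc` (JavaSparkContext) from the question's code and the output path used in `main`:

```java
// Sketch: read the generated CSV with Spark's default CSV source instead of spark-excel.
SQLContext sqlcxt = new SQLContext(jsc);
Dataset<Row> df = sqlcxt.read()
        .format("csv")
        .option("header", "true")             // first line of the CSV holds the column names
        .option("inferSchema", "false")
        .load("file:///D:/data/latest.csv");  // path written by ExcelXlsx2Csv.process(...)
```

Unlike spark-excel, the CSV reader does not fail on the duplicate header names; it renames them, so the DataFrame loads and the duplicate columns can then be inspected or dropped.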