Java: getting an exception when reading an Excel file with duplicate column names via the spark-excel library. How can I work around this?
I am reading an Excel file with the spark-excel (com.crealytics.spark.excel) library. The library works fine when the Excel file has no duplicate columns, but if any duplicate column name appears, the exception below is thrown. How can I work around this error? Is there a solution to overcome this problem?
Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `net territory`;
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
This is the code that triggers the exception when using the spark-excel API:
StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("CGISAI", DataTypes.StringType, true), DataTypes.createStructField("SALES TERRITORY", DataTypes.StringType, true)});
SQLContext sqlcxt = new SQLContext(jsc);
Dataset<Row> df = sqlcxt.read()
.format("com.crealytics.spark.excel")
.option("path", "file:///"+siteinfofile)
.option("useHeader", "true")
.option("spark.read.simpleMode", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "false")
.option("addColorColumns", "False")
.option("sheetName", "sheet1")
.option("startColumn", 22)
.option("endColumn", 23)
//.schema(schema)
.load();
return df;
This is the code I am using, with the spark-excel library from com.crealytics.spark.excel.
I want a solution that identifies whether the Excel file has duplicate columns and, if it does, shows how to rename or eliminate them.
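One way to pre-check, before handing the file to spark-excel, is to scan the header row yourself. Below is a minimal sketch (the class and method names are hypothetical, not part of spark-excel); it compares names case-insensitively, since Spark's duplicate check is case-insensitive by default (spark.sql.caseSensitive=false):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: reports which column names occur more than once
// in a header row, using case-insensitive comparison like Spark's default.
public class HeaderCheck {
    public static List<String> findDuplicates(List<String> header) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String col : header) {
            String key = col.trim().toLowerCase(Locale.ROOT);
            // add() returns false when the name was already seen
            if (!seen.add(key) && !dups.contains(key)) {
                dups.add(key);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("CGISAI", "NET TERRITORY", "net territory");
        System.out.println(findDuplicates(header)); // [net territory]
    }
}
```

If the returned list is non-empty, you can fail fast with a clear message instead of letting Spark's AnalysisException surface later.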
The workaround is as follows:
Convert the .xlsx file into a .csv, then read it with Spark's default CSV API, which handles duplicate column names by renaming them automatically.
Below is the code to convert an .xlsx file to a .csv file.
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package com.huawei.java.tools;
/**
*
* @author Nanaji Jonnadula
*/
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import static org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
public class ExcelXlsx2Csv {
private static class SheetToCSV implements SheetContentsHandler {
private boolean firstCellOfRow = false;
private int currentRow = -1;
private int currentCol = -1;
private StringBuffer lineBuffer = new StringBuffer();
/** * Destination for data */
private FileOutputStream outputStream;
public SheetToCSV(FileOutputStream outputStream) {
this.outputStream = outputStream;
}
@Override
public void startRow(int rowNum) {
/** * If there were gaps, output the missing rows: * outputMissingRows(rowNum - currentRow - 1); */
// Prepare for this row
firstCellOfRow = true;
currentRow = rowNum;
currentCol = -1;
lineBuffer.delete(0, lineBuffer.length()); //clear lineBuffer
}
@Override
public void endRow(int rowNum) {
lineBuffer.append('\n');
try {
outputStream.write(lineBuffer.substring(0).getBytes());
} catch (IOException e) {
System.out.println("save data to file error at row number: " + currentRow);
throw new RuntimeException("save data to file error at row number: " + currentRow, e);
}
}
@Override
public void cell(String cellReference, String formattedValue, XSSFComment comment) {
if (firstCellOfRow) {
firstCellOfRow = false;
} else {
lineBuffer.append(',');
}
// gracefully handle missing CellRef here in a similar way as XSSFCell does
if (cellReference == null) {
cellReference = new CellAddress(currentRow, currentCol).formatAsString();
}
int thisCol = (new CellReference(cellReference)).getCol();
int missedCols = thisCol - currentCol - 1;
for (int i = 0; i < missedCols; i++) { // emit one empty field per skipped column
lineBuffer.append(',');
}
currentCol = thisCol;
if (formattedValue.contains("\n")) {
formattedValue = formattedValue.replace("\n", "");
}
formattedValue = "\"" + formattedValue.replace("\"", "\"\"") + "\""; // quote the field, escaping embedded quotes
lineBuffer.append(formattedValue);
}
@Override
public void headerFooter(String text, boolean isHeader, String tagName) {
// Skip, no headers or footers in CSV
}
}
private static void processSheet(StylesTable styles, ReadOnlySharedStringsTable strings,
SheetContentsHandler sheetHandler, InputStream sheetInputStream) throws Exception {
DataFormatter formatter = new DataFormatter();
InputSource sheetSource = new InputSource(sheetInputStream);
try {
XMLReader sheetParser = SAXHelper.newXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler(
styles, null, strings, sheetHandler, formatter, false);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
} catch (ParserConfigurationException e) {
throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
}
}
public static void process(String srcFile, String destFile, String sheetname_) throws Exception {
File xlsxFile = new File(srcFile);
OPCPackage xlsxPackage = OPCPackage.open(xlsxFile.getPath(), PackageAccess.READ);
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(xlsxPackage);
XSSFReader xssfReader = new XSSFReader(xlsxPackage);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
int index = 0;
while (iter.hasNext()) {
InputStream stream = iter.next();
String sheetName = iter.getSheetName();
System.out.println(sheetName + " [index=" + index + "]");
if(sheetName.equals(sheetname_)){
FileOutputStream fileOutputStream = new FileOutputStream(destFile);
processSheet(styles, strings, new SheetToCSV(fileOutputStream), stream);
fileOutputStream.flush();
fileOutputStream.close();
}
stream.close();
++index;
}
xlsxPackage.close();
}
public static void main(String[] args) throws Exception {
ExcelXlsx2Csv.process("D:\\data\\latest.xlsx", "D:\\data\\latest.csv","sheet1"); //source , destination, sheetname
}
}
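For reference, the automatic renaming that Spark's CSV reader applies to duplicate headers can be approximated in plain Java: repeated names get their column index appended (this suffix scheme is an assumption; Spark's exact naming may vary by version). The class and method below are hypothetical illustrations:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of duplicate-header renaming: any name that occurs
// more than once gets its zero-based column index appended.
public class HeaderDedup {
    public static List<String> dedup(List<String> header) {
        Map<String, Integer> counts = new HashMap<>();
        for (String col : header) {
            counts.merge(col, 1, Integer::sum); // count occurrences of each name
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < header.size(); i++) {
            String col = header.get(i);
            // only rename names that are actually duplicated
            out.add(counts.get(col) > 1 ? col + i : col);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dedup(Arrays.asList("a", "b", "b"))); // [a, b1, b2]
    }
}
```

This is why reading the converted CSV succeeds where spark-excel failed: the duplicate names become distinct before the schema check runs.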