Java: getting an exception when reading an Excel file with duplicate column names via the spark-excel library. How can I work around this?
I am reading an Excel file with the spark-excel (com.crealytics.spark.excel) library. The library works fine when the Excel file has no duplicate columns, but if any duplicate column name appears, the exception below is thrown. How can I work around this error? Is there a solution to overcome this problem?
Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `net territory`;
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
This is the code that triggers the exception when using the spark-excel API:
StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("CGISAI", DataTypes.StringType, true), DataTypes.createStructField("SALES TERRITORY", DataTypes.StringType, true)});
SQLContext sqlcxt = new SQLContext(jsc);
Dataset<Row> df = sqlcxt.read()
.format("com.crealytics.spark.excel")
.option("path", "file:///"+siteinfofile)
.option("useHeader", "true")
.option("spark.read.simpleMode", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "false")
.option("addColorColumns", "False")
.option("sheetName", "sheet1")
.option("startColumn", 22)
.option("endColumn", 23)
//.schema(schema)
.load();
return df;
This is the code I am using, with the spark-excel library from com.crealytics.spark.excel.
I want a solution that identifies whether the Excel file has duplicate columns and, if it does, shows how to rename or eliminate them.
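One way to pre-check, before handing the file to spark-excel, is to scan the header row yourself. Below is a minimal sketch (the class and method names are hypothetical, not part of spark-excel); it compares names case-insensitively, since Spark's duplicate check is case-insensitive by default (spark.sql.caseSensitive=false):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: reports which column names occur more than once
// in a header row, using case-insensitive comparison like Spark's default.
public class HeaderCheck {
    public static List<String> findDuplicates(List<String> header) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String col : header) {
            String key = col.trim().toLowerCase(Locale.ROOT);
            // add() returns false when the name was already seen
            if (!seen.add(key) && !dups.contains(key)) {
                dups.add(key);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("CGISAI", "NET TERRITORY", "net territory");
        System.out.println(findDuplicates(header)); // [net territory]
    }
}
```

If the returned list is non-empty, you can fail fast with a clear message instead of letting Spark's AnalysisException surface later.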
The workaround is as follows:
Convert the .xlsx file into a .csv, then read it with Spark's default CSV API, which handles duplicate column names by renaming them automatically.
Below is the code to convert an .xlsx file to a .csv file.
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package com.huawei.java.tools;
/**
*
* @author Nanaji Jonnadula
*/
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import static org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
public class ExcelXlsx2Csv {
private static class SheetToCSV implements SheetContentsHandler {
private boolean firstCellOfRow = false;
private int currentRow = -1;
private int currentCol = -1;
private StringBuffer lineBuffer = new StringBuffer();
/** * Destination for data */
private FileOutputStream outputStream;
public SheetToCSV(FileOutputStream outputStream) {
this.outputStream = outputStream;
}
@Override
public void startRow(int rowNum) {
/** * If there were gaps, output the missing rows: * outputMissingRows(rowNum - currentRow - 1); */
// Prepare for this row
firstCellOfRow = true;
currentRow = rowNum;
currentCol = -1;
lineBuffer.delete(0, lineBuffer.length()); //clear lineBuffer
}
@Override
public void endRow(int rowNum) {
lineBuffer.append('\n');
try {
outputStream.write(lineBuffer.substring(0).getBytes());
} catch (IOException e) {
System.out.println("save data to file error at row number: " + currentRow);
throw new RuntimeException("save data to file error at row number: " + currentRow, e);
}
}
@Override
public void cell(String cellReference, String formattedValue, XSSFComment comment) {
if (firstCellOfRow) {
firstCellOfRow = false;
} else {
lineBuffer.append(',');
}
// gracefully handle missing CellRef here in a similar way as XSSFCell does
if (cellReference == null) {
cellReference = new CellAddress(currentRow, currentCol).formatAsString();
}
int thisCol = (new CellReference(cellReference)).getCol();
int missedCols = thisCol - currentCol - 1;
for (int i = 0; i < missedCols; i++) { // emit one empty field per skipped column
lineBuffer.append(',');
}
currentCol = thisCol;
if (formattedValue.contains("\n")) {
formattedValue = formattedValue.replace("\n", "");
}
formattedValue = "\"" + formattedValue.replace("\"", "\"\"") + "\""; // quote the field, escaping embedded quotes
lineBuffer.append(formattedValue);
}
@Override
public void headerFooter(String text, boolean isHeader, String tagName) {
// Skip, no headers or footers in CSV
}
}
private static void processSheet(StylesTable styles, ReadOnlySharedStringsTable strings,
SheetContentsHandler sheetHandler, InputStream sheetInputStream) throws Exception {
DataFormatter formatter = new DataFormatter();
InputSource sheetSource = new InputSource(sheetInputStream);
try {
XMLReader sheetParser = SAXHelper.newXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler(
styles, null, strings, sheetHandler, formatter, false);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
} catch (ParserConfigurationException e) {
throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
}
}
public static void process(String srcFile, String destFile, String sheetname_) throws Exception {
File xlsxFile = new File(srcFile);
OPCPackage xlsxPackage = OPCPackage.open(xlsxFile.getPath(), PackageAccess.READ);
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(xlsxPackage);
XSSFReader xssfReader = new XSSFReader(xlsxPackage);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
int index = 0;
while (iter.hasNext()) {
InputStream stream = iter.next();
String sheetName = iter.getSheetName();
System.out.println(sheetName + " [index=" + index + "]");
if(sheetName.equals(sheetname_)){
FileOutputStream fileOutputStream = new FileOutputStream(destFile);
processSheet(styles, strings, new SheetToCSV(fileOutputStream), stream);
fileOutputStream.flush();
fileOutputStream.close();
}
stream.close();
++index;
}
xlsxPackage.close();
}
public static void main(String[] args) throws Exception {
ExcelXlsx2Csv.process("D:\\data\\latest.xlsx", "D:\\data\\latest.csv","sheet1"); //source , destination, sheetname
}
}
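For reference, the automatic renaming that Spark's CSV reader applies to duplicate headers can be approximated in plain Java: repeated names get their column index appended (this suffix scheme is an assumption; Spark's exact naming may vary by version). The class and method below are hypothetical illustrations:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of duplicate-header renaming: any name that occurs
// more than once gets its zero-based column index appended.
public class HeaderDedup {
    public static List<String> dedup(List<String> header) {
        Map<String, Integer> counts = new HashMap<>();
        for (String col : header) {
            counts.merge(col, 1, Integer::sum); // count occurrences of each name
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < header.size(); i++) {
            String col = header.get(i);
            // only rename names that are actually duplicated
            out.add(counts.get(col) > 1 ? col + i : col);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dedup(Arrays.asList("a", "b", "b"))); // [a, b1, b2]
    }
}
```

This is why reading the converted CSV succeeds where spark-excel failed: the duplicate names become distinct before the schema check runs.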