
Spark: Reading a CSV file with a header


I have a CSV file with 90 columns and about 28,000 rows.

I want to load it and split it into training (75%) and test (25%) sets. I used the following code:

Code:

I get the following error on "data":

What is the problem? I searched the web and could not find a clear answer.

I found this and changed my code as follows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile(datadir + "/dados_frontwave_corte_pedra_ferramenta.csv")
val parsedData = data.map { line =>
  val parts = line.split(",")
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(",").map(x => x.toDouble).toArray))
}
That error went away, but when I run the code it produces the following one:

15/04/15 16:53:52 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NumberFormatException: For input string: ""12316""
at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
at sun.misc.FloatingDecimal.parseDouble(Unknown Source)
at scala.collection.immutable.StringLike$class.toDouble(StringOps.scala:31)
....
A sample of the input file:

"","Ferramenta","Pedra","ensaio","Nrasgo","Vcorte","Vrotacional","PCorte","Tempo","Fh","Fr","Energia","Caudal","Vib_disco","Vib_maquina","xx","yy","zz","Fonte","id","rocha_classe","rocha_tipo","Resistência_mecânica_à_compressão","Res._mec._à_compr._após_teste_de_gelividade","Resistência_mecânica_à_flexão","Massa_volúmica_aparente","Absorção_de_água_à_P._At.N.","Porosidade_aberta","Coef._de_dilatação_linear_térmica_val._máx","Resistência_ao_desgaste","Resistência_ao_choque_altura_minima_de_queda","Resistência_ao_gelo","Al2O3","CaO","H2O.","K2O","MgO","MnO","Na2O","P2O5","SiO2","TiO2","microclina","plagioclase","quartzo","page_id","rocha_nome_2","P.R._.L.O.I..","plagioclase_.oligoclase.albite.","feldspato_potássico_.microclina.","feldspato_potássico_.essencialmente_microclina.","biotite","rocha_nome_3","oligoclase","plagioclase_.andesina.","horneblenda","feldspato_potássico","nefelina","aegirina_e_aegirina.augite","esfena","piroxena","olivina","horneblenda_verde","plagioclase_.oligoclase.","CO2","clorite","cloritóide","quartzo.feldspato","SO3","cloritóide.clorite","calcite","dolomite","serpentina_.antigorite.crisótilo.","mica_.biotite.moscovite.","feldspato","Fe2O3","plagioclase_ácida","cristobalite","rocha_nome_1","Ferramentas","Binder","LM","graf","WC","T","sigma","epsilon","m","E","H"
"1","A-010-13","estremoz","ECE-E1",5,26,1430,5,6.08,-0.0981,57,720,23.5,0.9,3.5,162,197.2,5,"ECE-A-010-13-Estremoz_1",2,"sedimentares ","calcário",960,767,276,2711,0.07,0.18,11.5,3.4,57.5,48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"A-010/13","A-038/11","LM 156",0,0,800,425,0.0062,1.09,159,2085
"2","A-010-13","estremoz","ECE-E1",5,26,1430,5,5.9,-0.0981,63,720,23.5,0.9,3.5,157,197.2,5,"ECE-A-010-13-Estremoz_1",2,"sedimentares ","calcário",960,767,276,2711,0.07,0.18,11.5,3.4,57.5,48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"A-010/13","A-038/11","LM 156",0,0,800,425,0.0062,1.09,159,2085

It may be a strange solution, but try this:

val parts = line.split(",").map(x => x.replace("\"", "")).filter(x => x.length > 0)
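
For context, this is how that line would slot into the map from the question. It is only a sketch: with just the quotes stripped, any field that is still non-numeric (such as "Ferramenta" in the header row) will go on raising a NumberFormatException, which is exactly what the comments below come back to.

val parsedData = data.map { line =>
  // strip the surrounding double quotes, then drop fields that end up empty
  val parts = line.split(",").map(x => x.replace("\"", "")).filter(x => x.length > 0)
  // still assumes every remaining field parses as a Double; the header row does not
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.drop(1).map(_.toDouble)))
}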

In your first example, the train method expects an RDD, and you are passing it an array. collect is an action, not a transformation; removing the call to collect should solve your problem.
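
To spell that out, here is a minimal sketch of the difference (parsedData is the RDD built in the question; the iteration count is just the one used later in this answer):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// collect is an action: it materializes the RDD as a local Array[LabeledPoint],
// which train cannot accept
val localPoints = parsedData.collect()

// keep parsedData as an RDD and hand it to train directly
val model = LinearRegressionWithSGD.train(parsedData, 20)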

This should work:

val data = sc.textFile(datadir + "/dados_frontwave_corte_pedra_ferramenta.csv")
             .map(line => line.split(","))
             .filter(line => line.length > 1)

// Building the model
val numIterations = 20
val model = LinearRegressionWithSGD.train(data, numIterations)

Do you mean val parts = line.split(",").map(x => x.replace("\"", "")) instead of val parts = line.split(",")?

Exactly, you need to do something about the quotes, and I did. But the only change is that the error becomes: java.lang.NumberFormatException: empty String. After adding the filter, the error changes to: java.lang.NumberFormatException: For input string: "Ferramenta"

Right, so LabeledPoint works with Doubles, and "Ferramenta" is far from being one. LinearRegressionWithSGD.train works with LabeledPoint, but your CSV contains strings as well as numbers. I don't know what exactly you are trying to do, but as far as parsing the CSV goes, my solution works. Maybe you could have a look at spark-csv, which can do the parsing for you.

You have double quotes inside your fields, and you need to trim them!! I would be glad if you gave us a line or two of the input format.

The first three lines of the input file have been added to the question.

Like I said, you have to trim the double quotes in the map stage. The first field contains a double surrounded by double quotes, so it gives exactly the error you got.
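
Pulling the comments together, here is one way the whole pipeline could look. This is only a sketch: the choice of column 4 (Nrasgo) as the label and columns 5 to 8 (Vcorte through Tempo) as features is an illustrative assumption, as is the fixed seed; adjust the indices to whatever your regression target actually is.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val raw = sc.textFile(datadir + "/dados_frontwave_corte_pedra_ferramenta.csv")
val header = raw.first()                      // the first line holds the column names
val parsed = raw
  .filter(line => line != header)             // drop the header row
  .map { line =>
    // trim the double quotes in the map stage, as suggested above
    val parts = line.split(",").map(_.replace("\"", ""))
    // assumption: field 4 (Nrasgo) as label, fields 5-8 (Vcorte..Tempo) as features
    LabeledPoint(parts(4).toDouble, Vectors.dense(parts.slice(5, 9).map(_.toDouble)))
  }

// the 75% / 25% train/test split asked for in the question
val Array(training, test) = parsed.randomSplit(Array(0.75, 0.25), seed = 42L)
val model = LinearRegressionWithSGD.train(training, 20)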