
Scala: Reading a CSV file in Apache Spark where the values contain the delimiter


What is an efficient way, in Apache Spark, to read a CSV file in which the values themselves contain the delimiter?

Here is my dataset:

ID,Name,Age,Add,ress,Salary
1,Ross,32,Ah,med,abad,2000
2,Rachel,25,Delhi,1500
3,Chandler,23,Kota,2000
4,Monika,25,Mumbai,6500
5,Mike,27,Bhopal,8500
6,Phoebe,22,MP,4500
7,Joey,24,Indore,10000

The data needs to be cleaned first, because a DataFrame cannot be built reliably when the delimiter shows up unpredictably inside the text values.
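To see why, here is a small sketch (mine, not from the question) of what a naive read does, assuming a SparkSession named spark and the file saved as file.csv: rows whose address contains extra commas produce more fields than the header, so values drift into the wrong columns.

val naive = spark.read.option("header", "true").csv("file.csv")
naive.show(false)
// For "1,Ross,32,Ah,med,abad,2000" the reader sees 7 fields against a
// 6-column header, so parts of the address spill into the ress/Salary
// columns and the real salary value ends up misplaced or lost.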

One approach is to move the last column to the front and wrap the original address data in quotes:

import spark.implicits._

val rdd = sc.textFile("file.csv")

// Move the last column (Salary) to the front, so the variable-length
// address part becomes the final field of every line.
val rdd2 = rdd.map(s => s.substring(s.lastIndexOf(",") + 1)
               + "," + s.substring(0, s.lastIndexOf(",")))

// Wrap everything after the fourth comma in double quotes: the regex inserts
// a quote right after the fourth comma and another after the last character,
// then the lines are turned into a Dataset[String].
val stringDataset = rdd2.map(s => s.replaceAll("(.*?,.*?,.*?,.*?,|.$)", "$1\"")).toDS()

// Parse the quoted dataset with the CSV reader to get the DataFrame.
val df = spark.read.option("header", "true").csv(stringDataset)
df.show()
Output:

+------+---+--------+---+-----------+
|Salary| ID|    Name|Age|   Add,ress|
+------+---+--------+---+-----------+
|  2000|  1|    Ross| 32|Ah,med,abad|
|  1500|  2|  Rachel| 25|      Delhi|
|  2000|  3|Chandler| 23|       Kota|
|  6500|  4|  Monika| 25|     Mumbai|
|  8500|  5|    Mike| 27|     Bhopal|
|  4500|  6|  Phoebe| 22|         MP|
| 10000|  7|    Joey| 24|     Indore|
+------+---+--------+---+-----------+

Is "Add,ress" a typo for "Address"? No, it is not a typo; the data simply arrives that way. Even if we handle the header separately, how do we handle the data rows? Given that, the file will only ever contain that fixed number of columns.
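Given that fixed layout (three leading fields, one trailing salary field, and everything in between belonging to the address), another option is to split each line manually and rebuild the address. This is a sketch of mine under that assumption, not part of the answers, reusing the same sc/spark session and file.csv path:

import spark.implicits._

val lines  = sc.textFile("file.csv")
val header = lines.first()

// The first three fields and the last field are fixed; whatever sits between
// them is the address, possibly containing commas of its own.
val parsed = lines
  .filter(_ != header)
  .map { line =>
    val parts   = line.split(",")
    val address = parts.slice(3, parts.length - 1).mkString(",")
    (parts(0), parts(1), parts(2), address, parts.last)
  }
  .toDF("ID", "Name", "Age", "Address", "Salary")

parsed.show(false)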
//  1. read the csv (fileFullName is the path to the file):
  val df1 = spark.read.option("header", "true").csv(fileFullName)
  df1.show(false)
//  2. this works when the file already has the embedded commas quoted, i.e. the format:
//  ID,Name,Age,Add,ress,Salary
//  1,Ross,32,Ah,"med,abad",2000
//  2,Rachel,25,Delhi,,1500
//  3,Chandler,23,Kota,,2000
//  4,Monika,25,Mumbai,,6500
//  5,Mike,27,Bhopal,,8500
//  6,Phoebe,22,MP,,4500
//  7,Joey,24,Indore,,10000

//  3. result 


//    +---+--------+---+------+--------+------+
//    |ID |Name    |Age|Add   |ress    |Salary|
//    +---+--------+---+------+--------+------+
//    |1  |Ross    |32 |Ah    |med,abad|2000  |
//    |2  |Rachel  |25 |Delhi |null    |1500  |
//    |3  |Chandler|23 |Kota  |null    |2000  |
//    |4  |Monika  |25 |Mumbai|null    |6500  |
//    |5  |Mike    |27 |Bhopal|null    |8500  |
//    |6  |Phoebe  |22 |MP    |null    |4500  |
//    |7  |Joey    |24 |Indore|null    |10000 |
//    +---+--------+---+------+--------+------+
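The raw file in the question is not quoted like that, so here is a minimal sketch (my addition, not part of the answer) that rewrites each raw line into the 6-column quoted shape above and feeds it straight to the CSV reader, again assuming the spark/sc session and file.csv path used earlier:

import spark.implicits._

val rawLines  = sc.textFile("file.csv")
val rawHeader = rawLines.first()

// Rebuild each data line as ID,Name,Age,Add,ress,Salary, quoting the ress
// field whenever it still contains commas.
val normalized = rawLines.map { line =>
  if (line == rawHeader) line
  else {
    val parts = line.split(",")
    val ress  = parts.slice(4, parts.length - 1).mkString(",")
    val ressQuoted = if (ress.contains(",")) "\"" + ress + "\"" else ress
    (parts.take(4) :+ ressQuoted :+ parts.last).mkString(",")
  }
}.toDS()

// The default quote character (") now protects the embedded commas, so the
// plain CSV reader yields the result shown above.
val dfQuoted = spark.read.option("header", "true").csv(normalized)
dfQuoted.show(false)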