Scala: How to extract multi-line records from a text file using Spark


I have a text file. Its records are multi-line records separated by "\n\n\n\n\n". The text file looks like this:

name: Steven
gender: male
title: mr.
company: ABC

cell 647-777-****
home 905-000-****
work 289-***-1111





name: Al
gender: male
title: mr.
company: DEF

home 905-111-****
cell 289-991-****
Here is the code I have so far:

val contact_raw = sc.wholeTextFiles("/user/data/contact.txt").flatMap(x => x._2.split("\n\n\n\n\n"))
val contact = contact_raw.map(contacts => {
    val per_person = contacts.split("\n\n")
    (per_person(0), per_person(1))
}).map(
    contact_info => {
        val personal_info = contact_info._1.split("\n")
        var name = ""
        var company = ""
        var gender = ""
        var title = ""
        for (x <- personal_info) {
            if(x.startsWith("name:")){
                name = x.split("name:")(1).trim
            } else if(x.startsWith("gender:")){
                gender = x.split("gender:")(1).trim
            } else if(x.startsWith("title:")){
                title = x.split("title:")(1).trim
            } else if(x.startsWith("company:")){
                company = x.split("company:")(1).trim
            } 
        }
        val phone_info = contact_info._2.split("\n").map(
                pair => {
                    val phone_pair = pair.split("\\s")
                    (phone_pair(0), phone_pair(1))
                }
            )
        (name, gender, title, company, phone_info)
    }
).toDF("name", "gender", "title", "company", "phone_info")
The schema is:

scala> contact.printSchema
root
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- title: string (nullable = true)
 |-- company: string (nullable = true)
 |-- phone_info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: string (nullable = true)
The current output is:

scala> contact.show
+------+------+-----+-------+--------------------+
|  name|gender|title|company|          phone_info|
+------+------+-----+-------+--------------------+
|Steven|  male|  mr.|    ABC|[[cell,647-777-**...|
|    Al|  male|  mr.|    DEF|[[home,905-111-**...|
+------+------+-----+-------+--------------------+

The expected output is:

+------+------+-----+-------+-------------+------------+
|  name|gender|title|company|   phone_type|number      |
+------+------+-----+-------+-------------+------------+
|Steven|  male|  mr.|    ABC|         cell|647-777-****|
|Steven|  male|  mr.|    ABC|         home|905-000-****|
|Steven|  male|  mr.|    ABC|         work|289-***-1111|
|    Al|  male|  mr.|    DEF|         home|905-111-****|
|    Al|  male|  mr.|    DEF|         cell|289-991-****|
+------+------+-----+-------+-------------+------------+

Can anyone tell me how to modify the code to get the desired output?

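The following approach works. It replaces the second map with flatMap, so each phone line produces its own row instead of being collected into an array: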

// Reuses contact_raw from the question: one string per contact record.
val contact = contact_raw.map(contacts => {
      // Split each record into the personal block and the phone block.
      val per_person = contacts.split("\n\n")
      (per_person(0), per_person(1))
    }).flatMap(
      contact_info => {
        val personal_info = contact_info._1.split("\n")
        var name = ""
        var company = ""
        var gender = ""
        var title = ""
        for (x <- personal_info) {
          if (x.startsWith("name:")) {
            name = x.split("name:")(1).trim
          } else if (x.startsWith("gender:")) {
            gender = x.split("gender:")(1).trim
          } else if (x.startsWith("title:")) {
            title = x.split("title:")(1).trim
          } else if (x.startsWith("company:")) {
            company = x.split("company:")(1).trim
          }
        }
        // Emit one flat tuple per phone line; flatMap turns these into separate rows.
        contact_info._2.split("\n").map(
          pair => {
            val phone_pair = pair.split("\\s")
            (name, gender, title, company, phone_pair(0), phone_pair(1))
          }
        )
      }
    // Six tuple fields need six column names.
    ).toDF("name", "gender", "title", "company", "phone_type", "number")
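
To check the parsing logic without a cluster, the per-record step can be pulled out into a plain function. The sketch below is only an illustration, not the answerer's code: `ContactRow` and `parseContact` are hypothetical names, and it assumes each record consists of "key: value" lines, one blank line, and then "type number" lines.

```scala
// Hypothetical row type for illustration; not part of the original answer.
case class ContactRow(name: String, gender: String, title: String,
                      company: String, phoneType: String, number: String)

// Parses one multi-line record into one row per phone line.
// Assumes a personal block and a phone block separated by one blank line.
def parseContact(record: String): Seq[ContactRow] = {
  val blocks = record.trim.split("\n\n")
  if (blocks.length < 2) Seq.empty
  else {
    // Build a key -> value map from the "key: value" lines.
    val fields = blocks(0).split("\n")
      .map(_.split(":", 2))
      .collect { case Array(k, v) => k.trim -> v.trim }
      .toMap
    // One row per phone line; blank or malformed lines are skipped.
    blocks(1).split("\n").toSeq
      .map(_.trim.split("\\s+", 2))
      .collect { case Array(kind, number) =>
        ContactRow(
          fields.getOrElse("name", ""),
          fields.getOrElse("gender", ""),
          fields.getOrElse("title", ""),
          fields.getOrElse("company", ""),
          kind,
          number.trim)
      }
  }
}

// Sample record taken from the question's input file.
val sample = "name: Al\ngender: male\ntitle: mr.\ncompany: DEF\n\nhome 905-111-****\ncell 289-991-****"
val rows = parseContact(sample)
rows.foreach(println)
```

In spark-shell, something like contact_raw.flatMap(parseContact).toDF() should then give one row per phone number. The column names would come from the case class fields, so the phone columns would be phoneType and number rather than phone_type and number.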