Apache Spark: how to convert an array of JSON strings into a Dataset with specific columns in Spark 2.2.0?


I have a Dataset ds made up of JSON rows.

A sample JSON row (this is just one example row from the dataset):

ds.printSchema()

Now I want to convert it into the following Dataset using Spark 2.2.0:

name  |             address               |  docs 
----------------------------------------------------------------------------------
"foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 2016}]
"bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 2017}]
Java is preferred, but Scala is fine too as long as the functions are available in the Java API.

This is what I have tried so far:

val df = Seq("""["{"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs":[{"subject": "english", "year": 2016}]}", "{"name": "bar", "address": {"state": "OH", "country": "USA"}, "docs":[{"subject": "math", "year": 2017}]}" ]""").toDF
df.show(false)
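One likely reason the attempt above prints a single raw value: inside the triple-quoted Scala literal, the quotes around each inner object are not escaped, so the array elements are not valid JSON strings. As a plain-Java sketch (no Spark; the class and method names here are hypothetical), one row with correctly escaped quotes would look like this:

```java
public class EscapedRow {
    // Hypothetical helper: builds one row of the question's payload as a valid
    // JSON object string, with the quotes that the inner fields need escaped.
    public static String row(String name, String state, String subject, int year) {
        return "{\"name\": \"" + name + "\", "
             + "\"address\": {\"state\": \"" + state + "\", \"country\": \"USA\"}, "
             + "\"docs\": [{\"subject\": \"" + subject + "\", \"year\": " + year + "}]}";
    }

    public static void main(String[] args) {
        // The two rows from the question, wrapped in a JSON array.
        String payload = "[" + row("foo", "CA", "english", 2016) + ", "
                             + row("bar", "OH", "math", 2017) + "]";
        System.out.println(payload);
    }
}
```

Each element is now a well-formed JSON object, so a JSON parser can read the whole array.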


I found a workaround in Java. I hope this helps.

Create a Bean class (TempBean in my case):

import java.util.List;
import java.util.Map;

public class TempBean
    {
        String name;
        Map<String, String> address;
        List<Map<String, String>> docs;
        public String getName()
            {
                return name;
            }
        public void setName(String name)
            {
                this.name = name;
            }
        public Map<String, String> getAddress()
            {
                return address;
            }
        public void setAddress(Map<String, String> address)
            {
                this.address = address;
            }
        public List<Map<String, String>> getDocs()
            {
                return docs;
            }
        public void setDocs(List<Map<String, String>> docs)
            {
                this.docs = docs;
            }

    }
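For what it's worth, createDataFrame(List, Class) discovers the columns through JavaBean introspection of the getter/setter pairs, which would also explain why the columns in the output below appear in alphabetical order (address, docs, name) rather than declaration order. A small stdlib-only sketch of that introspection (re-declaring a minimal copy of the bean so it runs standalone):

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class BeanColumns {
    // Minimal copy of the TempBean above, properties only.
    public static class TempBean {
        private String name;
        private Map<String, String> address;
        private List<Map<String, String>> docs;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Map<String, String> getAddress() { return address; }
        public void setAddress(Map<String, String> address) { this.address = address; }
        public List<Map<String, String>> getDocs() { return docs; }
        public void setDocs(List<Map<String, String>> docs) { this.docs = docs; }
    }

    // Lists the bean properties that would become DataFrame columns.
    public static List<String> columns() throws Exception {
        List<String> names = new ArrayList<>();
        // Stop at Object.class so the inherited "class" property is excluded.
        for (PropertyDescriptor pd
                : Introspector.getBeanInfo(TempBean.class, Object.class).getPropertyDescriptors()) {
            names.add(pd.getName());
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(columns()); // [address, docs, name]
    }
}
```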
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonMappingException;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

ObjectMapper mapper = new ObjectMapper();
List<String> dfList = ds.collectAsList(); // using your Dataset<String>
List<TempBean> tempList = new ArrayList<TempBean>();
try
    {
        for (String json : dfList)
            {
                // Each line is a JSON array of objects; read it as a list of maps.
                List<Map<String, Object>> mapList =
                        mapper.readValue(json, new TypeReference<List<Map<String, Object>>>() {});
                for (Map<String, Object> map : mapList)
                    {
                        // Copy each parsed object into a bean instance.
                        TempBean temp = new TempBean();
                        temp.setName(map.get("name").toString());
                        temp.setAddress((Map<String, String>) map.get("address"));
                        temp.setDocs((List<Map<String, String>>) map.get("docs"));
                        tempList.add(temp);
                    }
            }
    }
catch (JsonParseException e)
    {
        e.printStackTrace();
    }
catch (JsonMappingException e)
    {
        e.printStackTrace();
    }
catch (IOException e)
    {
        e.printStackTrace();
    }
// Build the DataFrame from the beans; the columns come from the bean properties.
Dataset<Row> dff = spark.createDataFrame(tempList, TempBean.class);
dff.show(false);
dff.show(false);
+--------------------------------+---------------------------------------+----+
|address                         |docs                                   |name|
+--------------------------------+---------------------------------------+----+
|Map(state -> CA, country -> USA)|[Map(subject -> english, year -> 2016)]|foo |
|Map(state -> OH, country -> USA)|[Map(subject -> math, year -> 2017)]   |bar |
+--------------------------------+---------------------------------------+----+
dff.printSchema();
root
 |-- address: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- docs: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |-- name: string (nullable = true)