How to apply string operations to a Spark DataFrame in Java
I have a Spark DataFrame that looks like this:
+--------------------+------+----------------+-----+--------+
| Name | Sex| Ticket |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Braund, Mr. Owen ...| male| A/5 21171| null| S|
|Cumings, Mrs. Joh...|female| PC 17599| C85| C|
|Heikkinen, Miss. ...|female|STON/O2. 3101282| null| S|
|Futrelle, Mrs. Ja...|female| 113803| C123| S|
|Palsson, Master. ...| male| 349909| null| S|
+--------------------+------+----------------+-----+--------+
Now I need to transform the "Name" column so that it contains only the title, i.e. Mr., Mrs., Miss., or Master. The resulting column would be:
+--------------------+------+----------------+-----+--------+
| Name | Sex| Ticket |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Mr. | male| A/5 21171| null| S|
|Mrs. |female| PC 17599| C85| C|
|Miss. |female|STON/O2. 3101282| null| S|
|Mrs. |female| 113803| C123| S|
|Master. | male| 349909| null| S|
+--------------------+------+----------------+-----+--------+
I tried applying a substring operation:
List<String> list = Arrays.asList("Mr.", "Mrs.", "Miss.", "Master.");
Dataset<Row> categoricalDF2 = categoricalDF.filter(col("Name").isin(list.stream().toArray(String[]::new)));
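One reason this fails is that `isin` tests the whole cell value for exact equality against the list, while the titles are only substrings of the full names, so no row ever matches. The extraction logic itself can be checked in plain Java without Spark; the class and regex below are my own sketch, not part of the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Matches the first of the four titles found anywhere in the name.
    private static final Pattern TITLE =
            Pattern.compile("(Mr\\.|Mrs\\.|Miss\\.|Master\\.)");

    public static String extractTitle(String name) {
        Matcher m = TITLE.matcher(name);
        return m.find() ? m.group(1) : "Untitled";
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("Braund, Mr. Owen Harris"));  // Mr.
        System.out.println(extractTitle("Heikkinen, Miss. Laina"));   // Miss.
    }
}
```

Note the alternation is safe here: for "Mrs." the branch `Mr\.` fails on the literal dot, so `Mrs\.` matches instead.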
But it doesn't seem to be that easy in Java. How can I implement this in Java? Note that I am using Spark 2.2.0.

I think the code below is enough to do the job:
public class SomeClass {
    ...
    /**
     * Return the title contained in the name.
     */
    public String getTitle(String name) {
        if (name.contains("Mr.")) {           // If it has Mr.
            return "Mr.";
        } else if (name.contains("Mrs.")) {   // Or if it has Mrs.
            return "Mrs.";
        } else if (name.contains("Miss.")) {  // Or if it has Miss.
            return "Miss.";
        } else if (name.contains("Master.")) { // Or if it has Master.
            return "Master.";
        } else {                              // None of them.
            return "Untitled";
        }
    }
}
Finally, I managed to solve this and answer my own question. I extended Mohit's answer with a UDF:
// Needs: org.apache.spark.sql.api.java.UDF1, scala.Option, scala.Some
private static final UDF1<String, Option<String>> getTitle = (String name) -> {
    if (name.contains("Mr.")) {           // If it has Mr.
        return Some.apply("Mr.");
    } else if (name.contains("Mrs.")) {   // Or if it has Mrs.
        return Some.apply("Mrs.");
    } else if (name.contains("Miss.")) {  // Or if it has Miss.
        return Some.apply("Miss.");
    } else if (name.contains("Master.")) { // Or if it has Master.
        return Some.apply("Master.");
    } else {                              // None of them.
        return Some.apply("Untitled");
    }
};
Hey, thanks for the answer. However, I need to apply a similar operation to a DataFrame column, not to a plain string! Would a UDF help with that?
SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "/home/martin/")
        .appName("Titanic")
        .getOrCreate();
Dataset<Row> df = ....
spark.sqlContext().udf().register("getTitle", getTitle, DataTypes.StringType);
Dataset<Row> categoricalDF = df.select(callUDF("getTitle", col("Name")).alias("Name"), col("Sex"), col("Ticket"), col("Cabin"), col("Embarked"));
categoricalDF.show();
+-----+------+----------------+-----+--------+
| Name| Sex| Ticket|Cabin|Embarked|
+-----+------+----------------+-----+--------+
| Mr.| male| A/5 21171| null| S|
| Mrs.|female| PC 17599| C85| C|
|Miss.|female|STON/O2. 3101282| null| S|
| Mrs.|female| 113803| C123| S|
| Mr.| male| 373450| null| S|
+-----+------+----------------+-----+--------+
only showing top 5 rows
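For completeness: the same result can also be obtained without a UDF, using Spark's built-in `regexp_extract` function, which avoids the serialization overhead of a UDF. This is only a sketch assuming `df` is the DataFrame above; the regex is my own, and note that `regexp_extract` returns an empty string (not "Untitled") when nothing matches:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_extract;

// Group 1 of the regex captures one of the four titles.
Dataset<Row> withTitle = df.select(
        regexp_extract(col("Name"), "(Mr\\.|Mrs\\.|Miss\\.|Master\\.)", 1).alias("Name"),
        col("Sex"), col("Ticket"), col("Cabin"), col("Embarked"));
withTitle.show();
```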