Scala: How to extract the last element from a comma-separated string?
Tags: scala, apache-spark, apache-spark-sql

Using this query:
sql("SELECT _location, count(1) FROM tablaTemporal group by _location order by 2 desc" )
I get the following output:
+--------------------------------+--------+
|_location |count(1)|
+--------------------------------+--------+
|London, United Kingdom |15 |
|United States |12 |
|Bangalore, India |8 |
|Hyderabad, India |7 |
|Paris, France |6 |
|San Francisco, CA, United States|6 |
|Mountain View, CA, United States|4 |
|Pune, India |4 |
|Bengaluru, Karnataka, India |3 |
+--------------------------------+--------+
But the result I need is:
+--------------------------------+--------+
|_location |count(1)|
+--------------------------------+--------+
|United States |22 |
|India |22 |
|United Kingdom |15 |
|France |6 |
+--------------------------------+--------+
So I need something like the following statement:
sql("SELECT SubstringOfLocationFromCharComma(_location), count(1) FROM tablaTemporal group by _location order by 2 desc" )
How can I extract the last element from a comma-separated string?

You can use regexp_extract:
import org.apache.spark.sql.functions._
val df = Seq(
"London, United Kingdom", "Bengaluru, Karnataka, India"
).toDF("_location")
df.select(regexp_extract($"_location", ".*,([^,]*)$", 1).alias("country")).show
// +---------------+
// | country|
// +---------------+
// | United Kingdom|
// | India|
// +---------------+
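Two details of the pattern above are worth keeping in mind: the captured group keeps the space that follows the comma, and regexp_extract returns an empty string when nothing matches, which is what happens for a row such as "United States" that contains no comma at all. The sketch below is one possible way to fold the extraction into the original SQL query. It assumes Spark 2.x or later with a SparkSession; the sample rows, the local session, and the column aliases country and cnt are illustrative and not from the question.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("last-element-sql")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the real tablaTemporal.
Seq(
  "London, United Kingdom",
  "United States",
  "San Francisco, CA, United States",
  "Bengaluru, Karnataka, India"
).toDF("_location").createOrReplaceTempView("tablaTemporal")

// ([^,]+)$ captures the last run of non-comma characters, so rows without
// a comma still match; trim removes the space that follows the comma.
val byCountry = spark.sql(
  """SELECT trim(regexp_extract(_location, '([^,]+)$', 1)) AS country,
    |       count(1) AS cnt
    |FROM tablaTemporal
    |GROUP BY trim(regexp_extract(_location, '([^,]+)$', 1))
    |ORDER BY cnt DESC""".stripMargin)

byCountry.show(false)
```

Grouping by the extracted expression rather than by _location is what merges the per-city rows into one row per country.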
Since the country name is the last element after the comma, you can also do the following:
df.show(false)
+--------------------------------+
|a |
+--------------------------------+
|Mountain View, CA, United States|
|Pune, India |
|Bengaluru, Karnataka, India |
+--------------------------------+
df.withColumn("a" , split($"a", ",") ).withColumn("a" , expr("a[ size(a) -1 ] ") ).show
+--------------+
|a |
+--------------+
| United States|
| India |
| India |
+--------------+
Then follow it with
groupBy($"a").agg(sum($"count(1)").as("count"))
to get the desired result.
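Put together, the split-then-aggregate steps might look like the sketch below. It assumes Spark 2.4 or later, because it uses element_at with index -1 in place of the a[size(a) - 1] expression; the column names cnt and country, the local session, and the handful of sample counts are chosen here for illustration. The trim call is an addition that strips the space left behind by splitting on a bare comma.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, element_at, split, sum, trim}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("last-element-split")
  .getOrCreate()
import spark.implicits._

// A few of the per-location counts from the question ("cnt" is an assumed name).
val counts = Seq(
  ("London, United Kingdom", 15L),
  ("United States", 12L),
  ("San Francisco, CA, United States", 6L),
  ("Bengaluru, Karnataka, India", 3L)
).toDF("_location", "cnt")

val byCountry = counts
  // split on commas, take the last array element, drop surrounding spaces
  .withColumn("country", trim(element_at(split(col("_location"), ","), -1)))
  // re-aggregate the per-location counts into per-country totals
  .groupBy("country")
  .agg(sum("cnt").as("count"))
  .orderBy(col("count").desc)

byCountry.show(false)
```

Without the trim, values such as " India" and "India" would end up in different groups whenever the input mixes rows with and without a leading space.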
There is a saying that when you have a problem and reach for a regular expression to solve it, you immediately have two problems :) What is wrong with the good ol' split function?
@JacekLaskowski Isn't that what poor developers say?
It does not work. StackOverflow.scala:106: value $ is not a member of StringContext [error] val df2 = df.select(regexp_extract($"_location", ".*,([^,]*)$", 1).alias("locationSplitted")) Maybe it is because I am using SBT.
That is because there is no implicit conversion in scope.
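The "value $ is not a member of StringContext" error usually means the session's implicits were never imported: spark-shell does that import for you, while a compiled SBT application has to do it explicitly. A minimal sketch of the fix follows, assuming Spark 2.x or later with a SparkSession (on 1.x the equivalent import comes from the SQLContext instance instead); the object name LocationApp and the helper run are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract

object LocationApp { // hypothetical application name

  // Returns the extracted values so the result can be inspected by a caller.
  def run(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode for illustration only
      .appName("LocationApp")
      .getOrCreate()

    // Brings the $"..." column interpolator and toDF into scope.
    // It must come after `spark` is defined as a val.
    import spark.implicits._

    val df = Seq("London, United Kingdom", "Bengaluru, Karnataka, India")
      .toDF("_location")
    val df2 = df.select(
      regexp_extract($"_location", ".*,([^,]*)$", 1).alias("locationSplitted"))

    df2.collect().map(_.getString(0)).toSeq
  }

  def main(args: Array[String]): Unit =
    run().foreach(println)
}
```

If importing the implicits is not desirable, org.apache.spark.sql.functions.col("_location") refers to the same column without needing the $ interpolator.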