Apache Spark: extracting a substring between two string identifiers in a long string


apache-spark pyspark


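I have a simple requirement: I have a DataFrame with just one string field, and it holds a very large string value. I only want to chop it up to pick out the information I need.

The string field in my DataFrame contains the following value: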

Table(tableName:partition_chk, dbName:stage, owner:hive, createTime:1559243466, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:dw_date, type:string, comment:null)], location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{line.delim=, field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dw_date, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1559243466}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)

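I only want to get the base location of the partition from this value, which is: hdfs://share/dev/stage/partition_chk

Note that I only want the string quoted above, without the location: prefix. Do you know which substring operation in pyspark would do the trick?

Thanks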

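There are several ways to do this, but in my opinion a regular expression is the most straightforward. In pyspark you need the regexp_extract function to apply a regex and extract a matching group. The regex itself is the next important piece. The following regex:

location:([a-zA-Z:/_]*)

matches, once it has encountered location:, any run of the following characters:

lowercase letters, uppercase letters, :, / and _. Digits are not in this set, so a path containing one would be cut off at that point. Of course, you could also use something like location:([^,]*), which captures everything after location: up to the first comma, but it really depends on what the possible matches look like. Here is an example: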

    from pyspark.sql import SparkSession, functions as F

    # Reuse the active session, or start one when running outside a Spark shell
    spark = SparkSession.builder.getOrCreate()

    # One row with one column; the trailing comma makes each row a tuple
    l = [
        ("Table(tableName:partition_chk, dbName:stage, owner:hive, createTime:1559243466, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:dw_date, type:string, comment:null)], location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{line.delim=, field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dw_date, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1559243466}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)",),
    ]

    columns = ['hugeString']

    df = spark.createDataFrame(l, columns)

    # collect() turns the DataFrame into a Python list of rows;
    # I'm not sure whether you need this or not.
    # To extract the value into a new column, use withColumn instead of select.
    df.select(
        F.regexp_extract('hugeString', r'location:([a-zA-Z:/_]*)', 1).alias('match')
    ).collect()[0]['match']

Output:

hdfs://share/dev/stage/partition_chk
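
To see how the two patterns mentioned above differ, here is a minimal sketch that runs them through Python's re module instead of Spark. Spark evaluates regexp_extract with Java's regex engine, but for patterns this simple I would expect the same results. The sample string is a shortened copy of the value from the question; other_string and the extract helper are hypothetical and exist only for illustration.

```python
import re

# Shortened sample of the metadata string from the question.
huge_string = (
    "Table(tableName:partition_chk, dbName:stage, "
    "sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null)], "
    "location:hdfs://share/dev/stage/partition_chk, "
    "inputFormat:org.apache.hadoop.mapred.TextInputFormat, compressed:false))"
)

# Hypothetical variant whose path contains a hyphen and a digit.
other_string = (
    "sd:StorageDescriptor(location:hdfs://share/dev-2/stage/partition_chk, "
    "compressed:false)"
)

# Pattern 1: only letters, ':', '/' and '_' are allowed after "location:".
char_class = re.compile(r"location:([a-zA-Z:/_]*)")
# Pattern 2: everything after "location:" up to the first comma.
up_to_comma = re.compile(r"location:([^,]*)")


def extract(pattern, text):
    """Return capture group 1, or an empty string when nothing matches
    (regexp_extract also returns an empty string in that case)."""
    m = pattern.search(text)
    return m.group(1) if m else ""


for text in (huge_string, other_string):
    print(extract(char_class, text), "|", extract(up_to_comma, text))
```

On the string from the question both patterns give the same path. On the hypothetical variant the character-class pattern stops at the hyphen, while the up-to-comma pattern keeps the whole path, so the second one is probably the safer choice if your locations can contain digits, hyphens or dots.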

Thank you very much... exactly what I was looking for!