
Dataframe: split a string column by backslash


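I have a DataFrame with a string column that holds a path and file name, with a backslash as the separator. I am trying to split that column and take the last item as a new column, filename. The data looks like this: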

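+---+-----------------------------------------------------------+
|Id |path                                                       |
+---+-----------------------------------------------------------+
|1  |C:\Program Files\Notepad++\notepad.exe                     |
|2  |C:\Program Files (x86)\Google\Chrome\Application\chrome.exe|
|3  |C:\Windows\SysWOW64\cmd.exe                                |
|4  |C:\Program Files\Microsoft Office\root\Office16\WINWORD.EXE|
|5  |C:\Program Files\Microsoft Office\root\Office16\EXCEL.EXE  |
+---+-----------------------------------------------------------+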

Use regexp_extract. The pattern extracts the alphanumeric name.extension that sits between a \ and the end of the string.

  from pyspark.sql.functions import col, regexp_extract
  df = df.withColumn('filename', regexp_extract(col('path'), r'((?<=\\)\w+\.\w+(?=$))', 1))



+---+-----------------------------------------------------------+-----------+
|Id |path                                                       |filename   |
+---+-----------------------------------------------------------+-----------+
|1  |C:\Program Files\Notepad++\notepad.exe                     |notepad.exe|
|2  |C:\Program Files (x86)\Google\Chrome\Application\chrome.exe|chrome.exe |
|3  |C:\Windows\SysWOW64\cmd.exe                                |cmd.exe    |
|4  |C:\Program Files\Microsoft Office\root\Office16\WINWORD.EXE|WINWORD.EXE|
|5  |C:\Program Files\Microsoft Office\root\Office16\EXCEL.EXE  |EXCEL.EXE  |
+---+-----------------------------------------------------------+-----------+
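The pattern in this answer can be tried outside Spark. The following is a minimal sketch that uses Python's re module as a stand-in for Spark's Java regex engine (the two engines agree on this pattern for plain ASCII paths, but they are not identical in general). The helper name filename and the third sample path are hypothetical and not part of the original question.

```python
import re

# Same pattern as the answer, written as a raw string so that
# Python does not consume any of the backslashes itself.
PATTERN = re.compile(r"(?<=\\)\w+\.\w+(?=$)")


def filename(path: str) -> str:
    """Return the trailing name.ext, or '' when the pattern finds nothing."""
    match = PATTERN.search(path)
    return match.group(0) if match else ""


print(filename(r"C:\Program Files\Notepad++\notepad.exe"))  # notepad.exe
print(filename(r"C:\Windows\SysWOW64\cmd.exe"))             # cmd.exe

# Hypothetical path: \w does not match a space, so nothing is extracted.
print(repr(filename(r"C:\Temp\my report.txt")))             # ''
```

The last call shows a limit of this approach: a file name that contains a space or a hyphen is not matched by \w+\.\w+, and Spark's regexp_extract likewise returns an empty string when the pattern does not match.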

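Four backslashes are needed because the escaping has to be repeated at each parsing level (Python, then Spark): Python reads "\\\\" as two backslash characters, and the regex that split receives then reads those two as one literal backslash: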

import pyspark.sql.functions as f
df.withColumn("filename", f.element_at(f.split(f.col("path"), "\\\\"), -1)).show(truncate=False)
+---+-----------------------------------------------------------+-----------+
|Id |path                                                       |filename   |
+---+-----------------------------------------------------------+-----------+
|1  |C:\Program Files\Notepad++\notepad.exe                     |notepad.exe|
|2  |C:\Program Files (x86)\Google\Chrome\Application\chrome.exe|chrome.exe |
|3  |C:\Windows\SysWOW64\cmd.exe                                |cmd.exe    |
|4  |C:\Program Files\Microsoft Office\root\Office16\WINWORD.EXE|WINWORD.EXE|
|5  |C:\Program Files\Microsoft Office\root\Office16\EXCEL.EXE  |EXCEL.EXE  |
+---+-----------------------------------------------------------+-----------+
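The two escaping levels described in this answer can be checked in plain Python. This is a small sketch, assuming Python's re module as a stand-in for the regex engine that Spark's split uses; the variable names are illustrative only.

```python
import re

# Level 1: the Python parser turns the literal "\\\\" into two backslash characters.
pattern = "\\\\"
assert len(pattern) == 2

# A raw string spells the same two characters with less noise.
assert pattern == r"\\"

# Level 2: the regex engine reads those two characters as one literal backslash.
# Python's re is used here as a stand-in for Spark's regex engine.
path = r"C:\Windows\SysWOW64\cmd.exe"
parts = re.split(pattern, path)

print(parts)      # ['C:', 'Windows', 'SysWOW64', 'cmd.exe']
print(parts[-1])  # cmd.exe
```

Taking parts[-1] here plays the same role as element_at(..., -1) in the Spark expression: it picks the last element of the split result.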