How to convert HTML text to plain text using PySpark? Removing HTML tags from a string

Tags: pyspark, pyspark-sql

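I have a text file with a column "descn" that contains some text, but the text is in HTML format. I want to convert this HTML text to plain text using PySpark. Please help me with this.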

File name: mdcl_insigt.txt

Input:

PROTEUSÂ <div><br></div><div>We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.</div>
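You can try using regexp_replace with a regular expression that matches HTML tags and replaces each of them with an empty string.

The regular expression is not perfect and may fail on some inputs, so please do some more research to make it more robust.

When I tried it, it worked on your sample string.

PySpark output: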

from pyspark.sql import functions as F
df.withColumn("parsed", F.regexp_replace("descn", "<[^>]+>", "")).select("parsed").collect()

[Row(parsed='PROTEUSÂ We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.')]
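Note that the output above still contains the entity `&#39;`: `regexp_replace` removes tags only and does not decode HTML entities. One possible way to handle both is a Python UDF that strips tags with the same regular expression and then decodes entities with the standard library's `html.unescape`. The following is a minimal sketch, not part of the original answer; the names `html_to_text` and `add_plain_text_column` are hypothetical, and the DataFrame is assumed to have a string column named `descn`.

```python
import html
import re

# Same pattern as in the answer: anything between "<" and the next ">"
TAG_RE = re.compile(r"<[^>]+>")


def html_to_text(value):
    """Strip tags, then decode entities such as &#39; or &amp;."""
    if value is None:
        return None
    # Remove tags first so that escaped markup like &lt;b&gt; survives as text
    without_tags = TAG_RE.sub("", value)
    return html.unescape(without_tags)


def add_plain_text_column(df, src="descn", dst="parsed_descn"):
    """Hypothetical helper: apply html_to_text to a DataFrame column."""
    # Imported here so html_to_text can be used and tested without Spark
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    to_text = udf(html_to_text, StringType())
    return df.withColumn(dst, to_text(df[src]))
```

Python UDFs are generally slower than built-in functions such as `regexp_replace`, so this trade-off may matter on large data.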

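Comments:

Thanks for your reply, but this is written in Python. I want a solution in PySpark.

You can do any kind of Python coding in PySpark, @abhishek.

Hi @pissall, thanks for your reply. Can you elaborate? I am unable to understand your code.

@abhishek Which part do you not understand? I used a regular expression to find all HTML tags and then replaced them with an empty string.

Actually, the column is in the text file mdcl_insigt.txt. So can you tell me how to read this text file and convert the HTML text to plain text for the descn column?

This Q&A is specifically about removing HTML tags from column text in PySpark. For other questions, you can ask another question. You can close this question by selecting the tick mark to the left of the answer.

I have accepted your answer. Can you tell me how to read a text file in PySpark?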
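from pyspark.sql.functions import regexp_replace

# Replace every HTML tag ("<", then any characters other than ">", then ">") in descn with an empty string
df = df.withColumn("parsed_descn", regexp_replace("descn", "<[^>]+>", ""))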
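The answer warns that the regular expression may fail. One case where `<[^>]+>` does fail is a `>` character inside a quoted attribute value, because the match stops at the first `>`. As a possible alternative (a swapped-in technique, not the answerer's method), the standard library's `html.parser.HTMLParser` can collect only the text nodes. The sketch below is plain Python; the names `_TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags and comments."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#39; in text nodes
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(value):
    """Return the text content of an HTML fragment, or None for None."""
    if value is None:
        return None
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

For the input `<a title="a > b">link</a> text`, `strip_html` returns `link text`, whereas the regular expression leaves the fragment ` b">` behind. One caveat: this sketch also keeps the contents of `<script>` and `<style>` elements, since the parser reports them as text. To use it on the `descn` column, the function could be wrapped with `pyspark.sql.functions.udf`.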