How to create a custom tokenizer in PySpark ML

If I run the following:
sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)
I would expect character-level output, like this:

tokenized.head()
Row(id=0, sentence='Hi I heard about Spark',
    words=['H', 'i', ' ', 'h', 'e', 'a', …])

However, the actual result is word-level:

Row(id=0, sentence='Hi I heard about Spark',
    words=['hi', 'i', 'heard', 'about', 'spark'])
Is there any way to achieve this with Tokenizer or RegexTokenizer in PySpark?
Tokenizer splits only on whitespace, but RegexTokenizer, as its name suggests, uses a regular expression to find either the split points or the tokens to be extracted (this is configurable via the gaps parameter). If you pass an empty pattern and leave gaps=True (which is the default), you should get the desired character-level result (note that both tokenizers lowercase their input by default, so the characters come out lowercased):

Row(id=0, sentence='Hi I heard about Spark',
    words=['h', 'i', ' ', 'h', 'e', 'a', …])
from pyspark.ml.feature import RegexTokenizer

# An empty pattern with gaps=True (the default) splits between every character.
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="")
tokenized = tokenizer.transform(sentenceDataFrame)
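Outside of Spark, the gaps semantics are easy to mimic in plain Python, which helps when debugging a pattern: with gaps=True the regex describes the separators (like re.split), while with gaps=False it describes the tokens themselves (like re.findall). A minimal sketch, assuming Spark's defaults of toLowercase=True and minTokenLength=1; regex_tokenize is an illustrative helper for experimenting with patterns, not a PySpark API:

```python
import re

def regex_tokenize(text, pattern, gaps=True, to_lowercase=True, min_token_length=1):
    """Rough Python analogue of RegexTokenizer's behavior (illustrative only)."""
    if to_lowercase:
        text = text.lower()
    # gaps=True: pattern matches the separators; gaps=False: pattern matches tokens.
    tokens = re.split(pattern, text) if gaps else re.findall(pattern, text)
    # Spark drops tokens shorter than minTokenLength (default 1), i.e. empty strings.
    return [t for t in tokens if len(t) >= min_token_length]

sentence = "Hi I heard about Spark"

# Empty pattern with gaps=True splits between every character (spaces included):
chars = regex_tokenize(sentence, "")

# gaps=False with a token pattern extracts the words themselves:
words = regex_tokenize(sentence, r"\w+", gaps=False)
```

With gaps=False and a pattern such as \w+ the regex picks out the tokens directly instead of the gaps between them, which is often the easier direction when building a more custom tokenization.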