How to create a custom tokenizer in PySpark ML

Tags: python, apache-spark, pyspark, spark-dataframe, apache-spark-mllib


If I run the following:

from pyspark.ml.feature import Tokenizer

sentenceDataFrame = spark.createDataFrame([
        (0, "Hi I heard about Spark"),
        (1, "I wish Java could use case classes"),
        (2, "Logistic,regression,models,are,neat")
    ], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)
then

tokenized.head()

returns the usual whitespace tokenization:

Row(id=0, sentence='Hi I heard about Spark',
    words=['hi', 'i', 'heard', 'about', 'spark'])

What I would like instead is character-level output, something like:

Row(id=0, sentence='Hi I heard about Spark',
    words=['H', 'i', ' ', 'h', 'e', 'a', ...])
Is there a way to achieve this with Tokenizer or RegexTokenizer in PySpark?

A similar question was asked here:

Have a look at the docs: Tokenizer only splits on whitespace, while RegexTokenizer, as its name suggests, uses a regular expression to find either the split points or the tokens to be extracted (which of the two is controlled by the gaps parameter). If you pass an empty pattern and leave gaps=True (the default), you should get the desired result:

Row(id=0, sentence='Hi I heard about Spark',
    words=['h', 'i', ' ', 'h', 'e', 'a', ...])

from pyspark.ml.feature import RegexTokenizer

# An empty pattern with gaps=True (the default) splits between every
# character. Note that RegexTokenizer lowercases its input by default;
# set toLowercase=False if you need to preserve case.
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="")
tokenized = tokenizer.transform(sentenceDataFrame)
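Spark aside, the gaps semantics can be sketched with Python's built-in re module: with gaps=True the pattern describes the separators between tokens (analogous to re.split), while with gaps=False it describes the tokens themselves (analogous to re.findall). A minimal illustration, not using PySpark:

```python
import re

sentence = "Hi I heard about Spark"

# gaps=True analogue: the pattern matches the separators between tokens
print(re.split(r"\s+", sentence))    # ['Hi', 'I', 'heard', 'about', 'Spark']

# gaps=False analogue: the pattern matches the tokens themselves
print(re.findall(r"\w+", sentence))  # ['Hi', 'I', 'heard', 'about', 'Spark']

# Splitting between every position yields individual characters, which is
# what RegexTokenizer(pattern="") with gaps=True produces in Spark
print(list(sentence[:2]))            # ['H', 'i']
```

For the comma-separated third sentence, a pattern such as [\s,]+ with gaps=True (or \w+ with gaps=False) would tokenize on both spaces and commas.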