Logistic regression: getting the same coefficients with PySpark MLlib and statsmodels


I am interested in the statistical summary that statsmodels provides while working with PySpark. I am new to PySpark. As a first step, I tried running a logistic regression with both statsmodels and PySpark to match the intercept and coefficients. However, the results I get are different.

The part common to both:

from pyspark.sql import SparkSession

file_location = "/FileStore/tables/bank.csv"
spark = SparkSession.builder.appName('ml-bank').getOrCreate()
df = spark.read.csv(file_location, header=True, inferSchema=True)
temp_table_name = "bank_csv"
df.createOrReplaceTempView(temp_table_name)
numeric_features = [t[0] for t in df.dtypes if t[1] == 'int']
cols = df.columns  # keep the original column names so they can be re-selected after the pipeline runs

from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ['marital']
stages = []
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()],
                            outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
label_stringIdx = StringIndexer(inputCol='deposit', outputCol='label')
stages += [label_stringIdx]
numericCols = ['age', 'balance']
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)
# Train/test split
train, test = df.randomSplit([0.7, 0.3], seed=2018)
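Before fitting anything, it helps to know how Spark encoded marital (this check is my addition, not part of the original question): StringIndexer assigns indices in order of descending frequency, and OneHotEncoder drops the last category by default (dropLast=True), so Spark's reference level is the least frequent marital status. statsmodels' C() instead drops the first level of its own ordering, which is one reason the coefficients can disagree. You can inspect Spark's ordering like this:

# stages[0] is the fitted StringIndexerModel for 'marital'
marital_indexer = pipelineModel.stages[0]
print(marital_indexer.labels)  # categories in index order, most frequent first (actual order depends on the data)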

Here the two diverge. For MLlib:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='features', labelCol='label',
                        maxIter=10, standardization=False,
                        regParam=0.0, elasticNetParam=1)
lrModel = lr.fit(train)
predict_train = lrModel.transform(train)
predict_test = lrModel.transform(test)

print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
This prints the coefficients and intercept as follows:

Coefficients: [-0.1522485297849636, 0.435353284847935295, 0.01418389527062097, 4.6707000626430065e-05]

Intercept: -0.8298919632117885
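To see which assembled feature each coefficient belongs to (my addition; it relies on the 'ml_attr' metadata that VectorAssembler writes to the features column, so the exact layout may vary across Spark versions), something like this should work:

# Pair each coefficient with the feature name recorded in the column metadata
attrs = train.schema["features"].metadata["ml_attr"]["attrs"]
features = sorted((a for group in attrs.values() for a in group),
                  key=lambda a: a["idx"])
for f, coef in zip(features, lrModel.coefficients):
    print(f["idx"], f["name"], coef)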

Now, finding the coefficients and intercept with statsmodels:

import statsmodels.formula.api as smf

df2 = df_stats.toPandas()  # df_stats: the pipeline-transformed DataFrame (its construction is not shown in the question)

m1 = smf.logit(formula='label ~ C(maritalIndex) + age + balance',
               data=df2).fit()
m1.summary()
This gives an intercept of -1.0338, and the other coefficients likewise differ from the MLlib (PySpark) ones.
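Two things worth ruling out before blaming the libraries (my observations, not from the original post): the MLlib model is fit on the 70% train split while df_stats may be the full dataset, and maxIter=10 caps the optimizer well below Spark's default of 100 iterations, so the MLlib fit may stop before converging. A minimal sketch fitting statsmodels on exactly the same rows, assuming the train DataFrame from above (it still contains the raw marital, age, and balance columns):

import statsmodels.formula.api as smf

# Fit statsmodels on the same 70% split that the MLlib model saw
train_pd = train.select('label', 'marital', 'age', 'balance').toPandas()
m_same = smf.logit(formula='label ~ C(marital) + age + balance',
                   data=train_pd).fit()
print(m_same.params)

Note that patsy's C(marital) still picks its own reference level, whereas Spark drops the least frequent category, so the dummy coefficients will not line up one-for-one until the coding matches as well (see the last sketch below).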

What I have tried

  • Setting regParam=0 with L1 regularization selected (elasticNetParam=1). This was to check whether the discrepancy was related to regularization; the intent was to turn regularization off, as discussed in the scikit-learn context. (This combination does disable the penalty; see the sketch below.)
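For reference (my addition): Spark ML's penalty is regParam * (elasticNetParam * L1 + (1 - elasticNetParam) * L2), so with regParam=0.0 the fit is unpenalized regardless of elasticNetParam. A quick sanity check is to compare two configurations that should then coincide:

from pyspark.ml.classification import LogisticRegression

# Both fits are unpenalized because regParam=0.0, so elasticNetParam should not matter
lr_l1 = LogisticRegression(featuresCol='features', labelCol='label',
                           standardization=False, regParam=0.0, elasticNetParam=1.0)
lr_l2 = LogisticRegression(featuresCol='features', labelCol='label',
                           standardization=False, regParam=0.0, elasticNetParam=0.0)
print(lr_l1.fit(train).coefficients)
print(lr_l2.fit(train).coefficients)  # expected: (near-)identical coefficients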
Can someone help me figure out why I am getting different intercepts and coefficients from pyspark.ml.classification and statsmodels?


Thank you.

Have you checked that the design matrices being created are the same, e.g. that the categorical variables are encoded the same way in both cases and that there is no standardization? – Josef

As you suggested @Josef, I checked the code with: 1. numeric columns only → both approaches give the same results; 2. a single categorical column at a time → the design matrices appear to differ, since PySpark builds the matrix with OneHotEncoder while statsmodels encodes it directly. I created dummy variables for statsmodels so that its design matrix looks like PySpark's; the intercept and coefficients then come out very close (differing at the second decimal place), but I cannot get them any closer. Thanks. Adding more categories changes the design matrix again. Can anyone suggest how to make the output of OneHotEncoder (PySpark) produce a design matrix similar to statsmodels'?
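Building on that last comment, here is a sketch of one way to force the two design matrices to match (my suggestion, not from the thread): rebuild Spark's dummy columns in pandas instead of relying on patsy's C(). StringIndexer orders categories by descending frequency and OneHotEncoder(dropLast=True) drops the last, least frequent one, so the same columns can be created by hand and passed to statsmodels directly:

import statsmodels.api as sm

# Category order as Spark sees it: descending frequency (StringIndexerModel.labels)
labels = pipelineModel.stages[0].labels          # e.g. ['married', 'single', 'divorced'] (hypothetical order)
kept = labels[:-1]                               # OneHotEncoder(dropLast=True) drops the last label

train_pd = train.select('label', 'marital', 'age', 'balance').toPandas()
for lab in kept:                                 # one dummy column per kept label, matching Spark's classVec
    train_pd['marital_' + lab] = (train_pd['marital'] == lab).astype(float)

X = sm.add_constant(train_pd[['marital_' + lab for lab in kept] + ['age', 'balance']])
m = sm.Logit(train_pd['label'], X).fit()
print(m.params)  # should now line up with lrModel.intercept / lrModel.coefficients

Once the design matrices match and both models are fit on the same rows, any residual differences should come down to optimizer settings such as maxIter and the convergence tolerance.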