Java: evaluate method takes a long time when using a PMML model with JPMML

Today I used JPMML to load a PMML model in my code, but the "evaluate" method takes a long time. Here is the code I am working with:

    String modelPath = "....";
    ModelEvaluatorFactory factory = ModelEvaluatorFactory.newInstance();
    InputStream in = new ByteArrayInputStream(modelPath.getBytes("UTF-8"));

    PMML pmmlModel = JAXBUtil.unmarshalPMML(new StreamSource(in)); 
    ModelEvaluator<?> evaluator = factory.newModelManager(pmmlModel);
    List<FieldName> activeFields = evaluator.getActiveFields();

    Map<FieldName, FieldValue> defaultFeatures = new HashMap<>();

    //after filling the 'defaultFeatures' the line below takes long time
    Map<FieldName, ?> results = evaluator.evaluate(defaultFeatures);

PMML sample:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
        <Header>
            <Application name="JPMML-SkLearn" version="1.0-SNAPSHOT"/>
            <Timestamp>2017-01-22T14:18:05Z</Timestamp>
        </Header>
        <DataDictionary>
            <DataField name="GENDER" optype="categorical" dataType="string">
                <Value value="0"/>
                <Value value="1"/>
            </DataField>
            <DataField name="1GA_" optype="continuous" dataType="double"/>
            //67000 rows of datafield
        </DataDictionary>
        <TransformationDictionary>
            <DefineFunction name="logit" optype="continuous" dataType="double">
                <ParameterField name="value" optype="continuous" dataType="double"/>
                <Apply function="/">
                    <Constant dataType="double">1</Constant>
                    <Apply function="+">
                        <Constant dataType="double">1</Constant>
                        <Apply function="exp">
                            <Apply function="*">
                                <Constant dataType="double">-1</Constant>
                                <FieldRef field="value"/>
                            </Apply>
                        </Apply>
                    </Apply>
                </Apply>
            </DefineFunction>
        </TransformationDictionary>
        <MiningModel functionName="classification">
            <MiningSchema>
                <MiningField name="GENDER" usageType="target"/>
                <MiningField name="1GA_"/>
                //67000 rows of MiningField
            </MiningSchema>
            <Output>
                <OutputField name="probability_0" feature="probability" value="0"/>
                <OutputField name="probability_1" feature="probability" value="1"/>
            </Output>
            <LocalTransformations>
                <DerivedField name="x1" optype="continuous" dataType="double">
                    <FieldRef field="1GA_"/>
                </DerivedField>
                //100000 rows
            </LocalTransformations>
            <Segmentation multipleModelMethod="modelChain">
                <Segment id="1">
                    <True/>
                    <RegressionModel functionName="regression">
                        <MiningSchema>
                            <MiningField name="1GA_"/>
                        </MiningSchema>
                        <Output>
                            <OutputField name="decisionFunction_1" feature="predictedValue"/>
                            <OutputField name="logitDecisionFunction_1" optype="continuous" dataType="double" feature="transformedValue">
                                <Apply function="logit">
                                    <FieldRef field="decisionFunction_1"/>
                                </Apply>
                            </OutputField>
                        </Output>
                        <RegressionTable intercept="-5.303370169392045">
                            <NumericPredictor name="x1" coefficient="0.18476274186559316"/>
                            //100000 rows of NumericPredictor
                        </RegressionTable>
                    </RegressionModel>
                </Segment>
                <Segment id="2">
                    <True/>
                    <RegressionModel functionName="regression">
                        <MiningSchema>
                            <MiningField name="logitDecisionFunction_1"/>
                        </MiningSchema>
                        <Output>
                            <OutputField name="logitDecisionFunction_0" feature="predictedValue"/>
                        </Output>
                        <RegressionTable intercept="1.0">
                            <NumericPredictor name="logitDecisionFunction_1" coefficient="-1.0"/>
                        </RegressionTable>
                    </RegressionModel>
                </Segment>
                <Segment id="3">
                    <True/>
                    <RegressionModel functionName="classification">
                        <MiningSchema>
                            <MiningField name="GENDER" usageType="target"/>
                            <MiningField name="logitDecisionFunction_1"/>
                            <MiningField name="logitDecisionFunction_0"/>
                        </MiningSchema>
                        <RegressionTable intercept="0.0" targetCategory="1">
                            <NumericPredictor name="logitDecisionFunction_1" coefficient="1.0"/>
                        </RegressionTable>
                        <RegressionTable intercept="0.0" targetCategory="0">
                            <NumericPredictor name="logitDecisionFunction_0" coefficient="1.0"/>
                        </RegressionTable>
                    </RegressionModel>
                </Segment>
            </Segmentation>
        </MiningModel>
    </PMML>
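
Read as plain arithmetic, the three-segment model chain in this PMML appears to be a binary logistic regression. The sketch below is one reading of the XML, not JPMML's implementation. The class and method names are hypothetical, only the single visible predictor `x1` is used because the remaining rows are elided in the sample, and the input value `10.0` is made up for illustration.

```java
final class ModelChainSketch {

    /** The DefineFunction named "logit" in the PMML: 1 / (1 + exp(-1 * value)). */
    static double logit(double value) {
        return 1.0 / (1.0 + Math.exp(-1.0 * value));
    }

    /** Segment 1: intercept plus the weighted sum of the derived fields (x1, ...). */
    static double decisionFunction(double intercept, double[] coefficients, double[] x) {
        double sum = intercept;
        for (int i = 0; i < coefficients.length; i++) {
            sum += coefficients[i] * x[i];
        }
        return sum;
    }

    /** Returns {probability_0, probability_1} as the chain seems to compute them. */
    static double[] probabilities(double intercept, double[] coefficients, double[] x) {
        // Segment 1 output: logitDecisionFunction_1 = logit(decisionFunction_1)
        double p1 = logit(decisionFunction(intercept, coefficients, x));
        // Segment 2: logitDecisionFunction_0 = 1.0 + (-1.0) * logitDecisionFunction_1
        double p0 = 1.0 + (-1.0) * p1;
        // Segment 3 passes both values through with coefficient 1.0
        return new double[] {p0, p1};
    }

    public static void main(String[] args) {
        double intercept = -5.303370169392045;          // from the RegressionTable
        double[] coefficients = {0.18476274186559316};  // only x1 is visible in the sample
        double[] x = {10.0};                            // hypothetical input value
        double[] p = probabilities(intercept, coefficients, x);
        System.out.println("probability_0 = " + p[0]);
        System.out.println("probability_1 = " + p[1]);
    }
}
```

On this reading, the cost of one evaluation is dominated by Segment 1, whose weighted sum runs over every NumericPredictor row.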


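One idea is to try MLlib instead of JPMML. Any thoughts? Thanks.

What do you mean by "loading"? Is it "parsing a PMML document into an in-memory data structure" or "executing a PMML document"?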

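Your code appears to aim at the latter. However, it is bound to fail, because the `JAXBUtil#unmarshalPMML(Source)` method is called with a byte array that does not contain a valid PMML document (no XML parser would accept `"....".getBytes("UTF-8")`).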
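
To illustrate the point about the byte array, here is a minimal sketch that uses only the JDK (no JPMML classes). It shows that the bytes of a path string are not the bytes of the file at that path. The class name, the helper `isWellFormedXml`, and the temporary file are hypothetical stand-ins for the real model file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;

final class PmmlInputDemo {

    /** Returns true if the stream parses as well-formed XML, false otherwise. */
    static boolean isWellFormedXml(InputStream in) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            return true;
        } catch (Exception e) {
            return false; // e.g. SAXException for input that is not XML
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the real PMML file.
        Path modelPath = Files.createTempFile("model", ".pmml");
        Files.write(modelPath,
                "<PMML xmlns=\"http://www.dmg.org/PMML-4_2\" version=\"4.2\"/>"
                        .getBytes(StandardCharsets.UTF_8));

        // As in the question: this stream holds the characters of the path itself.
        InputStream pathBytes = new ByteArrayInputStream(
                modelPath.toString().getBytes(StandardCharsets.UTF_8));
        System.out.println("path string parses as XML: " + isWellFormedXml(pathBytes));

        // Opening the file gives a stream over the document's actual content.
        try (InputStream fileBytes = Files.newInputStream(modelPath)) {
            System.out.println("file content parses as XML: " + isWellFormedXml(fileBytes));
        }
        Files.delete(modelPath);
    }
}
```

A stream opened this way can then be wrapped in a `StreamSource`, as the question's code already does.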

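Also, what do you mean by "takes a long time"? The JAXB framework has a one-time initialization cost of about 1 second. After that, it can unmarshal roughly 200 to 500 MB (i.e. megabytes) of PMML content per second. How much more do you need?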
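
One way to make "takes a long time" concrete is to time the one-time load separately from each evaluation. The sketch below is a generic timing helper, not part of JPMML. `StageTimer`, `loadModel` and `evaluateOnce` are hypothetical names, and the sleeps merely simulate work that would be replaced by the real unmarshalling and `evaluate` calls.

```java
final class StageTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static double timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e6;
    }

    /** Simulates work; only used by the placeholder tasks below. */
    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Hypothetical placeholders: swap in the real load and evaluate calls.
        Runnable loadModel = () -> sleep(50);
        Runnable evaluateOnce = () -> sleep(5);

        System.out.printf("load (one-time): %.1f ms%n", timeMillis(loadModel));

        // Time several evaluations separately: the first call is often slower
        // than later ones because of class loading and JIT warm-up.
        for (int i = 1; i <= 3; i++) {
            System.out.printf("evaluate #%d: %.1f ms%n", i, timeMillis(evaluateOnce));
        }
    }
}
```

Reporting the per-call figures from a run like this makes it clear whether the cost sits in loading or in evaluation.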

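Hello, the code is working. It is the evaluate method that takes a long time, which is why I am thinking of using MLlib instead.

So by "loading" you actually mean "executing". JPMML executes the model according to the execution plan stored in the PMML document. Execution is slow because the PMML document contains an inefficient execution plan. What software did you use to generate this PMML document? Was it Apache Spark's own `PMMLExportable` interface, which is known to produce extremely inefficient execution plans (for example, it may expand a single categorical data column into thousands of continuous data columns)?

I have added the PMML sample. Thanks for your help.

This is the root of your performance problem: `//67000 rows of datafield`. Basically, JPMML needs to execute a function with 67000 (sixty-seven thousand) arguments, and you are unhappy with its performance? You need to refactor the execution plan stored in the PMML document. In this case, you need to work out what these 67000 DataField elements really represent. For example, could they represent 67 categorical features, each with 1000 category levels? After refactoring, this 67-argument function would evaluate about 1000 times faster.

Your code example indicates that you are using outdated JPMML libraries for both conversion and evaluation. You should definitely upgrade to JPMML-SkLearn 1.2(.6) and JPMML-Evaluator 1.3(.4). The performance gain should be noticeable, but naturally it will not be enough to make a 67000-argument function run fast.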
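
The refactoring idea can be sketched in plain Java. The example assumes, purely for illustration, that the 67000 columns are a one-hot expansion of 67 categorical features with 1000 levels each; the thread only raises this as a possibility. All names and the random data are hypothetical. Both methods produce the same linear score, but the first needs 67000 prepared inputs per record and the second only 67.

```java
import java.util.Random;

final class OneHotVsCategorical {
    static final int FEATURES = 67;   // assumed number of categorical features
    static final int LEVELS = 1000;   // assumed number of levels per feature

    /** One-hot form: a dot product over FEATURES * LEVELS input columns. */
    static double scoreOneHot(double[] coefficients, double[] oneHotInputs) {
        double sum = 0.0;
        for (int i = 0; i < coefficients.length; i++) {
            sum += coefficients[i] * oneHotInputs[i];
        }
        return sum;
    }

    /** Categorical form: one coefficient lookup per feature. */
    static double scoreCategorical(double[] coefficients, int[] levelPerFeature) {
        double sum = 0.0;
        for (int f = 0; f < levelPerFeature.length; f++) {
            sum += coefficients[f * LEVELS + levelPerFeature[f]];
        }
        return sum;
    }

    /** Expands the chosen level of each feature into a 0/1 indicator vector. */
    static double[] toOneHot(int[] levelPerFeature) {
        double[] inputs = new double[levelPerFeature.length * LEVELS];
        for (int f = 0; f < levelPerFeature.length; f++) {
            inputs[f * LEVELS + levelPerFeature[f]] = 1.0;
        }
        return inputs;
    }

    public static void main(String[] args) {
        Random random = new Random(42); // made-up data, fixed seed for repeatability
        double[] coefficients = new double[FEATURES * LEVELS];
        for (int i = 0; i < coefficients.length; i++) {
            coefficients[i] = random.nextGaussian();
        }
        int[] levels = new int[FEATURES];
        for (int f = 0; f < FEATURES; f++) {
            levels[f] = random.nextInt(LEVELS);
        }

        double[] oneHot = toOneHot(levels);
        System.out.println("one-hot inputs per record:     " + oneHot.length);
        System.out.println("categorical inputs per record: " + levels.length);
        System.out.println("one-hot score:     " + scoreOneHot(coefficients, oneHot));
        System.out.println("categorical score: " + scoreCategorical(coefficients, levels));
    }
}
```

In an evaluator the saving would come less from the arithmetic itself than from no longer having to prepare and validate tens of thousands of field values for every record.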