MongoDB映射功能中的多个文件输入

MongoDB映射功能中的多个文件输入,mongodb,Mongodb,实际上,我想在mongodb中实现一个数据聚类算法。 我有两个文件 数据文件:它有带时间戳的数据点 例: 配置文件(元数据) 维度类别粒度 0, N, 4,0,100 [ 0 th dimesion is numeric has granularity 4 and starts from 0 & goes till 100 i.e. 0-25, 26-50, 51-75,76-100] 1,N,2,0,50 [Ist dimension has gran = 2 thus 0

实际上,我想在mongodb中实现一个数据聚类算法。 我有两个文件

  • 数据文件:它有带时间戳的数据点
  • 例:

  • 配置文件(元数据)
  • 维度类别粒度

    0, N, 4,0,100  [ 0 th dimesion is numeric has granularity 4 and starts from 0 & goes till 100 i.e. 0-25, 26-50, 51-75,76-100]
    1,N,2,0,50      [Ist dimension has gran = 2 thus 0-25, 26-50]
    2,C,A,B,C,D [2nd dimension is categorical and as values a,b,c,d therefore granularity 4]
    
    现在,我必须在mongodb中构建一个MAP-REDUCE函数,通过输入上述文件,为时间戳处的数据提供d签名:

    6- 1,1,0 
    7- 3,1,1  
    
    等等

    我必须运行map reduce,将两个文件都作为输入。。但是我在mongodb map reduce中找不到任何方法来获取输入的多个文件。 如果有人有什么想法,请告诉我怎么做


    谢谢

    mapReduce作业的输入是MongoDB集合,不能是数据文件

    但是,在您的情况下,您不需要mapReduce(),因为没有“reduce”部分—您所做的只是将1:1的输入记录转换为输出记录

    因此,第一步是将数据文件存储到集合“inp”中——将时间序列作为数组存储在文档中。如果您的数据文件会导致一个大于16MB的文档,那么您必须将其拆分为多个文档-为了示例起见,我将每个文档仅存储2个时间戳元素。我用JavaScript为mongo shell编写了以下示例:

    PATH = "/home/ronald/mongotest/";
    DATA = "data.file";
    ELEMS_PER_DOC = 2; // number of emelements in "series" per document
    
    db.data.drop();
    data = cat( PATH + DATA );
    lines = data.split("\n")
    lines = lines.splice(0,lines.length-1);
    series = [];
    lines.forEach(function( line ) {
        if ( series.length >= ELEMS_PER_DOC ) {
            db.data.insert({ "series": series });
            series = [];
        }
        l = line.split("|");
        timestamp = l[0];
        d = l[1].split(",");
        series.push( { "ts": timestamp, "data": d } );
    });
    db.data.insert({ "series": series });
    
    对于给定的数据文件:

    6|46,36,A
    7|90,45,B
    8|45,12,C
    9|34,67,D
    
    这将产生以下集合:

    > db.data.find().pretty()
    {
        "_id" : ObjectId("515e735657a0887a97cc8d23"),
        "series" : [
            {
                "ts" : "6",
                "data" : [
                    "46",
                    "36",
                    "A"
                ]
            },
            {
                "ts" : "7",
                "data" : [
                    "90",
                    "45",
                    "B"
                ]
            }
        ]
    }
    {
        "_id" : ObjectId("515e735657a0887a97cc8d24"),
        "series" : [
            {
                "ts" : "8",
                "data" : [
                    "45",
                    "12",
                    "C"
                ]
            },
            {
                "ts" : "9",
                "data" : [
                    "34",
                    "67",
                    "D"
                ]
            }
        ]
    }
    
    db.out.drop();
    cursor = db.data.find();
    cursor.forEach( function (doc) {
        doc.series.forEach( function (serie) {
            for ( i=0; i<serie.data.length; i++ ) {
                // apply transformation function for each dimension
                serie.data[i] = funcs[i]( serie.data[i] );            
            }
        });
        db.out.insert( doc );
    })
    
    [注意:如果您不想在MongoDB中存储输入数据,那么只需构建“series”数组,并在第三步中将其用作输入。观察客户端计算机上的内存使用情况!]

    下一步是从配置文件生成JavaScript函数,这些函数将用于根据规则集转换数据。实际上,这将是一个函数数组,以避免硬编码的三维限制

    PATH = "/home/ronald/mongotest/";
    CONFIG = "config.file";
    
    config = cat( PATH + CONFIG );
    lines = config.split("\n")
    lines = lines.splice(0,lines.length-1);
    // array of functions - index = dimension
    funcs = [];
    lines.forEach(function( line ) {
        x = line.split(",");
        f = "";
        if ( x[1] == "N" ) {
            // Numeric rule: 
            // x[2] = granularity
            // x[3],x[4] lower,upper range
            // the function to be called for the given value looks like:
            //    function( val ) returns: the interval or "n/a" if outside range
            // the interval is given by (val - (val modulo intervalSize)) / intervalSize
            // the intervalSize is (max - min) / granularity
            intervalSize = (x[4] - x[3]) / x[2];
            f = "function (val) {";
            f += "  if ( val < "+x[3]+" || val > "+x[4]+" ) return 'n/a';";
            f += "  return (val - val % "+intervalSize+") / "+intervalSize+";";
            f += "}";
        } else if ( x[1] == "C" ) {
            // Categoric rule: 
            // return the position of value in the array of params
            // skip dimension and rule type
            x = x.splice(2, x.length-1);
            // build parameter array
            pa = '[';
            x.forEach( function(p) { pa += '"' + p + '",' } );
            pa += ']';
            // the function will return -1 if value not found in array
            f = "function (val) { return "+pa+".indexOf(val) }";
        }
        else {
            // unknown rule type
            f = "function (val) { return 'rule err' }";
        }
        eval( "fx = "+f );
        funcs.push( fx );
    });
    
    这将生成以下函数数组:

    > funcs
    [
        function (val) {  if ( val < 0 || val > 100 ) return 'n/a';  return (val - val % 25) / 25;},
        function (val) {  if ( val < 0 || val > 50 ) return 'n/a';  return (val - val % 25) / 25;},
        function (val) { return ["A","B","C","D",].indexOf(val) }
    ]
    
    > funcs
    [
        function (val) {  if ( val < 0 || val > 100 ) return 'n/a';  return (val - val % 25) / 25;},
        function (val) {  if ( val < 0 || val > 50 ) return 'n/a';  return (val - val % 25) / 25;},
        function (val) { return ["A","B","C","D",].indexOf(val) }
    ]
    
    db.out.drop();
    cursor = db.data.find();
    cursor.forEach( function (doc) {
        doc.series.forEach( function (serie) {
            for ( i=0; i<serie.data.length; i++ ) {
                // apply transformation function for each dimension
                serie.data[i] = funcs[i]( serie.data[i] );            
            }
        });
        db.out.insert( doc );
    })
    
    > db.out.find().pretty()
    {
        "_id" : ObjectId("515e974c57a0887a97cc8d2f"),
        "series" : [
            {
                "ts" : "6",
                "data" : [
                    1,
                    1,
                    0
                ]
            },
            {
                "ts" : "7",
                "data" : [
                    3,
                    1,
                    1
                ]
            }
        ]
    }
    {
        "_id" : ObjectId("515e974c57a0887a97cc8d30"),
        "series" : [
            {
                "ts" : "8",
                "data" : [
                    1,
                    0,
                    2
                ]
            },
            {
                "ts" : "9",
                "data" : [
                    1,
                    "n/a",
                    3
                ]
            }
        ]
    }