Custom InputFormat for reading JSON in Hadoop

Tags: json, hadoop, mapreduce, bigdata

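I'm a beginner with Hadoop. I was told to create a custom InputFormat class to read JSON data. I searched on Google and learned how to write a custom InputFormat class that reads data from a file, but I'm stuck on parsing the JSON data. My JSON data looks like this: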

[
    {
        "_count": 30,
        "_start": 0,
        "_total": 180,
        "values": [
            {
                "attachment": {
                    "contentDomain": "techcarnival2013.eventbrite.com",
                    "contentUrl": "http://techcarnival2013.eventbrite.com/",
                    "imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png",
                    "summary": "Get to know a few thousand of Silicon Valley's best and brightest while enjoying unparalleled access to Candlestick Park,\u00a0games, food, music and more. We'll have carnival games you haven't played since you were ten, giant inflatable obstacle...",
                    "title": "Tech Carnival @ Candlestick Park"
                },
                "comments": {
                    "_total": 0
                },
                "creationTimestamp": 1373908436000,
                "creator": {
                    "firstName": "Clayton",
                    "headline": "Director of Operations",
                    "secondname": {
                        "name": "myname"
                    },
                    "lastName": "K.",
                    "pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_R7Vm6_RqBDHaHCDzJHRA6hsNcwOfECjzMeaA6heqHeo0v6ovBWoCe8pVJiYrd5pJVu4KdbnQQ3Lj"
                },
                "likes": {
                    "_total": 0
                },
                "relationToViewer": {
                    "availableActions": {
                        "_total": 7,
                        "values": [
                            {
                                "code": "add-comment"
                            },
                            {
                                "code": "categorize-as-job"
                            },
                            {
                                "code": "categorize-as-promotion"
                            },
                            {
                                "code": "flag-as-inappropriate"
                            },
                            {
                                "code": "follow"
                            },
                            {
                                "code": "like"
                            },
                            {
                                "code": "reply-privately"
                            }
                        ]
                    },
                    "isFollowing": false,
                    "isLiked": false
                },
                "summary": "Network with 4,000+ from the tech community, including folks from DFJ, Google, LinkedIn, Square, Uber, Y Combinator, 500 Startups, etc. $10 ticket gets you all-you-can-ride access to the pop-up Tech Carnival, will be the biggest Wednesday night of the tech summer.",
                "title": "Tech Event @ Candlestick Park on Wednesday, July 17th! Come play carnival games with ~4,000 of the Bay area's best and brightest!"
            },
            {
                "attachment": {
                    "contentDomain": "lifebeyondnumbers.com",
                    "contentUrl": "http://bit.ly/10VTqMu",
                    "imageUrl": "http://lifebeyondnumbers.com/wp-content/uploads/2013/07/lurnq_Online_Courses.jpg",
                    "summary": "LurnQ offers a platform for learning and teaching that is free for everyone. It caters to a diverse online audience and is relevant to everyone in general. The key segment that we address now is of life long learners.",
                    "title": "LurnQ - making lifelong learning clutter free, fun and a social..."
                },
                "comments": {
                    "_total": 0
                },
                "creationTimestamp": 1373883177000,
                "creator": {
                    "firstName": "Syed",
                    "headline": "Founder and CEO at QubiqSquare",
                    "lastName": "Muksit",
                    "pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_Y5gdzlRCbQBTqIa-pXYnz-01b6KinDO-pFWnz-ZCZLk1WWdt-_SLUt2uWmrpzo0OxQxcVv6pRjbE"
                },
                "likes": {
                    "_total": 0
                },
                "relationToViewer": {
                    "availableActions": {
                        "_total": 7,
                        "values": [
                            {
                                "code": "add-comment"
                            },
                            {
                                "code": "categorize-as-job"
                            },
                            {
                                "code": "categorize-as-promotion"
                            },
                            {
                                "code": "flag-as-inappropriate"
                            },
                            {
                                "code": "follow"
                            },
                            {
                                "code": "like"
                            },
                            {
                                "code": "reply-privately"
                            }
                        ]
                    },
                    "isFollowing": false,
                    "isLiked": false
                },
                "summary": "LurnQ offers a platform for learning and teaching that is free for everyone. It caters to a diverse online audience and is relevant to everyone in general. The key segment that we address now is of life long learners.",
                "title": "There is so much to learn and most of the times, we don\u2019t even know that this-and-that good stuff exists.  http://bit.ly/10VTqMu"
            },
            {
                "attachment": {
                    "contentDomain": "techcarnival2013.eventbrite.com",
                    "contentUrl": "http://techcarnival2013.eventbrite.com/",
                    "imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png",
                    "summary": "Get to know a few thousand of Silicon Valley's best and brightest while enjoying unparalleled access to Candlestick Park,\u00a0games, food, music and more. We'll have carnival games you haven't played since you were ten, giant inflatable obstacle...",
                    "title": "Tech Carnival @ Candlestick Park"
                },
                "comments": {
                    "_total": 0
                },
                "creationTimestamp": 1373654758000,
                "creator": {
                    "firstName": "Clayton",
                    "headline": "Director of Operations",
                    "lastName": "K.",
                    "pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_R7Vm6_RqBDHaHCDzJHRA6hsNcwOfECjzMeaA6heqHeo0v6ovBWoCe8pVJiYrd5pJVu4KdbnQQ3Lj"
                },
                "likes": {
                    "_total": 0
                },
                "relationToViewer": {
                    "availableActions": {
                        "_total": 7,
                        "values": [
                            {
                                "code": "add-comment"
                            },
                            {
                                "code": "categorize-as-job"
                            },
                            {
                                "code": "categorize-as-promotion"
                            },
                            {
                                "code": "flag-as-inappropriate"
                            },
                            {
                                "code": "follow"
                            },
                            {
                                "code": "like"
                            },
                            {
                                "code": "reply-privately"
                            }
                        ]
                    },
                    "isFollowing": false,
                    "isLiked": false
                },
                "summary": "Network with 4,000+ from the tech community, including folks from DFJ, Google, LinkedIn, Square, Uber, Y Combinator, 500 Startups, etc. $10 ticket gets you all-you-can-ride access to the pop-up Tech Carnival, will be the biggest Wednesday night of the tech summer.",
                "title": "Tech Event @ Candlestick Park on Wednesday, July 17th! Come play carnival games with ~4,000 of the Bay area's best and brightest!"
            }
..........
........ so on

]

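So I'm confused about how to read the JSON objects in a custom InputFormat class. Any ideas on how to parse it? I want to read a single JSON object from the JSON array; that is, read a well-formed JSON string and hand that string to map, where I'll use a JSON parser to build my own key-value pairs. Any help with this? Thanks in advance.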

If your problem is along the lines of Magham Ravi's comment, that answer is a good one.

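However, if you have one file containing all of the JSON data shown above, you may want to read the whole file, retrieve it as a string from the value (BytesWritable value) in the map function, and then pass it to a JSON parser available in that same map() function.

Please take a look.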
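One detail that is easy to get wrong when turning a BytesWritable value into a string: getBytes() returns the backing buffer, which can be longer than the valid data, so the conversion should be bounded by getLength(). The sketch below illustrates this with plain arrays so it runs without Hadoop on the classpath; the class name WholeFileValueDecoder and the method decode are hypothetical, and UTF-8 input is an assumption. In a mapper the call would be decode(value.getBytes(), value.getLength()).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: decodes the valid prefix of a byte buffer.
final class WholeFileValueDecoder {

    // Decodes only the first validLength bytes; anything after that is buffer padding.
    // Assumes the file content is UTF-8 encoded.
    static String decode(byte[] backingBuffer, int validLength) {
        return new String(backingBuffer, 0, validLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] json = "[{\"_count\": 30}]".getBytes(StandardCharsets.UTF_8);
        // Simulate an over-allocated buffer, as a reused Writable may hold.
        byte[] padded = Arrays.copyOf(json, json.length + 8);
        System.out.println(decode(padded, json.length));
    }
}
```

Decoding the full buffer instead would append NUL characters to the string, which most JSON parsers reject.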

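Also, if you have multiple JSON objects in one file and want each JSON object delivered to the mapper as a value, you can use something like an input format with defined start and end tags. For JSON, you need a unique start tag and end tag that mark exactly where the single JSON object you want begins and ends. Note that if you want the whole JSON object above returned as the value, using start tag = "[{" and end tag = "}]" may not help, because the many nested objects you already have will confuse the InputFormat.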
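A different technique from fixed start/end tags, and one that copes with nesting, is to track brace depth while skipping over string literals. The following is a minimal sketch of that idea, not the answer's method: JsonObjectSplitter and splitTopLevelObjects are hypothetical names, the input is assumed to be already in memory as a string, and the code does not validate the JSON. A custom RecordReader could apply the same loop to a stream, provided the file is not split in the middle of an object (for example by making the InputFormat non-splittable).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts each outermost {...} object from a JSON text.
final class JsonObjectSplitter {

    static List<String> splitTopLevelObjects(String text) {
        List<String> objects = new ArrayList<>();
        int depth = 0;            // current nesting level of { }
        int start = -1;           // index where the current outermost object began
        boolean inString = false; // true while inside a "..." literal
        boolean escaped = false;  // true right after a backslash inside a string

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                // Braces inside strings must not affect the depth counter.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {
                    objects.add(text.substring(start, i + 1));
                }
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        String sample = "[ {\"creator\": {\"firstName\": \"Clayton\"}, \"title\": \"a } b\"},\n"
                      + "  {\"likes\": {\"_total\": 0}} ]";
        for (String object : splitTopLevelObjects(sample)) {
            System.out.println(object);
        }
    }
}
```

Each returned string is a complete object that can be handed to a JSON parser in map(). For the data in the question, the outermost object is the one holding "values", so the same function would have to be applied to the contents of that array to get the individual posts.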

If the above cannot be achieved in any way, try building a customTextInputFormat that overrides the behavior defined in TextInputFormat.

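In the LineReader class, you will find two byte constants, CR ('\r') and LF ('\n'), that define the line ending (I may be slightly out of date; please check whether this is configurable through a configuration property. I know CDH has made it configurable; if it is not, you will need to override it).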

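You can drop CR and change LF to point to "]\n[", since each of your independent JSON documents will be in the format shown below, or you will know better how to work with it: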

[
…JSON 1
]
[
…JSON 2
]
[
…JSON N
]

(Note: there is a \n between ] and [, which marks the boundary between different JSON documents.)
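One consequence of using "]\n[" as the record delimiter is that the reader consumes the delimiter itself, so every record except the first loses its opening bracket and every record except the last loses its closing bracket. The sketch below shows the split and the bracket repair on an in-memory string; BracketDelimitedRecords and split are hypothetical names, and the input is assumed to use exactly "]\n[" between documents (a "]\r\n[" or trailing spaces would not match). In Hadoop releases where TextInputFormat reads the textinputformat.record.delimiter property, the same delimiter can be set there instead of overriding LineReader; whether your version supports it is something to verify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: splits concatenated JSON arrays on "]\n[" and restores the brackets.
final class BracketDelimitedRecords {

    static final String DELIMITER = "]\n[";

    static List<String> split(String content) {
        List<String> records = new ArrayList<>();
        // Pattern.quote makes the brackets literal; trim drops a trailing newline.
        String[] parts = content.trim().split(Pattern.quote(DELIMITER), -1);
        for (int i = 0; i < parts.length; i++) {
            String part = parts[i];
            if (i > 0) {
                part = "[" + part;   // the delimiter swallowed this record's opening bracket
            }
            if (i < parts.length - 1) {
                part = part + "]";   // and the previous record's closing bracket
            }
            records.add(part);
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "[\n{\"id\": 1}\n]\n[\n{\"id\": 2}\n]\n";
        for (String record : split(content)) {
            System.out.println(record.replace("\n", " "));
        }
    }
}
```

A mapper receiving delimiter-split values would apply the same repair before parsing, using its knowledge of whether the record is the first or last in the file, which is one reason a depth-aware reader can be simpler in practice.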


Hope this makes sense.

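Is one JSON object on a single line, or does it span multiple lines? If it is on a single line, take a look. What I understand is that you are able to get the complete JSON data in the mapper but cannot parse it?

@Magham Ravi It spans multiple lines. I'm trying to make my mapper read one JSON object at a time.

Thank you very much for your valuable suggestions. I'll try your logic and let you know.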
private static final byte CR = '\r';
private static final byte LF = '\n';