Python - How to stream a large (11GB) JSON file so it can be split up


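I have a very large JSON file (11GB) that is too big to read into memory. I want to break it up into smaller files so I can analyze the data. I am currently using Python and Pandas for the analysis, and I would like to know whether there is some way to access chunks of the file so that they can be read into memory without crashing the program. Ideally, I would like to break the several years of data into smaller, manageable files that each span roughly a week. The amount of data per week is not constant, and it does not matter much whether the intervals are exactly fixed.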

Here is the data format:

{
"actor" : 
{
    "classification" : [ "suggested" ],
    "displayName" : "myself",
    "followersCount" : 0,
    "followingCount" : 0,
    "followingStocksCount" : 0,
    "id" : "person:stocktwits:183087",
    "image" : "http://avatars.stocktwits.com/production/183087/thumb-1350332393.png",
    "link" : "http://stocktwits.com/myselfbtc",
    "links" : 
    [

        {
            "href" : null,
            "rel" : "me"
        }
    ],
    "objectType" : "person",
    "preferredUsername" : "myselfbtc",
    "statusesCount" : 2,
    "summary" : null,
    "tradingStrategy" : 
    {
        "approach" : "Technical",
        "assetsFrequentlyTraded" : [ "Forex" ],
        "experience" : "Novice",
        "holdingPeriod" : "Day Trader"
    }
},
"body" : "$BCOIN and macd is going down ..... http://stks.co/iDEB",
"entities" : 
{
    "chart" : 
    {
        "fullImage" : 
        {
            "link" : "http://charts.stocktwits.com/production/original_10047145.png"
        },
        "image" : 
        {
            "link" : "http://charts.stocktwits.com/production/small_10047145.png"
        },
        "link" : "http://stks.co/iDEB",
        "objectType" : "image"
    },
    "sentiment" : 
    {
        "basic" : "Bearish"
    },
    "stocks" : 
    [

        {
            "displayName" : "Bitcoin",
            "exchange" : "PRIVATE",
            "industry" : null,
            "sector" : null,
            "stocktwits_id" : 9659,
            "symbol" : "BCOIN"
        }
    ],
    "video" : null
},
"gnip" : 
{
    "language" : 
    {
        "value" : "en"
    }
},
"id" : "tag:gnip.stocktwits.com:2012:note/10047145",
"inReplyTo" : 
{
    "id" : "tag:gnip.stocktwits.com:2012:note/10046953",
    "objectType" : "comment"
},
"link" : "http://stocktwits.com/myselfbtc/message/10047145",
"object" : 
{
    "id" : "note:stocktwits:10047145",
    "link" : "http://stocktwits.com/myselfbtc/message/10047145",
    "objectType" : "note",
    "postedTime" : "2012-10-17T19:13:50Z",
    "summary" : "$BCOIN and macd is going down ..... http://stks.co/iDEB",
    "updatedTime" : "2012-10-17T19:13:50Z"
},
"provider" : 
{
    "displayName" : "StockTwits",
    "link" : "http://stocktwits.com"
},
"verb" : "post"
}

I think you need something like a streaming parser. ijson might work.
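As a rough illustration of the streaming idea, here is a minimal sketch that uses only the standard library instead of a third-party parser. It assumes the 11GB file is a sequence of top-level JSON objects written one after another (not one giant array), and that every record has an `object.postedTime` field like the sample above. The function names `iter_json_objects`, `week_key`, and `split_by_week`, and the `.jsonl` output naming, are made up for this example.

```python
import json
import os
from datetime import datetime


def iter_json_objects(fp, chunk_size=1 << 20):
    """Yield top-level JSON objects from a text stream, one at a time.

    Assumes the stream is a run of concatenated JSON objects separated
    by optional whitespace, so only one chunk plus one partial record
    is held in memory.
    """
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fp.read(chunk_size)
        buf += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(buf) and buf[pos].isspace():
                pos += 1
            if pos >= len(buf):
                break
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # end of file with an incomplete record
                break      # record is cut off; read another chunk
            yield obj
        buf = buf[pos:]
        if not chunk:
            break


def week_key(record):
    """Return an ISO year-week label such as '2012-W42' for a record."""
    posted = datetime.strptime(record["object"]["postedTime"],
                               "%Y-%m-%dT%H:%M:%SZ")
    iso = posted.isocalendar()
    return "%04d-W%02d" % (iso[0], iso[1])


def split_by_week(fp, out_dir):
    """Append each record to a JSON-lines file named after its ISO week."""
    os.makedirs(out_dir, exist_ok=True)
    current_key, out = None, None
    try:
        for record in iter_json_objects(fp):
            key = week_key(record)
            if key != current_key:
                # Only one output file is open at a time; append mode
                # keeps earlier records if a week shows up again later.
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, key + ".jsonl")
                out = open(path, "a", encoding="utf-8")
                current_key = key
            out.write(json.dumps(record) + "\n")
    finally:
        if out is not None:
            out.close()
```

Each weekly file written this way holds one record per line, so it should be loadable with `pandas.read_json(path, lines=True)`. If the real file turns out to be a single JSON array, this sketch does not apply as written and an event-based parser is the better fit.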


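jq 1.5 has a streaming parser (see the jq documentation). In one sense it is easy to use: for example, if your 1G file is named 1G.json, the following command produces a stream of lines, one line per "leaf" value: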

jq -c --stream . 1G.json

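(The output for the sample record is listed at the end of this page. Note that each line is itself valid JSON.)

Using the streaming output may not be quite so easy, though; that depends on what you want to do :-)

The key to understanding the streaming output is that most lines have the form:

[PATH, VALUE]

where "PATH" is the array representation of a path. (When using jq, this array can actually be used as a path.) The remaining lines contain only a path; they mark the end of the array or object that contains that path.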
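To make the [PATH, VALUE] form concrete, the following sketch rebuilds complete records in Python from lines produced by `jq -c --stream`. It assumes the events arrive in document order, and it treats a path-only line whose path has length 1 as the end of a top-level record. The function name `objects_from_stream` is hypothetical.

```python
import json


def objects_from_stream(lines):
    """Rebuild top-level JSON values from `jq -c --stream` output lines."""
    root = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        path = event[0]
        if len(event) == 1:
            # Path-only event: it closes a container. A path of length 1
            # means the top-level value itself is finished.
            if len(path) == 1:
                yield root
                root = None
            continue
        value = event[1]
        if not path:
            # Top-level scalar or empty container: nothing to rebuild.
            yield value
            continue
        if root is None:
            root = [] if isinstance(path[0], int) else {}
        node = root
        # Walk down the path, creating lists for integer keys and
        # dicts for string keys as they are first seen.
        for key, next_key in zip(path, path[1:]):
            if isinstance(node, list):
                if key == len(node):
                    node.append([] if isinstance(next_key, int) else {})
            elif key not in node:
                node[key] = [] if isinstance(next_key, int) else {}
            node = node[key]
        if isinstance(node, list):
            node.append(value)
        else:
            node[path[-1]] = value
```

Only one record is held in memory at a time, so the generator can be fed from a pipe, for example by iterating over `sys.stdin` while jq reads the large file.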


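This sounds like being penny wise and pound foolish. Programming time is usually expensive. So you have an 11GB file. Either buy 16/32GB of RAM and upgrade your computer, or rent a really large virtual machine. Amazon's EC2 instance type c4.4XL, with 32GB of RAM, costs under $1 per hour, and it is not even their largest one. Google also rents out some large-memory VMs in its cloud, which I recall being more expensive. Learn how to tear the VM down completely when you are not using it, so that you are not billed for idle time, and you will always have access to large machines when problems like this come up.

Output of the jq --stream command for the sample record above: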
[["actor","classification",0],"suggested"]
[["actor","classification",0]]
[["actor","displayName"],"myself"]
[["actor","followersCount"],0]
[["actor","followingCount"],0]
[["actor","followingStocksCount"],0]
[["actor","id"],"person:stocktwits:183087"]
[["actor","image"],"http://avatars.stocktwits.com/production/183087/thumb-1350332393.png"]
[["actor","link"],"http://stocktwits.com/myselfbtc"]
[["actor","links",0,"href"],null]
[["actor","links",0,"rel"],"me"]
[["actor","links",0,"rel"]]
[["actor","links",0]]
[["actor","objectType"],"person"]
[["actor","preferredUsername"],"myselfbtc"]
[["actor","statusesCount"],2]
[["actor","summary"],null]
[["actor","tradingStrategy","approach"],"Technical"]
[["actor","tradingStrategy","assetsFrequentlyTraded",0],"Forex"]
[["actor","tradingStrategy","assetsFrequentlyTraded",0]]
[["actor","tradingStrategy","experience"],"Novice"]
[["actor","tradingStrategy","holdingPeriod"],"Day Trader"]
[["actor","tradingStrategy","holdingPeriod"]]
[["actor","tradingStrategy"]]
[["body"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
[["entities","chart","fullImage","link"],"http://charts.stocktwits.com/production/original_10047145.png"]
[["entities","chart","fullImage","link"]]
[["entities","chart","image","link"],"http://charts.stocktwits.com/production/small_10047145.png"]
[["entities","chart","image","link"]]
[["entities","chart","link"],"http://stks.co/iDEB"]
[["entities","chart","objectType"],"image"]
[["entities","chart","objectType"]]
[["entities","sentiment","basic"],"Bearish"]
[["entities","sentiment","basic"]]
[["entities","stocks",0,"displayName"],"Bitcoin"]
[["entities","stocks",0,"exchange"],"PRIVATE"]
[["entities","stocks",0,"industry"],null]
[["entities","stocks",0,"sector"],null]
[["entities","stocks",0,"stocktwits_id"],9659]
[["entities","stocks",0,"symbol"],"BCOIN"]
[["entities","stocks",0,"symbol"]]
[["entities","stocks",0]]
[["entities","video"],null]
[["entities","video"]]
[["gnip","language","value"],"en"]
[["gnip","language","value"]]
[["gnip","language"]]
[["id"],"tag:gnip.stocktwits.com:2012:note/10047145"]
[["inReplyTo","id"],"tag:gnip.stocktwits.com:2012:note/10046953"]
[["inReplyTo","objectType"],"comment"]
[["inReplyTo","objectType"]]
[["link"],"http://stocktwits.com/myselfbtc/message/10047145"]
[["object","id"],"note:stocktwits:10047145"]
[["object","link"],"http://stocktwits.com/myselfbtc/message/10047145"]
[["object","objectType"],"note"]
[["object","postedTime"],"2012-10-17T19:13:50Z"]
[["object","summary"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
[["object","updatedTime"],"2012-10-17T19:13:50Z"]
[["object","updatedTime"]]
[["provider","displayName"],"StockTwits"]
[["provider","link"],"http://stocktwits.com"]
[["provider","link"]]
[["verb"],"post"]
[["verb"]]
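If only one field is needed, the stream can be filtered without rebuilding whole records. The sketch below picks out every value whose path matches a given one, for instance the `postedTime` of each message. It assumes one event per line, as in the listing above, and `values_at` is a hypothetical helper name.

```python
import json


def values_at(lines, target_path):
    """Yield the value of every [PATH, VALUE] event whose PATH matches."""
    target = list(target_path)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Path-only events have length 1 and carry no value; skip them.
        if len(event) == 2 and event[0] == target:
            yield event[1]
```

For example, `values_at(sys.stdin, ["object", "postedTime"])` could consume the output of the jq command through a pipe and yield one timestamp per record, which is enough to decide where the weekly boundaries fall.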