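Python Pandas: improving memory usage when loading dumped JSON as dictionaries in a column

Background

I use pandas to load a large dataset (up to a few million rows) from a DB (Redshift). Some of the columns in the DB were originally JSON saved as strings. After loading the data, I use json.loads to convert the cells that contain dumped JSON objects into dictionaries. These objects are roughly 2.5MB per cell.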
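To make the memory cost of this conversion concrete, here is a minimal sketch (Python 3) that compares the size of a raw JSON string with the approximate size of the dictionary json.loads builds from it. The helper `deep_sizeof` and the sample cell are hypothetical and not part of the code in this question; the numbers vary by Python version and only show the direction of the effect.

```python
import json
import sys


def deep_sizeof(obj, seen=None):
    """Approximate the size in bytes of obj plus everything it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        # Shared objects are counted once.
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size


# Made-up cell content; the real cells are far larger.
raw_cell = json.dumps({"id": 1, "tags": ["a", "b", "c"],
                       "nested": {"x": 1.5, "y": None}})
parsed_cell = json.loads(raw_cell)

raw_bytes = sys.getsizeof(raw_cell)
parsed_bytes = deep_sizeof(parsed_cell)
print("raw string: %d bytes, parsed dict: %d bytes" % (raw_bytes, parsed_bytes))
```

Running the same measurement on a handful of real cells would give a rough per-row estimate of how much the parsed column grows compared with the string column.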

All operations are done in memory, and I would rather avoid HDFS or other disk-based solutions, because the data ...


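Failed solution attempts

Parsing the dictionaries with another module: I tried ujson, simplejson and ast.literal_eval in place of the json module, but there was no noticeable difference in performance.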
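For reference, a comparison of this kind could be run roughly as follows. This is only a sketch in Python 3 using the standard library, so it covers json and ast.literal_eval but not ujson or simplejson; the sample cell is made up. Whichever parser is used, the resulting Python objects are equal, so the choice of parser mainly affects parsing time rather than the size of the parsed result.

```python
import ast
import json
import timeit

# Made-up sample cell; real cells would come from the DataFrame column.
sample = json.dumps({"id": 7, "values": list(range(200)), "name": "example"})

json_result = json.loads(sample)
# ast.literal_eval only accepts Python literals, so this works here only
# because the sample contains no null/true/false.
ast_result = ast.literal_eval(sample)

# Time 200 parses of the same cell with each parser.
json_seconds = timeit.timeit(lambda: json.loads(sample), number=200)
ast_seconds = timeit.timeit(lambda: ast.literal_eval(sample), number=200)
print("json.loads: %.4fs, ast.literal_eval: %.4fs" % (json_seconds, ast_seconds))
```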


Code

The DataFrame is stored in a class as self.df, and this is the code that handles the conversion:

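import json  # used by loadJson below

def turnAllJsonColumnsToDict(self):
    """Scan the df and turn column to json if it's a string in json format"""
    print 'Checking columns for needed type conversions'
    for col in self.df.columns:
        # Only the first cell of each column is inspected
        if self.check_if_json(self.df[col].iloc[0], col):
            self.loadColumnAsJson(col)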

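@staticmethod
def check_if_json(col, col_name, should_print=True):
    # col is a single cell value; treat it as JSON if it is a string
    # wrapped in {...} or [...]
    if isinstance(col, basestring):
        try:
            if col[0] in ('{', '[') and col[-1] in ('}', ']'):
                if should_print:
                    print 'Converting', col_name
                return True
        except IndexError as e:
            # Raised for an empty string
            if should_print:
                print 'Failed to check ', col_name
    return False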

def loadColumnAsJson(self, column):
    self.json_load_fail_counter = 0
    # Python 2: map returns a list, one parsed object per cell
    self.df[column] = map(self.loadJson, self.df[column])
    print '{failed} cells failed to be parsed to json in {column} (out of {rows})'.format(
        failed=self.json_load_fail_counter, column=column,
        rows=len(self.df))

def loadJson(self, value):
    if not value:
        return value

    # the json dump is not in a python structure
    null, true, false = None, True, False
    value = self.fix_dumped_json(value)
    try:
        if type(value) == dict:
            return value
        else:
            return json.loads(value) if value else {}
    except Exception as e:
        # Count the failure and fall back to an empty dict
        self.json_load_fail_counter += 1
    return {}

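@staticmethod
def fix_dumped_json(value):
    # Turn the string "None" into a JSON null
    value = value.replace('"None"', 'null')
    # Count unclosed '{' and append the missing '}' at the end
    counter = 0
    for i in value:
        counter += 1 if i == '{' else -1 if i == '}' else 0
    if counter > 0:
        for c in range(counter):
            value += '}'
    return value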
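To show what fix_dumped_json does to a truncated dump, here is a standalone sketch of the same idea in Python 3. It uses str.count instead of the character loop, which gives the same net brace count; the input string is a made-up example. Like the loop above, it also counts braces that appear inside string values, so it is a heuristic rather than a real JSON repair.

```python
import json


def fix_dumped_json(value):
    """Replace the string "None" with null and close any unclosed braces."""
    value = value.replace('"None"', 'null')
    # Net number of '{' without a matching '}' anywhere in the text.
    missing = value.count('{') - value.count('}')
    if missing > 0:
        value += '}' * missing
    return value


# Made-up truncated dump with two unclosed braces.
truncated = '{"a": {"b": "None"'
repaired = fix_dumped_json(truncated)
parsed = json.loads(repaired)
print(repaired)
```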