How do I avoid storing null values in HBase from Python?

python pandas hive hbase

I have some sample data like this:

    test_a      test_b   test_c   test_d   test_date
    -------------------------------------------------
1   a           500      0.1      111      20191101
2   a           NaN      0.2      NaN      20191102
3   a           200      0.1      111      20191103
4   a           400      NaN      222      20191104
5   a           NaN      0.2      333      20191105

I want to store this data in HBase, and I use the code below to do it:

from test.db import impala, hbasecon, HiveClient
import pandas as pd

sql = """
    SELECT test_a
            ,test_b
            ,test_c
            ,test_d
            ,test_date
    FROM table_test
    """

conn_impa = HiveClient().getcon()
all_df = pd.read_sql(sql=sql, con=conn_impa, chunksize=50000)

num = 0

for df in all_df:
    df = df.fillna('')
    df["s"] = df["test_d"] + df["test_date"]
    tmp_num = len(df)
    if len(df) > 0:
        with hintltable.batch(batch_size=1000) as b:
            df.apply(lambda row: b.put(row["k"], {
                'test:test_a': str(row["test_a"]),
                'test:test_b': str(row["test_b"]),
                'test:test_c': str(row["test_c"]),
            }), axis=1)

            num += len(df)

When I query HBase with

get 'test', 'a201911012'

I get the following result:

COLUMN                           CELL                                                                                         
 test:test_a                      timestamp=1578389750838, value=a                                                              
 test:test_b                      timestamp=1578389788675, value=                                                              
 test:test_c                      timestamp=1578389775471, value=0.2                                                              
 test:test_d                      timestamp=1578449081388, value=

How can I make sure that empty values are not stored in HBase from Python? We do not want null or empty-string values. Our expected result is:

COLUMN                           CELL                                                                                         
 test:test_a                      timestamp=1578389750838, value=a                                                                                                                       
 test:test_c                      timestamp=1578389775471, value=0.2

You should be able to do this by creating a custom function and calling it inside the lambda. For example, you could have a function like this:

def makeEntry(a, b, c):
    entrydict = {}
    # NaN == NaN is False, and empty strings and None are falsy,
    # so each check below skips NaN, '' and None.
    # Note that it also skips other falsy values such as 0.
    if a == a and a:
        entrydict["test:test_a"] = str(a)
    if b == b and b:
        entrydict["test:test_b"] = str(b)
    if c == c and c:
        entrydict["test:test_c"] = str(c)
    return entrydict

Then you can change the apply call to:

df.apply(lambda row: b.put(row["k"],
                           makeEntry(row["test_a"], row["test_b"], row["test_c"])), axis=1)

This way, you only write the values that are not NaN, rather than all of the values.
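The same idea can be generalized so that the column list is not hard-coded. The following is only a minimal sketch under a few assumptions: the names is_missing, make_entry and write_rows are made up for illustration, the column family "test" and the key column "k" are taken from the code in the question, and batch is assumed to be any object with a put(row_key, data) method like the b used above. It treats None, NaN and empty strings as missing, but keeps 0.

```python
import math


def is_missing(value):
    """Return True for None, NaN, and empty strings."""
    if value is None:
        return True
    # numpy.float64 subclasses float, so NaN values read through pandas
    # are caught by this check as well.
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value == ""


def make_entry(row, columns, family="test"):
    """Build the {'family:column': value} dict for one row, leaving out missing values."""
    return {
        "%s:%s" % (family, col): str(row[col])
        for col in columns
        if not is_missing(row[col])
    }


def write_rows(batch, rows, key_col, columns):
    """Call batch.put once per row, skipping rows that have nothing to store."""
    written = 0
    for row in rows:
        entry = make_entry(row, columns)
        if entry:  # do not issue a put with an empty dict
            batch.put(row[key_col], entry)
            written += 1
    return written
```

With a DataFrame, rows could be something like df.to_dict("records"), and the call would sit inside the existing with ... as b: block, for example write_rows(b, df.to_dict("records"), "k", ["test_a", "test_b", "test_c"]). Because missing values are filtered explicitly, the df.fillna('') step should not be needed for this purpose.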

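Thank you very much for your answer. I tried your approach and got an error at dict["test:test_a"] = str(a): TypeError: ("'type' object does not support item assignment", u'occurred at index 0')

@nullfearless Ohh, it should be fine now. I messed up by not changing all the variable names after renaming dict; they should all be entrydict.

Thank you very much, you saved my day. I just found out that my data contains None values. Do you know how to ignore them? Can I use if (a == a and a is not None)?

@nullfearless The function should already ignore None values (unless it is the string "None"), because None is falsy, but you can use if (a == a and a is not None).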
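To make the difference between the two checks discussed in these comments concrete, here is a small sketch that runs both on a few sample values. The function names keep_truthy and keep_not_none are made up for illustration. The first is the check used in makeEntry; the second is the "is not None" variant.

```python
def keep_truthy(value):
    """The check from makeEntry: drops NaN and every falsy value."""
    return bool(value == value and value)


def keep_not_none(value):
    """The alternative check: drops only NaN and None."""
    return value == value and value is not None


samples = [float("nan"), None, "", "None", 0, 0.2]
for value in samples:
    # Columns: the value, result of the truthiness check, result of the None check
    print(repr(value), keep_truthy(value), keep_not_none(value))
```

Both checks drop NaN and None and keep the string "None". They differ on other falsy values: the "is not None" variant keeps 0, but it also keeps empty strings, so after df.fillna('') it would store the empty cells again.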