Get embedding vectors from an embedding column in TensorFlow


numpy tensorflow deep-learning


I want to get the NumPy vectors created by an embedding_column in TensorFlow.

For example, create a sample DataFrame:

import pandas as pd

sample_column1 = ["Apple","Apple","Mango","Apple","Banana","Mango","Mango","Banana","Banana"]
sample_column2 = [1,2,1,3,4,6,2,1,3]
ds = pd.DataFrame(sample_column1,columns=["A"])
ds["B"] = sample_column2
ds

Convert the DataFrame to a TensorFlow dataset:

import tensorflow as tf

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    # Column 'B' is used as the label; the remaining columns are features
    labels = dataframe.pop('B')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

Create the embedding column:

from tensorflow import feature_column

tf_ds = df_to_dataset(ds)
# embedding cols
col_a = feature_column.categorical_column_with_vocabulary_list(
  'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = feature_column.embedding_column(col_a, dimension=8)

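Is there any way to get the embeddings from the col_a_embedding object as NumPy vectors?

For example, the category "Apple" will be embedded into a vector of size 8:

[a1 a2 a3 a4 a5 a6 a7 a8]

Can we fetch that vector?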

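I don't see a way to get what you want using feature columns (I don't see a function named sequence_embedding_column or anything similar in tf.feature_column), because the result of a feature column appears to be a fixed-length tensor. Feature columns achieve this by aggregating the individual embedding vectors with a combiner (sum, mean, sqrtn, etc.), so the dimension over the sequence of categories is actually lost.
But it is entirely doable if you use the lower-level APIs.
First, you can construct a lookup table that converts the categorical strings to IDs: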
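To make the combiner point concrete, here is a minimal NumPy-only sketch of what such an aggregation does. The table values, the id sequence, and all variable names are made up for illustration, and this is not the feature column implementation itself; it only shows why the per-item vectors can no longer be recovered once they have been combined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding table: 3 categories, 8 dimensions each
embedding_table = rng.normal(size=(3, 8))

# A sequence of four category ids belonging to one example
ids = np.array([0, 2, 0, 1])

# Per-item embeddings keep the sequence dimension: shape (4, 8)
per_item = embedding_table[ids]

# Combiners collapse the sequence dimension into one fixed-length vector
combined_sum = per_item.sum(axis=0)
combined_mean = per_item.mean(axis=0)
# sqrtn with unit weights: the sum divided by sqrt(number of items)
combined_sqrtn = per_item.sum(axis=0) / np.sqrt(len(ids))

print(per_item.shape, combined_sum.shape)
```

Each combined result has shape (8,) no matter how long the id sequence is, which is the fixed-length behavior described above.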
features = tf.constant(["apple", "banana", "apple", "mango"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="fruit.txt", num_oov_buckets=1)
ids = table.lookup(features)

#Content of "fruit.txt"
apple
mango
banana
unknown


Now you can initialize the embedding as a 2-D variable. Its shape is [number of categories, embedding dimension]:

num_categories = 3
embedding_dim = 64
category_emb = tf.get_variable(
                "embedding_table", [num_categories, embedding_dim],
                initializer=tf.truncated_normal_initializer(stddev=0.02))

Then you can look up the category embeddings as follows:
ids_embeddings = tf.nn.embedding_lookup(category_emb, ids)

Note that ids_embeddings is one long concatenated tensor. Feel free to reshape it to whatever shape you need.
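For TensorFlow 2.x with eager execution, the same lookup-then-embed approach might look roughly like the sketch below. It assumes an in-memory vocabulary instead of a vocabulary file, and it uses tf.lookup.StaticVocabularyTable with tf.lookup.KeyValueTensorInitializer; the variable names and the choice of one out-of-vocabulary bucket are illustrative assumptions. The embedding table here is only randomly initialized, so the vectors are meaningful only after training.

```python
import tensorflow as tf

vocab = ["Apple", "Mango", "Banana"]   # assumed in-memory vocabulary
num_oov_buckets = 1                    # ids for strings outside the vocabulary
embedding_dim = 8

# String -> integer id lookup table
initializer = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(vocab),
    values=tf.constant(list(range(len(vocab))), dtype=tf.int64),
)
table = tf.lookup.StaticVocabularyTable(initializer, num_oov_buckets=num_oov_buckets)

# One row per possible id, including the out-of-vocabulary bucket
embedding_table = tf.Variable(
    tf.random.truncated_normal(
        [len(vocab) + num_oov_buckets, embedding_dim], stddev=0.02),
    name="embedding_table",
)

features = tf.constant(["Apple", "Banana", "Apple", "Mango"])
ids = table.lookup(features)

# Shape (4, 8); .numpy() converts the eager tensor to a NumPy array
vectors = tf.nn.embedding_lookup(embedding_table, ids).numpy()
apple_vector = vectors[0]
print(apple_vector.shape)
```

Sizing the table as vocabulary size plus the number of out-of-vocabulary buckets ensures that every id the lookup table can return has a matching row.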
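I suggest the easiest and fastest way is to do it like this, which is what I do in my own application:
1. Use pandas read_csv to read your file, using the dtype parameter to load the field as a string column of type "category". Let's call it field "f". This is the original string column, not a numeric column yet.
2. Still in pandas, create a new column and copy the original column's cat.codes into it. Let's call it field "f_code". Pandas automatically encodes this as a compact numeric column. It will contain the numbers you need to pass to the neural network.
3. Now, in an Embedding layer of your Keras functional API model, pass f_code to the model's input layer. The values in f_code will now be numbers, such as int8, and the Embedding layer will handle them correctly. Don't pass the original column to the model.
Here are some sample lines of code copied from my project that do exactly the steps above: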
all_col_types_readcsv = {'userid':'int32','itemid':'int32','rating':'float32','user_age':'int32','gender':'category','job':'category','zipcode':'category'}

<some code omitted>

d = pd.read_csv(fn, sep='|', header=0, dtype=all_col_types_readcsv, encoding='utf-8', usecols=usecols_readcsv)

<some code omitted>

from pandas.api.types import is_string_dtype
# Select the columns to add code columns to. Numeric cols work fine with Embedding layer so ignore them.

cat_cols = [cn for cn in d.select_dtypes('category')]
print(cat_cols)
str_cols = [cn for cn in d.columns if is_string_dtype(d[cn])]
print(str_cols)
add_code_columns = [cn for cn in d.columns if (cn in cat_cols) and (cn in str_cols)]
print(add_code_columns)

<some code omitted>

# Actually add _code column for the selected columns
for cn in add_code_columns:
  codecolname = cn + "_code"
  if codecolname not in d.columns:
    d[codecolname] = d[cn].cat.codes
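As a small self-contained illustration of the cat.codes step, the sketch below applies it to the fruit data from the question. The column name A_code and the code_to_category dictionary are illustrative assumptions; the dictionary is handy later for mapping a row of an embedding matrix back to its category string.

```python
import pandas as pd

ds = pd.DataFrame({
    "A": ["Apple", "Apple", "Mango", "Apple", "Banana",
          "Mango", "Mango", "Banana", "Banana"],
    "B": [1, 2, 1, 3, 4, 6, 2, 1, 3],
})

# Convert the string column to a pandas categorical
ds["A"] = ds["A"].astype("category")

# cat.codes assigns each category a compact integer code
ds["A_code"] = ds["A"].cat.codes

# Map each integer code back to its category string
code_to_category = dict(enumerate(ds["A"].cat.categories))

print(ds[["A", "A_code"]])
print(code_to_category)
```

When the categories are inferred by astype("category"), pandas sorts them, so the codes follow alphabetical order rather than order of first appearance.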

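By the way, when passing DataFrames into model.fit(), wrap each of them in np.array(). It is not well documented, and apparently not checked at runtime either, that DataFrames cannot be passed safely. Otherwise you get massive memory allocations that crash the host.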
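It's hard to understand what you need. Can you give an example? What do you mean by "bumpy vector"?
@thushv89 I want to get the embedding vectors. Each category gets embedded into a vector of a given dimension, and I want to get that vector.
@greeness Sorry, that was a typo. I meant numpy.
That's a good idea. Thanks.
When I try this with TensorFlow 2.0, I get the following error: AttributeError: module 'tensorflow_core._api.v2.lookup' has no attribute 'index_table_from_file'
In TF 2.0 there is a similar module, tf.lookup. You may need to switch to tf.lookup.StaticVocabularyTable.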
d.info()
d.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99991 entries, 0 to 99990
Data columns (total 5 columns):
userid      99991 non-null int32
itemid      99991 non-null int32
rating      99991 non-null float32
job         99991 non-null category
job_code    99991 non-null int8
dtypes: category(1), float32(1), int32(2), int8(1)
memory usage: 1.3 MB

v = Lambda(lambda z: z[:, field_num0_X_cols[cn]], output_shape=(), name="Parser_" + cn)(input_x)
emb_input = Lambda(lambda z: tf.expand_dims(z, axis=-1), output_shape=(1,), name="Expander_" + cn)(v)
a = Embedding(input_dim=num_uniques[cn]+1, output_dim=emb_len[cn], input_length=1, embeddings_regularizer=reg, name="E_" + cn)(emb_input)
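To actually read the vectors back as NumPy, one option is to pull the weight matrix out of the Embedding layer and index it by category code. The sketch below is a standalone assumption-laden example: it builds an untrained tf.keras.layers.Embedding on the fruit data from the question, so the values are just the random initialization, and every variable name is illustrative. In a trained model like the one above, the same matrix should be reachable with something like model.get_layer("E_" + cn).get_weights()[0].

```python
import numpy as np
import pandas as pd
import tensorflow as tf

fruits = pd.Series(
    ["Apple", "Apple", "Mango", "Apple", "Banana",
     "Mango", "Mango", "Banana", "Banana"],
    dtype="category",
)
codes = fruits.cat.codes.to_numpy().astype("int32")
num_categories = len(fruits.cat.categories)

embedding_layer = tf.keras.layers.Embedding(
    input_dim=num_categories, output_dim=8)

# The first call builds the layer and creates its weight matrix
_ = embedding_layer(tf.constant(codes))

# NumPy array of shape (num_categories, 8): one row per category code
weights = embedding_layer.get_weights()[0]

# Row i belongs to the category whose code is i
vector_by_category = {
    cat: weights[i] for i, cat in enumerate(fruits.cat.categories)
}
apple_vector = vector_by_category["Apple"]
print(apple_vector)
```

Indexing the weight matrix by code avoids running the model at all, which keeps the lookup cheap once training is finished.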