tf.feature_column.input_layer: the feature ordering problem

Conclusion first

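The tf.feature_column.input_layer() API sorts the feature_columns it receives; it does not lay them out in the order they were passed in. The sort key is each feature_column's name, which TF generates (for example 'u_wu211_indicator', 'u_wu215_indicator', 'r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator').
Key code: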

for column in sorted(feature_columns, key=lambda x: x.name):
  ordered_columns.append(column)

Quick check:

In [31]: [x.name for x in sorted(fcs, key=lambda x: x.name)]
Out[31]:
['r_rsp113_indicator',
 'u_wu211_X_u_wu215_indicator',
 'u_wu211_indicator',
 'u_wu215_indicator']
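
The order in Out[31] can be reproduced without TensorFlow, since it is an ordinary string sort. The sketch below is only an illustration: the names are copied from the session above, and the variable names are made up here. It also shows why the crossed column sorts ahead of 'u_wu211_indicator': after the shared prefix 'u_wu211_', uppercase 'X' has a smaller code point than lowercase 'i'.

```python
# Column names copied from the session above, in the order they were passed in.
names_in_input_order = [
    "u_wu211_indicator",
    "u_wu215_indicator",
    "r_rsp113_indicator",
    "u_wu211_X_u_wu215_indicator",
]

# Plain string comparison, the same ordering that
# sorted(feature_columns, key=lambda x: x.name) applies to the names.
names_sorted = sorted(names_in_input_order)

# After the shared prefix 'u_wu211_', the first differing characters are
# 'X' (crossed column) and 'i' (indicator), and 'X' has the smaller code point.
x_code, i_code = ord("X"), ord("i")

print(names_sorted)
print(x_code, i_code)
```

A practical consequence is that renaming a feature key can silently move its block within the output tensor.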

Observed behavior

In [24]: # tfeatures is the features dict for a batch of two samples; its definition is not shown in this session.
    ...: u_wu211 = tf.feature_column.categorical_column_with_vocabulary_list(key='u_wu211', vocabulary_list=['0','1','2'])
    ...: u_wu215 = tf.feature_column.categorical_column_with_vocabulary_list(key='u_wu215', vocabulary_list=['00s','10s','90s'])
    ...: r_rsp113 = tf.feature_column.categorical_column_with_vocabulary_list(key='r_rsp113', vocabulary_list=['0','-1','1'])
    ...: u_wu211_u_wu215_cross = tf.feature_column.crossed_column(keys=[u_wu211, u_wu215], hash_bucket_size=3)
    ...: # Each column on its own.
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu215)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(r_rsp113)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211_u_wu215_cross)]))
    ...: # All four columns together, passed in the order we expect to get back.
    ...: print(tf.feature_column.input_layer(tfeatures, [
    ...:     tf.feature_column.indicator_column(u_wu211),
    ...:     tf.feature_column.indicator_column(u_wu215),
    ...:     tf.feature_column.indicator_column(r_rsp113),
    ...:     tf.feature_column.indicator_column(u_wu211_u_wu215_cross),
    ...: ]))

tf.Tensor(
[[1. 0. 0.]
 [0. 0. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 0. 0.]
 [1. 0. 0.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 1. 0.]
 [0. 1. 0.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 0. 1.]
 [0. 0. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0.]], shape=(2, 12), dtype=float32)

Take the first sample as an example. The expected layout is u_wu211 + u_wu215 + r_rsp113 + u_wu211_u_wu215_cross:
- Expected: [1. 0. 0.] + [0. 0. 0.] + [0. 1. 0.] + [0. 0. 1.]
- Actual: [0. 1. 0.] + [0. 0. 1.] + [1. 0. 0.] + [0. 0. 0.], i.e. the order ['r_rsp113', 'u_wu211_u_wu215_cross', 'u_wu211', 'u_wu215']
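
To make the comparison above less error-prone than reading blocks by eye, one can compute where each column's block sits in the concatenated row. This is a minimal sketch and not part of TensorFlow: `column_slices` is a hypothetical helper, the widths (3 per column) are read off the shapes printed above, and it assumes blocks are concatenated in ascending name order.

```python
def column_slices(widths_by_name):
    """Map each column name to its slice in the concatenated row.

    Assumes the blocks are laid out in ascending order of column name.
    """
    slices = {}
    offset = 0
    for name in sorted(widths_by_name):
        width = widths_by_name[name]
        slices[name] = slice(offset, offset + width)
        offset += width
    return slices


# Widths taken from the (2, 3) shapes printed for each column above.
widths = {
    "u_wu211_indicator": 3,
    "u_wu215_indicator": 3,
    "r_rsp113_indicator": 3,
    "u_wu211_X_u_wu215_indicator": 3,
}

# First row of the (2, 12) tensor printed above.
first_row = [0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.]

slices = column_slices(widths)
blocks = {name: first_row[s] for name, s in slices.items()}
for name, block in blocks.items():
    print(name, block)
```

Each recovered block matches the output that column produced when it was passed to input_layer on its own.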

What the documentation says

    feature_columns: An iterable containing the FeatureColumns to use as inputs
      to your model. All items should be instances of classes derived from
      `_DenseColumn` such as `numeric_column`, `embedding_column`,
      `bucketized_column`, `indicator_column`. If you have categorical features,
      you can wrap them with an `embedding_column` or `indicator_column`.

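In other words, the feature_columns argument takes an iterable of the FeatureColumns used by the model. Every item should be an instance of a class derived from _DenseColumn, such as numeric_column, embedding_column, bucketized_column or indicator_column. Categorical features have to be wrapped with embedding_column or indicator_column first.
Nothing here says anything about the order of the features in the output.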

Reading the source

tf.feature_column.input_layer

@tf_export(v1=['feature_column.input_layer'])
def input_layer(features,
                feature_columns,
                weight_collections=None,
                trainable=True,
                cols_to_vars=None,
                cols_to_output_tensors=None):
  """Returns a dense `Tensor` as input layer based on given `feature_columns`.

  Generally a single example in training data is described with FeatureColumns.
  At the first layer of the model, this column oriented data should be converted
  to a single `Tensor`.

  Example:

  ```python
  price = numeric_column('price')
  keywords_embedded = embedding_column(
      categorical_column_with_hash_bucket("keywords", 10K), dimensions=16)
  columns = [price, keywords_embedded, ...]
  features = tf.io.parse_example(..., features=make_parse_example_spec(columns))
  dense_tensor = input_layer(features, columns)
  for units in [128, 64, 32]:
    dense_tensor = tf.compat.v1.layers.dense(dense_tensor, units, tf.nn.relu)
  prediction = tf.compat.v1.layers.dense(dense_tensor, 1)
  ```

  Args:
    features: A mapping from key to tensors. `_FeatureColumn`s look up via these
      keys. For example `numeric_column('price')` will look at 'price' key in
      this dict. Values can be a `SparseTensor` or a `Tensor` depends on
      corresponding `_FeatureColumn`.
    feature_columns: An iterable containing the FeatureColumns to use as inputs
      to your model. All items should be instances of classes derived from
      `_DenseColumn` such as `numeric_column`, `embedding_column`,
      `bucketized_column`, `indicator_column`. If you have categorical features,
      you can wrap them with an `embedding_column` or `indicator_column`.
    weight_collections: A list of collection names to which the Variable will be
      added. Note that variables will also be added to collections
      `tf.GraphKeys.GLOBAL_VARIABLES` and `ops.GraphKeys.MODEL_VARIABLES`.
    trainable: If `True` also add the variable to the graph collection
      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
    cols_to_vars: If not `None`, must be a dictionary that will be filled with a
      mapping from `_FeatureColumn` to list of `Variable`s.  For example, after
      the call, we might have cols_to_vars =
      {_EmbeddingColumn(
        categorical_column=_HashedCategoricalColumn(
          key='sparse_feature', hash_bucket_size=5, dtype=tf.string),
        dimension=10): [<tf.Variable 'some_variable:0' shape=(5, 10),
                        <tf.Variable 'some_variable:1' shape=(5, 10)]}
      If a column creates no variables, its value will be an empty list.
    cols_to_output_tensors: If not `None`, must be a dictionary that will be
      filled with a mapping from '_FeatureColumn' to the associated
      output `Tensor`s.

  Returns:
    A `Tensor` which represents input layer of a model. Its shape
    is (batch_size, first_layer_dimension) and its dtype is `float32`.
    first_layer_dimension is determined based on given `feature_columns`.

  Raises:
    ValueError: if an item in `feature_columns` is not a `_DenseColumn`.
  """
  return _internal_input_layer(
      features,
      feature_columns,
      weight_collections=weight_collections,
      trainable=trainable,
      cols_to_vars=cols_to_vars,
      cols_to_output_tensors=cols_to_output_tensors)

_internal_input_layer



def _internal_input_layer(features,
                          feature_columns,
                          weight_collections=None,
                          trainable=True,
                          cols_to_vars=None,
                          scope=None,
                          cols_to_output_tensors=None,
                          from_template=False):
  """See input_layer. `scope` is a name or variable scope to use."""
  feature_columns = _normalize_feature_columns(feature_columns)
  for column in feature_columns:
    if not isinstance(column, _DenseColumn):
      raise ValueError(
          'Items of feature_columns must be a _DenseColumn. '
          'You can wrap a categorical column with an '
          'embedding_column or indicator_column. Given: {}'.format(column))
  weight_collections = list(weight_collections or [])
  if ops.GraphKeys.GLOBAL_VARIABLES not in weight_collections:
    weight_collections.append(ops.GraphKeys.GLOBAL_VARIABLES)
  if ops.GraphKeys.MODEL_VARIABLES not in weight_collections:
    weight_collections.append(ops.GraphKeys.MODEL_VARIABLES)

  def _get_logits():  # pylint: disable=missing-docstring
    builder = _LazyBuilder(features)
    output_tensors = []
    ordered_columns = []
    for column in sorted(feature_columns, key=lambda x: x.name):
      ordered_columns.append(column)
      with variable_scope.variable_scope(
          None, default_name=column._var_scope_name):  # pylint: disable=protected-access
        tensor = column._get_dense_tensor(  # pylint: disable=protected-access
            builder,
            weight_collections=weight_collections,
            trainable=trainable)
        num_elements = column._variable_shape.num_elements()  # pylint: disable=protected-access
        batch_size = array_ops.shape(tensor)[0]
        output_tensor = array_ops.reshape(
            tensor, shape=(batch_size, num_elements))
        output_tensors.append(output_tensor)
        if cols_to_vars is not None:
          # Retrieve any variables created (some _DenseColumn's don't create
          # variables, in which case an empty list is returned).
          cols_to_vars[column] = ops.get_collection(
              ops.GraphKeys.GLOBAL_VARIABLES,
              scope=variable_scope.get_variable_scope().name)
        if cols_to_output_tensors is not None:
          cols_to_output_tensors[column] = output_tensor
    _verify_static_batch_size_equality(output_tensors, ordered_columns)
    return array_ops.concat(output_tensors, 1)

  # If we're constructing from the `make_template`, that by default adds a
  # variable scope with the name of the layer. In that case, we dont want to
  # add another `variable_scope` as that would break checkpoints.
  if from_template:
    return _get_logits()
  else:
    with variable_scope.variable_scope(
        scope, default_name='input_layer', values=features.values()):
      return _get_logits()

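Two things to note:
- In _get_logits, _LazyBuilder deduplicates features that are referenced more than once and initializes them lazily.
- When the columns are added, they are first sorted by the feature_column's name (generated by TF, for example 'u_wu211_indicator', 'u_wu215_indicator', 'r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator').
- The relevant code: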

  def _get_logits():  # pylint: disable=missing-docstring
    builder = _LazyBuilder(features)
    output_tensors = []
    ordered_columns = []
    for column in sorted(feature_columns, key=lambda x: x.name):
      ordered_columns.append(column)
      with variable_scope.variable_scope(
          None, default_name=column._var_scope_name):  # pylint: disable=protected-access
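
The effect of that loop can be imitated in a few lines of plain Python. This is a rough sketch of the control flow only, not TensorFlow's implementation: `FakeColumn` and `concat_like_input_layer` are hypothetical names invented for illustration, each fake column carries a fixed one-sample output instead of computing it from a features dict, and variable scopes are ignored entirely.

```python
from collections import namedtuple

# Hypothetical stand-in for a dense feature column: a name plus a fixed
# one-sample output (real columns compute this from the features dict).
FakeColumn = namedtuple("FakeColumn", ["name", "values"])


def concat_like_input_layer(columns, cols_to_output=None):
    """Concatenate column outputs in name order, as the loop in _get_logits does."""
    row = []
    for column in sorted(columns, key=lambda c: c.name):
        if cols_to_output is not None:
            # Mirrors cols_to_output_tensors: one entry per column.
            cols_to_output[column] = column.values
        row.extend(column.values)
    return row


# First-sample outputs taken from the per-column prints earlier in the post,
# listed in the order the columns were passed to input_layer.
columns = [
    FakeColumn("u_wu211_indicator", (1., 0., 0.)),
    FakeColumn("u_wu215_indicator", (0., 0., 0.)),
    FakeColumn("r_rsp113_indicator", (0., 1., 0.)),
    FakeColumn("u_wu211_X_u_wu215_indicator", (0., 0., 1.)),
]

captured = {}
row = concat_like_input_layer(columns, captured)
print(row)
print([column.name for column in captured])
```

The concatenated row comes out in name order regardless of how `columns` is arranged, while the dictionary still gives access to each column's block by column.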

Verifying the conclusion

In [29]: fcs = [tf.feature_column.indicator_column(u_wu211),
    ...:        tf.feature_column.indicator_column(u_wu215),
    ...:        tf.feature_column.indicator_column(r_rsp113),
    ...:        tf.feature_column.indicator_column(u_wu211_u_wu215_cross)
    ...: ]

In [30]: sorted(fcs, key=lambda x: x.name)
Out[30]:
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='r_rsp113', vocabulary_list=('0', '-1', '1'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=CrossedColumn(keys=(VocabularyListCategoricalColumn(key='u_wu211', vocabulary_list=('0', '1', '2'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='u_wu215', vocabulary_list=('00s', '10s', '90s'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), hash_bucket_size=3, hash_key=None)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='u_wu211', vocabulary_list=('0', '1', '2'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='u_wu215', vocabulary_list=('00s', '10s', '90s'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

In [31]: [x.name for x in sorted(fcs, key=lambda x: x.name)]
Out[31]:
['r_rsp113_indicator',
 'u_wu211_X_u_wu215_indicator',
 'u_wu211_indicator',
 'u_wu215_indicator']

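Result predicted by the sorted names:
['r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator', 'u_wu211_indicator', 'u_wu215_indicator']
- That is: [0. 1. 0.] + [0. 0. 1.] + [1. 0. 0.] + [0. 0. 0.]
- Concatenated: [0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
- This matches the first row that input_layer actually returned.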
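
If downstream code needs the blocks in the order the columns were passed in, the sorted row can be rearranged after the fact. The following is a minimal sketch under stated assumptions, not an official workaround: `restore_input_order` is a hypothetical helper, the widths are the ones observed in this post, and it operates on a plain Python list rather than a tensor. In real TF code, the `cols_to_output_tensors` argument shown in the signature quoted earlier might serve a similar purpose by exposing each column's tensor separately.

```python
def restore_input_order(row, names_in_input_order, widths_by_name):
    """Rearrange a name-sorted concatenated row into the caller's column order."""
    # Locate each block under the assumption that blocks are sorted by name.
    offsets = {}
    offset = 0
    for name in sorted(names_in_input_order):
        offsets[name] = offset
        offset += widths_by_name[name]

    # Re-emit the blocks in the order the caller originally listed them.
    reordered = []
    for name in names_in_input_order:
        start = offsets[name]
        reordered.extend(row[start:start + widths_by_name[name]])
    return reordered


names_in_input_order = [
    "u_wu211_indicator",
    "u_wu215_indicator",
    "r_rsp113_indicator",
    "u_wu211_X_u_wu215_indicator",
]
widths = {name: 3 for name in names_in_input_order}

# First row returned by input_layer for the four columns together.
sorted_row = [0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.]

restored = restore_input_order(sorted_row, names_in_input_order, widths)
print(restored)
```

The restored row is the layout that was originally expected for the first sample: u_wu211, then u_wu215, then r_rsp113, then the crossed column.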
