Feature ordering in tf.feature_column.input_layer
Conclusion first
- tf.feature_column.input_layer() sorts the feature_columns passed to it instead of keeping them in the order given. The sort key is each feature_column's name, which TensorFlow generates (for example 'u_wu211_indicator', 'u_wu215_indicator', 'r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator').
- Key code:
    for column in sorted(feature_columns, key=lambda x: x.name):
      ordered_columns.append(column)
- Verification:
In [31]: [x.name for x in sorted( fcs, key=lambda x: x.name)]
Out[31]:
['r_rsp113_indicator',
 'u_wu211_X_u_wu215_indicator',
 'u_wu211_indicator',
 'u_wu215_indicator']
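The crossed column lands ahead of its own inputs because Python compares strings code point by code point: at the first position where the names differ, 'u_wu211_X_u_wu215_indicator' has an uppercase 'X' (code point 88) while 'u_wu211_indicator' has a lowercase 'i' (code point 105). A minimal sketch with plain strings (no TensorFlow needed), using the names copied from the output above:

```python
# Column names as TensorFlow generated them, in the order they were passed in.
names = [
    "u_wu211_indicator",
    "u_wu215_indicator",
    "r_rsp113_indicator",
    "u_wu211_X_u_wu215_indicator",
]

# Same sort key as input_layer uses: the plain name string.
ordered = sorted(names)

# Uppercase "X" (88) sorts before lowercase "i" (105), so the crossed
# column comes before u_wu211_indicator even though its name is longer.
assert ord("X") < ord("i")
print(ordered)
```

So the position of a column in the output depends only on how its generated name compares to the others, not on where it sits in the list.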
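Observed behavior
- tfeatures in the session below is the input feature dict; its definition is not shown in this transcript.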
In [24]: u_wu211 = tf.feature_column.categorical_column_with_vocabulary_list(key='u_wu211', vocabulary_list=['0','1','2'])
    ...: u_wu215 = tf.feature_column.categorical_column_with_vocabulary_list(key='u_wu215', vocabulary_list=['00s','10s','90s'])
    ...: r_rsp113 = tf.feature_column.categorical_column_with_vocabulary_list(key='r_rsp113', vocabulary_list=['0','-1','1'])
    ...: u_wu211_u_wu215_cross = tf.feature_column.crossed_column(keys = [u_wu211, u_wu215], hash_bucket_size=3)
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu215)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(r_rsp113)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211_u_wu215_cross)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211),
    ...:   tf.feature_column.indicator_column(u_wu215),
    ...: tf.feature_column.indicator_column(r_rsp113),
    ...: tf.feature_column.indicator_column(u_wu211_u_wu215_cross)
    ...: ]))
    ...:
tf.Tensor(
[[1. 0. 0.]
 [0. 0. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 0. 0.]
 [1. 0. 0.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 1. 0.]
 [0. 1. 0.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 0. 1.]
 [0. 0. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0.]], shape=(2, 12), dtype=float32)
- Taking the first sample as an example, the expected layout is u_wu211 + u_wu215 + r_rsp113 + u_wu211_u_wu215_cross
- That is: [1. 0. 0.] + [0. 0. 0.] + [0. 1. 0.] + [0. 0. 1.]
- What actually comes back is [0. 1. 0.] + [0. 0. 1.] + [1. 0. 0.] + [0. 0. 0.], i.e. the order ['r_rsp113', 'u_wu211_u_wu215_cross', 'u_wu211', 'u_wu215']
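To find where a given column ends up in the concatenated tensor, one option is to replay the name sort and accumulate widths. The sketch below is an illustration only: `Col` and `output_slices` are hypothetical helpers, not TensorFlow APIs, and the widths (3 each) are read off the shapes printed above.

```python
from collections import namedtuple

# Hypothetical stand-in for a dense feature column: a name and an output width.
Col = namedtuple("Col", ["name", "width"])


def output_slices(columns):
    """Map each column name to its (start, end) range in the concatenated output."""
    slices = {}
    offset = 0
    # Same sort key as input_layer: the column name.
    for col in sorted(columns, key=lambda c: c.name):
        slices[col.name] = (offset, offset + col.width)
        offset += col.width
    return slices


# Columns in the order they were passed to input_layer above; each is 3 wide.
passed = [
    Col("u_wu211_indicator", 3),
    Col("u_wu215_indicator", 3),
    Col("r_rsp113_indicator", 3),
    Col("u_wu211_X_u_wu215_indicator", 3),
]
slices = output_slices(passed)
for col in passed:
    print(col.name, slices[col.name])
```

With these inputs, u_wu211_indicator, the first column passed in, occupies positions 6 to 8 of the 12-wide row rather than positions 0 to 2.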
 
What the documentation says
    feature_columns: An iterable containing the FeatureColumns to use as inputs
      to your model. All items should be instances of classes derived from
      `_DenseColumn` such as `numeric_column`, `embedding_column`,
      `bucketized_column`, `indicator_column`. If you have categorical features,
      you can wrap them with an `embedding_column` or `indicator_column`.
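- The feature_columns parameter takes an iterable containing the FeatureColumns used by the model. Every item should be an instance of a class derived from _DenseColumn, such as numeric_column, embedding_column, bucketized_column, or indicator_column. Categorical features need to be wrapped with embedding_column or indicator_column first.
- Nothing here says anything about the order of the features.
Digging into the source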
- tf.feature_column.input_layer
@tf_export(v1=['feature_column.input_layer'])
def input_layer(features,
                feature_columns,
                weight_collections=None,
                trainable=True,
                cols_to_vars=None,
                cols_to_output_tensors=None):
  """Returns a dense `Tensor` as input layer based on given `feature_columns`.
  Generally a single example in training data is described with FeatureColumns.
  At the first layer of the model, this column oriented data should be converted
  to a single `Tensor`.
  Example:
  ``python
  price = numeric_column('price')
  keywords_embedded = embedding_column(
      categorical_column_with_hash_bucket("keywords", 10K), dimensions=16)
  columns = [price, keywords_embedded, ...]
  features = tf.io.parse_example(..., features=make_parse_example_spec(columns))
  dense_tensor = input_layer(features, columns)
  for units in [128, 64, 32]:
    dense_tensor = tf.compat.v1.layers.dense(dense_tensor, units, tf.nn.relu)
  prediction = tf.compat.v1.layers.dense(dense_tensor, 1)
  ``
  Args:
    features: A mapping from key to tensors. `_FeatureColumn`s look up via these
      keys. For example `numeric_column('price')` will look at 'price' key in
      this dict. Values can be a `SparseTensor` or a `Tensor` depends on
      corresponding `_FeatureColumn`.
    feature_columns: An iterable containing the FeatureColumns to use as inputs
      to your model. All items should be instances of classes derived from
      `_DenseColumn` such as `numeric_column`, `embedding_column`,
      `bucketized_column`, `indicator_column`. If you have categorical features,
      you can wrap them with an `embedding_column` or `indicator_column`.
    weight_collections: A list of collection names to which the Variable will be
      added. Note that variables will also be added to collections
      `tf.GraphKeys.GLOBAL_VARIABLES` and `ops.GraphKeys.MODEL_VARIABLES`.
    trainable: If `True` also add the variable to the graph collection
      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
    cols_to_vars: If not `None`, must be a dictionary that will be filled with a
      mapping from `_FeatureColumn` to list of `Variable`s.  For example, after
      the call, we might have cols_to_vars =
      {_EmbeddingColumn(
        categorical_column=_HashedCategoricalColumn(
          key='sparse_feature', hash_bucket_size=5, dtype=tf.string),
        dimension=10): [<tf.Variable 'some_variable:0' shape=(5, 10),
                        <tf.Variable 'some_variable:1' shape=(5, 10)]}
      If a column creates no variables, its value will be an empty list.
    cols_to_output_tensors: If not `None`, must be a dictionary that will be
      filled with a mapping from '_FeatureColumn' to the associated
      output `Tensor`s.
  Returns:
    A `Tensor` which represents input layer of a model. Its shape
    is (batch_size, first_layer_dimension) and its dtype is `float32`.
    first_layer_dimension is determined based on given `feature_columns`.
  Raises:
    ValueError: if an item in `feature_columns` is not a `_DenseColumn`.
  """
  return _internal_input_layer(
      features,
      feature_columns,
      weight_collections=weight_collections,
      trainable=trainable,
      cols_to_vars=cols_to_vars,
      cols_to_output_tensors=cols_to_output_tensors)
- _internal_input_layer
def _internal_input_layer(features,
                          feature_columns,
                          weight_collections=None,
                          trainable=True,
                          cols_to_vars=None,
                          scope=None,
                          cols_to_output_tensors=None,
                          from_template=False):
  """See input_layer. `scope` is a name or variable scope to use."""
  feature_columns = _normalize_feature_columns(feature_columns)
  for column in feature_columns:
    if not isinstance(column, _DenseColumn):
      raise ValueError(
          'Items of feature_columns must be a _DenseColumn. '
          'You can wrap a categorical column with an '
          'embedding_column or indicator_column. Given: {}'.format(column))
  weight_collections = list(weight_collections or [])
  if ops.GraphKeys.GLOBAL_VARIABLES not in weight_collections:
    weight_collections.append(ops.GraphKeys.GLOBAL_VARIABLES)
  if ops.GraphKeys.MODEL_VARIABLES not in weight_collections:
    weight_collections.append(ops.GraphKeys.MODEL_VARIABLES)
  def _get_logits():  # pylint: disable=missing-docstring
    builder = _LazyBuilder(features)
    output_tensors = []
    ordered_columns = []
    for column in sorted(feature_columns, key=lambda x: x.name):
      ordered_columns.append(column)
      with variable_scope.variable_scope(
          None, default_name=column._var_scope_name):  # pylint: disable=protected-access
        tensor = column._get_dense_tensor(  # pylint: disable=protected-access
            builder,
            weight_collections=weight_collections,
            trainable=trainable)
        num_elements = column._variable_shape.num_elements()  # pylint: disable=protected-access
        batch_size = array_ops.shape(tensor)[0]
        output_tensor = array_ops.reshape(
            tensor, shape=(batch_size, num_elements))
        output_tensors.append(output_tensor)
        if cols_to_vars is not None:
          # Retrieve any variables created (some _DenseColumn's don't create
          # variables, in which case an empty list is returned).
          cols_to_vars[column] = ops.get_collection(
              ops.GraphKeys.GLOBAL_VARIABLES,
              scope=variable_scope.get_variable_scope().name)
        if cols_to_output_tensors is not None:
          cols_to_output_tensors[column] = output_tensor
    _verify_static_batch_size_equality(output_tensors, ordered_columns)
    return array_ops.concat(output_tensors, 1)
  # If we're constructing from the `make_template`, that by default adds a
  # variable scope with the name of the layer. In that case, we dont want to
  # add another `variable_scope` as that would break checkpoints.
  if from_template:
    return _get_logits()
  else:
    with variable_scope.variable_scope(
        scope, default_name='input_layer', values=features.values()):
      return _get_logits()
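- Two things to note:
- In _get_logits, _LazyBuilder deduplicates features that are referenced more than once and initializes them lazily
- When the columns are added, a sort is introduced, keyed on each feature_column's name (generated by TensorFlow, for example 'u_wu211_indicator', 'u_wu215_indicator', 'r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator').
- The code is as follows: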
 
  def _get_logits():  # pylint: disable=missing-docstring
    builder = _LazyBuilder(features)
    output_tensors = []
    ordered_columns = []
    for column in sorted(feature_columns, key=lambda x: x.name):
      ordered_columns.append(column)
      with variable_scope.variable_scope(
          None, default_name=column._var_scope_name):  # pylint: disable=protected-access
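The deduplication and lazy initialization mentioned in the first point can be pictured as a compute-once cache. The sketch below is a simplified, hypothetical stand-in (`LazyCache` is not a TensorFlow class, and the real _LazyBuilder differs in its details); the feature values are illustrative only.

```python
class LazyCache:
    """Hypothetical, simplified stand-in for the caching idea behind _LazyBuilder."""

    def __init__(self, features):
        self._features = dict(features)  # raw inputs, keyed by feature name
        self._cache = {}                 # results computed so far
        self.transform_calls = 0         # counter, only for demonstration

    def get(self, key, transform):
        # Compute on first request only; later requests reuse the cached value.
        if key not in self._cache:
            self.transform_calls += 1
            self._cache[key] = transform(self._features[key])
        return self._cache[key]


vocab = ["0", "1", "2"]


def to_ids(values):
    """Look up each raw string in the vocabulary list."""
    return [vocab.index(v) for v in values]


cache = LazyCache({"u_wu211": ["0", "2"]})  # illustrative feature values
first = cache.get("u_wu211", to_ids)   # computed here
second = cache.get("u_wu211", to_ids)  # served from the cache
print(first, second, cache.transform_calls)
```

In the example from this post, u_wu211 is referenced both by its own indicator column and by the crossed column, which is the kind of repeated reference such a cache avoids recomputing.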
Verifying the conclusion
In [29]: fcs = [tf.feature_column.indicator_column(u_wu211),
    ...:   tf.feature_column.indicator_column(u_wu215),
    ...: tf.feature_column.indicator_column(r_rsp113),
    ...: tf.feature_column.indicator_column(u_wu211_u_wu215_cross)
    ...: ]
In [30]: sorted( fcs, key=lambda x: x.name)
Out[30]:
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='r_rsp113', vocabulary_list=('0', '-1', '1'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=CrossedColumn(keys=(VocabularyListCategoricalColumn(key='u_wu211', vocabulary_list=('0', '1', '2'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='u_wu215', vocabulary_list=('00s', '10s', '90s'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), hash_bucket_size=3, hash_key=None)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='u_wu211', vocabulary_list=('0', '1', '2'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='u_wu215', vocabulary_list=('00s', '10s', '90s'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
In [31]: [x.name for x in sorted( fcs, key=lambda x: x.name)]
Out[31]:
['r_rsp113_indicator',
 'u_wu211_X_u_wu215_indicator',
 'u_wu211_indicator',
 'u_wu215_indicator']
- Expected result:
- ['r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator', 'u_wu211_indicator', 'u_wu215_indicator']
- That is: [0. 1. 0.] + [0. 0. 1.] + [1. 0. 0.] + [0. 0. 0.]
- [0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
- This matches the first row that input_layer actually returned.
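The same check can be replayed in plain Python, together with one possible way to get back to the order the columns were passed in. This is a sketch under stated assumptions: the per-column values are copied from the single-column calls earlier in the post, and `restore` is a hypothetical index list, not something TensorFlow produces.

```python
# Per-column outputs for the first sample, in the order the columns were passed.
per_column = {
    "u_wu211_indicator": [1.0, 0.0, 0.0],
    "u_wu215_indicator": [0.0, 0.0, 0.0],
    "r_rsp113_indicator": [0.0, 1.0, 0.0],
    "u_wu211_X_u_wu215_indicator": [0.0, 0.0, 1.0],
}
passed_order = list(per_column)  # dicts keep insertion order (Python 3.7+)

# What input_layer does: concatenate the columns in name-sorted order.
sorted_order = sorted(passed_order)
row = [v for name in sorted_order for v in per_column[name]]

# Starting offset of every column inside the sorted-order row.
start = {}
offset = 0
for name in sorted_order:
    start[name] = offset
    offset += len(per_column[name])

# Index list that rearranges the sorted-order row back into the passed order.
restore = [
    start[name] + i
    for name in passed_order
    for i in range(len(per_column[name]))
]
restored = [row[i] for i in restore]

print(row)
print(restored)
```

An index list like `restore` could in principle be applied to the real output with `tf.gather(output, restore, axis=1)`; that step is not verified here. The `cols_to_output_tensors` argument in the signature quoted above is another route, since it exposes each column's output tensor separately.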
 
