Linux内核中块层上的多队列

如果你想知道SSD为什么使用多队列，可以看看这篇文章：https://kernel.dk/blk-mq.pdf

1. 多块层

以下关于多队列层的总结来自 The Multi-Queue Interface Article，Linux kernel git 展示了如何转换为blk-mq。

blk_mq 的API实现了两级块层设计，该设计使用两组独立的请求队列。

软件暂存队列，按CPU分配；
硬件调度队列，其数量通常与blcok设备支持的实际硬件队列数量匹配。

如上图所示，软件暂存队列和硬件调度队列之间的映射队列数不同。

现在，我们考虑两个队列放置不同的情况。

假设这里有3种情况：

软件暂存队列 > 硬件调度队列

在这种情况下，两个或多个软件暂存队列被分配到一个硬件上下文中。而在硬件上下文将从所有关联的软件队列中拉入请求的同时，进行一次调度。

软件暂存队列 < 硬件调度队列

在这种情况下，软件暂存队列和硬件调度队列之间的映射是有顺序的。

软件暂存队列 == 硬件调度队列

在这种情况下，这是最简单的情况，即执行直接1：1映射。

2. 多队列块层中的主要数据结构

2.1 基本结构

2.2 blk_mq_reg（在内核4.5中）

根据 The Multi-Queue Interface Article，blk_mq_reg 结构包含了一个新的块设备向块层注册时需要的所有重要信息。

这个数据结构包括指向 blk_mq_ops 数据结构的指针，用于跟踪多队列块层与设备驱动交互的具体例程。

blk_mq_reg 结构还保存了需要初始化的硬件队列数量等。

但是，blk_mq_reg 已经不复存在。我们需要通过 blk_mq_ops 来了解块层和块设备之间的操作。

因此你可以在 kernel 3.15 中找到以下数据结构：

struct blk_mq_reg {

  struct blk_mq_ops       *ops;

  unsigned int            nr_hw_queues;

  unsigned int            queue_depth;

  unsigned int            reserved_tags;

  unsigned int            cmd_size;       /* per-request extra data */

  int                     numa_node;

  unsigned int            timeout;

  unsigned int            flags;          /* BLK_MQ_F_* */

};

但是，在内核4.5中，你已经无法找到。

我认为在内核4.5中，此数据结构已更改为 struct blk_mq_tag_set *set

struct blk_mq_tag_set {

  struct blk_mq_ops       *ops;

  unsigned int            nr_hw_queues;

  unsigned int            queue_depth;    /* max hw supported */

  unsigned int            reserved_tags;

  unsigned int            cmd_size;       /* per-request extra data */

  int                     numa_node;

  unsigned int            timeout;

  unsigned int            flags;          /* BLK_MQ_F_* */

  void                    *driver_data;

  struct blk_mq_tags      **tags;

  struct mutex            tag_list_lock;

  struct list_head        tag_list;

};

因为内核3.15中的函数 function(struct request_queue * blk_mq_init_queue(struct blk_mq_reg * reg，void * driver_data)) ，在内核4.5中已经更改为 function(struct request_queue * blk_mq_init_queue(struct blk_mq_tag_set * set))

2.3 blk_mq_ops 结构（在内核4.5中）

如上文中所述，此数据结构用于多队列块层与块设备层进行通信。

在此数据结构中，执行 blk_mq_hw_ctx 和 blk_mq_ctx 之间上下文映射的函数存储在 map_queue 字段中。

struct blk_mq_ops {

  /*

   * Queue request

   */

  queue_rq_fn             *queue_rq; // this part

  /*

   * Map to specific hardware queue

   */

  map_queue_fn            *map_queue; // this part

  /*

   * Called on request timeout

   */

  timeout_fn              *timeout;

  /*

   * Called to poll for completion of a specific tag.

   */

  poll_fn                 *poll;

  softirq_done_fn         *complete;

  /*

   * Called when the block layer side of a hardware queue has been

   * set up, allowing the driver to allocate/init matching structures.

   * Ditto for exit/teardown.

   */

  init_hctx_fn            *init_hctx;

  exit_hctx_fn            *exit_hctx;

  /*

   * Called for every command allocated by the block layer to allow

   * the driver to set up driver specific data.

   *

   * Tag greater than or equal to queue_depth is for setting up

   * flush request.

   *

   * Ditto for exit/teardown.

   */

  init_request_fn         *init_request;

  exit_request_fn         *exit_request;

};

2.4 blk_mq_hw_ctx 结构（在内核4.5中）

blk_mq_hw_ctx 结构表示与 request_queue 关联的硬件上下文。

这个对应的结构是内核4.5中的 blk_mq_ctx 结构。

struct blk_mq_hw_ctx {

        struct {

                spinlock_t              lock;

                struct list_head        dispatch;

        } ____cacheline_aligned_in_smp;

        unsigned long           state;          /* BLK_MQ_S_* flags */

        struct delayed_work     run_work;

        struct delayed_work     delay_work;

        cpumask_var_t           cpumask;

        int                     next_cpu;

        int                     next_cpu_batch;

        unsigned long           flags;          /* BLK_MQ_F_* flags */

        struct request_queue    *queue;

        struct blk_flush_queue  *fq;

        void                    *driver_data;

        struct blk_mq_ctxmap    ctx_map;

        unsigned int            nr_ctx;

        struct blk_mq_ctx       **ctxs;

        atomic_t                wait_index;

        struct blk_mq_tags      *tags;

        unsigned long           queued;

        unsigned long           run;

#define BLK_MQ_MAX_DISPATCH_ORDER       10

        unsigned long           dispatched[BLK_MQ_MAX_DISPATCH_ORDER];

        unsigned int            numa_node;

        unsigned int            queue_num;

        atomic_t                nr_active;

        struct blk_mq_cpu_notifier      cpu_notifier;

        struct kobject          kobj;

        unsigned long           poll_invoked;

        unsigned long           poll_success;

};

2.5 blk_mq_ctx 结构（在内核4.5中）

如上文所述，blk_mq_ctx 作为软件暂存队列已分配给每个CPU。

struct blk_mq_ctx {

  struct {

    spinlock_t              lock;

    struct list_head        rq_list;

  }  ____cacheline_aligned_in_smp;

  unsigned int            cpu;

  unsigned int            index_hw;

  unsigned int            last_tag ____cacheline_aligned_in_smp;

  /* incremented at dispatch time */

  unsigned long           rq_dispatched[2];

  unsigned long           rq_merged;

  /* incremented at completion time */

  unsigned long           ____cacheline_aligned_in_smp rq_completed[2];

  struct request_queue    *queue;

  struct kobject          kobj;

} ____cacheline_aligned_in_smp;

2.6 request_queue 结构（在内核4.5中）

blk_mq_hw_ctx 和 blk_mq_ctx 之间的上下文映射是建立在blk_mq_ops 结构的 map_queue 字段上的。在内核4.5中，这个映射仍然是 mq_map 字段，在与块设备相关的 request_queue 数据结构中。

struct request_queue {

  /*

   * Together with queue_head for cacheline sharing

   */

  struct list_head        queue_head;

  struct request          *last_merge;

  struct elevator_queue   *elevator;

  int                     nr_rqs[2];      /* # allocated [a]sync rqs */

  int                     nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */

  /*

   * If blkcg is not used, @q->root_rl serves all requests.  If blkcg

   * is used, root blkg allocates from @q->root_rl and all other

   * blkgs from their own blkg->rl.  Which one to use should be

   * determined using bio_request_list().

   */

  struct request_list     root_rl;

  request_fn_proc         *request_fn;

  make_request_fn         *make_request_fn;

  prep_rq_fn              *prep_rq_fn;

  unprep_rq_fn            *unprep_rq_fn;

  softirq_done_fn         *softirq_done_fn;

  rq_timed_out_fn         *rq_timed_out_fn;

  dma_drain_needed_fn     *dma_drain_needed;

  lld_busy_fn             *lld_busy_fn;

  struct blk_mq_ops       *mq_ops;

  unsigned int            *mq_map;

  /* sw queues */

  struct blk_mq_ctx __percpu      *queue_ctx;

  unsigned int            nr_queues;

  /* hw dispatch queues */

  struct blk_mq_hw_ctx    **queue_hw_ctx;

  unsigned int            nr_hw_queues;

  /*

   * Dispatch queue sorting

   */

  sector_t                end_sector;

  struct request          *boundary_rq;

  /*

   * Delayed queue handling

   */

  struct delayed_work     delay_work;

  struct backing_dev_info backing_dev_info;

  /*

   * The queue owner gets to use this for whatever they like.

   * ll_rw_blk doesn't touch it.

   */

  void                    *queuedata;

  /*

   * various queue flags, see QUEUE_* below

   */

  unsigned long           queue_flags;

  /*

   * ida allocated id for this queue.  Used to index queues from

   * ioctx.

   */

  int                     id;

  /*

   * queue needs bounce pages for pages above this limit

   */

  gfp_t                   bounce_gfp;

  /*

   * protects queue structures from reentrancy. ->__queue_lock should

   * _never_ be used directly, it is queue private. always use

   * ->queue_lock.

   */

  spinlock_t              __queue_lock;

  spinlock_t              *queue_lock;

  /*

   * queue kobject

   */

  struct kobject kobj;

  /*

   * mq queue kobject

   */

  struct kobject mq_kobj;

  #ifdef  CONFIG_BLK_DEV_INTEGRITY

  struct blk_integrity integrity;

  #endif  /* CONFIG_BLK_DEV_INTEGRITY */

  #ifdef CONFIG_PM

  struct device           *dev;

  int                     rpm_status;

  unsigned int            nr_pending;

  #endif

  /*

   * queue settings

   */

  unsigned long           nr_requests;    /* Max # of requests */

  unsigned int            nr_congestion_on;

  unsigned int            nr_congestion_off;

  unsigned int            nr_batching;

  unsigned int            dma_drain_size;

  void                    *dma_drain_buffer;

  unsigned int            dma_pad_mask;

  unsigned int            dma_alignment;

  struct blk_queue_tag    *queue_tags;

  struct list_head        tag_busy_list;

  unsigned int            nr_sorted;

  unsigned int            in_flight[2];

  /*

   * Number of active block driver functions for which blk_drain_queue()

   * must wait. Must be incremented around functions that unlock the

   * queue_lock internally, e.g. scsi_request_fn().

   */

  unsigned int            request_fn_active;

  unsigned int            rq_timeout;

  struct timer_list       timeout;

  struct work_struct      timeout_work;

  struct list_head        timeout_list;

  struct list_head        icq_list;

  #ifdef CONFIG_BLK_CGROUP

  DECLARE_BITMAP          (blkcg_pols, BLKCG_MAX_POLS);

  struct blkcg_gq         *root_blkg;

  struct list_head        blkg_list;

  #endif

  struct queue_limits     limits;

  /*

   * sg stuff

   */

  unsigned int            sg_timeout;

  unsigned int            sg_reserved_size;

  int                     node;

  #ifdef CONFIG_BLK_DEV_IO_TRACE

  struct blk_trace        *blk_trace;

  #endif

  /*

   * for flush operations

   */

  unsigned int            flush_flags;

  unsigned int            flush_not_queueable:1;

  struct blk_flush_queue  *fq;

  struct list_head        requeue_list;

  spinlock_t              requeue_lock;

  struct work_struct      requeue_work;

  struct mutex            sysfs_lock;

  int                     bypass_depth;

  atomic_t                mq_freeze_depth;

  #if defined(CONFIG_BLK_DEV_BSG)

  bsg_job_fn              *bsg_job_fn;

  int                     bsg_job_size;

  struct bsg_class_device bsg_dev;

  #endif

  #ifdef CONFIG_BLK_DEV_THROTTLING

  /* Throttle data */

  struct throtl_data *td;

  #endif

  struct rcu_head         rcu_head;

  wait_queue_head_t       mq_freeze_wq;

  struct percpu_ref       q_usage_counter;

  struct list_head        all_q_node;

  struct blk_mq_tag_set   *tag_set;

  struct list_head        tag_set_list;

  struct bio_set          *bio_split;

  bool                    mq_sysfs_init_done;

};

3. 队列初始化

当一个新的使用多队列 API 的设备驱动被加载时，它会创建并初始化一个新的 blk_mq_ops 结构，并将一个新的 blk_mq_reg 的相关指针设置为它的地址。

更详细的说，除了下面的结构，其他的操作现在都是严格要求的。

但是，为了在上下文分配或 I/O 请求完成时执行特定的操作，可以指定其他操作。

作为必要的数据，驱动程序必须初始化它所支持的提交队列的数量，以及它们的大小。

其他数据也是必须的，以确定驱动所支持的命令大小，以及必须暴露给块层的特定标志。

但是，在内核4.5版本中，struct blk_mq_tag_set 很重要，上面的工作就是在这个 struct blk_mq_tag_set 中实现的。

3.1 queue_fn

必须将其设置为负责处理命令的功能，例如，通过将命令传递给低级驱动程序。

3.2 map_queue

执行硬件和软件上下文之间的映射。

3.3 blk_mq_init_queue 函数（在内核4.5中）

在为设备相关的 gendisk 和 request_queue 做好准备后，驱动调用 blk_mq_init_queue 函数。(在内核4.5中)

这个函数初始化硬件和软件上下文，并执行它们之间的映射。

这个初始化例程还设置了一个备用的 make_request 函数，代替了传统的请求提交路径，其中包括函数 blk_make_request() (在内核4.5中)的多队列提交路径(其中包括 blk_mq_make_request() 函数)。

换句话说，备用的 make_request 函数是用 blk_queue_make_reqeust() 设置的。

4. 提交请求

设备初始化用 blk_mq_make_request() 代替了传统的块 I/O 提交函数(在内核4.5中)，让多队列结构从上层的角度来使用。

多队列块层使用的 make_request (在内核4.5中不存在) 函数包含了从进程阻塞中获益的可能性，但只适用于支持单个硬件队列或异步请求的驱动。

如果请求是同步的，并且驱动主动使用多队列接口，则不会阻塞。

如果允许阻塞，make_request 函数也会执行请求合并，先在任务的阻塞列表里面搜索一个候选者。

最后在映射到当前 CPU 的软件队列中，提交路径不涉及任何 I/O 调度相关的回调。

最后，make_request 会立即将任何同步请求发送到相关硬件队列中，而在 async 或 flush 请求的情况下，它会延迟这个过渡，以便后续的合并和更高效的调度。

5. 请求调度

如果一个 I/O 请求是同步的（因此不允许在多队列块层中阻塞），它对设备驱动程序的调度是在同一请求的上下文中进行的。

如果请求是 async 或 flush，则存在任务阻塞。调度的顺序如下：

在提交另一个I/O请求给与同一硬件队列相关联的软件队列时。
在reqeust提交过程中安排的延迟工作被执行时。

多队列块层主要的排队等候函数是 blk_mq_run_hw_queue() (在内核4.5代码中)，它基本依赖于另一个由其 blk_mq_ops 结构的 queue_rq (在内核4.5中)字段指向的驱动专用例程。

重要：我们必须检查 blk_mq_run_hw_queue 和 request_qu 之间的关系！

重要：这个函数可以延迟队列的任何运行，同时它可以立即向驱动发送一个同步请求。

内部函数 __blk_mq_run_hw_queue() (在内核4.5中)，在 reqeust 是同步的情况下被 blk_mq_run_hw_queue() (在内核4.5代码中)调用，首先加入与当前服务的硬件队列相关联的任何软件队列，然后它将结果列表与已经在调度列表中的任何条目加入。

在收集了所有待服务的条目之后，函数 __blk_mq_run_hw_queue() (在内核4.5中)处理它们（这些条目），启动每个 reqeust，并通过它的 queue_rq 函数将其传递给驱动。

该函数最后通过重排或删除相关请求处理可能出现的错误。

原文：https://hyunyoung2.github.io/2016/09/14/Multi_Queue/