Introducing shard translator

by Krutika Dhananjay on December 23, 2015

GlusterFS-3.7.0 saw the release of the sharding feature, among several others. The feature was tagged as “experimental” as it was still in the initial stages of development back then. Here is an introduction to the feature:

Why shard translator?

GlusterFS’ answer to very large files (those which can grow beyond a single brick) had never been clear. There is a stripe translator which allows you to do that, but it comes at the cost of flexibility: servers can be added only in multiples of stripe-count x replica-count (for example, with stripe-count 4 and replica-count 2, bricks must be added 8 at a time), and mixing striped and unstriped files is not possible in an “elegant” way. This also happens to be a big limiting factor for the big data/Hadoop use case, where super large files are the norm (and where you want to split a file even if it could fit within a single server). The proposed solution is to replace the current stripe translator with a new “shard” translator.

What?

Unlike stripe, shard is not a cluster translator. It is placed on top of DHT. Initially, all files are created as normal files, up to a certain configurable size. The first block (default 4MB) is stored like a normal file under its parent directory. Further blocks are each stored in a separate namespace, in a file named by the GFID of the original file and the block index (like /.shard/GFID1.1, /.shard/GFID1.2 … /.shard/GFID1.N). File I/O to a particular offset goes to the appropriate “piece file”, creating it if necessary. The aggregated file size and block count are stored in the xattrs of the original file.
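
To make the layout concrete, here is an illustrative back-of-the-envelope check (placeholder GFID; this is not actual translator code) of where a write at a 10MB offset lands with the default 4MB block size:

# GFID=GFID1    # placeholder for the file’s actual GFID
# BLOCK=$((10 * 1024 * 1024 / (4 * 1024 * 1024)))    # write offset / block size
# echo "a write at offset 10MB lands in /.shard/$GFID.$BLOCK"
a write at offset 10MB lands in /.shard/GFID1.2

Block index 0 is the base file itself; every other index N maps to /.shard/GFID.N.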

Usage:

Here I have a 2×2 distributed-replicated volume.

# gluster volume info
Volume Name: dis-rep
Type: Distributed-Replicate
Volume ID: 96001645-a020-467b-8153-2589e3a0dee3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/1
Brick2: server2:/bricks/2
Brick3: server3:/bricks/3
Brick4: server4:/bricks/4
Options Reconfigured:
performance.readdir-ahead: on

To enable sharding on it, this is what I do:

# gluster volume set dis-rep features.shard on
volume set: success

Now, to configure the shard block size to 16MB, this is what I do:

# gluster volume set dis-rep features.shard-block-size 16MB
volume set: success
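
The reconfigured options should now also show up in the volume info (output sketched here, not captured verbatim):

# gluster volume info dis-rep | grep shard
features.shard-block-size: 16MB
features.shard: on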

How files are sharded:

Now I write 84MB of data into a file named ‘testfile’.

# dd if=/dev/urandom of=/mnt/glusterfs/testfile bs=1M count=84
84+0 records in
84+0 records out
88080384 bytes (88 MB) copied, 13.2243 s, 6.7 MB/s

Let’s check the backend to see how the file was sharded into pieces and how those pieces got distributed across the bricks:

# ls /bricks/* -lh
/bricks/1:
total 0

/bricks/2:
total 0

/bricks/3:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

/bricks/4:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

So the base file hashed to the second replica set (brick3 and brick4, which form a replica pair), and it is only 16M in size. Where did the remaining 68MB worth of data go? To find out, let’s check the contents of the hidden directory .shard on all bricks:

# ls /bricks/*/.shard -lh
/bricks/1/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/2/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/3/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

/bricks/4/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

So the file was split into 6 pieces in all: 5 of them residing in the hidden directory “/.shard”, distributed across the replica sets based on disk space availability and the piece-file name hash, and the first block residing in its native parent directory. Notice how blocks 1 through 4 are all 16M in size while the last block (block 5) is 4M.

Now let’s do some math to see how ‘testfile’ was “sharded”:

The total size of the write was 84MB, and the configured block size in this case is 16MB. 84MB divided by 16MB gives 5 full blocks, with a remainder of 4MB.

So the file was broken into 6 pieces in all: 5 pieces of 16MB each (the base file plus shards .1 through .4), and a sixth piece (shard .5) holding the remaining 4MB.
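
The same arithmetic as a quick shell sanity check (illustrative only, using bash integer arithmetic):

# FILE_MB=84; BLOCK_MB=16
# echo "full blocks: $((FILE_MB / BLOCK_MB)), remainder: $((FILE_MB % BLOCK_MB))MB"
full blocks: 5, remainder: 4MB
# echo "pieces in all: $((FILE_MB / BLOCK_MB + (FILE_MB % BLOCK_MB > 0 ? 1 : 0)))"
pieces in all: 6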

Now, when we view the file from the mount point, it appears as one single file:

# ls -lh /mnt/glusterfs/
total 85M
-rw-r--r--. 1 root root 84M Dec 24 12:36 testfile

Notice how the file is shown to be 84MB in size at the mount point. Similarly, when the file is read by an application, the different pieces or ‘shards’ are stitched together and presented to the application as if no chunking had been done at all.
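
One way to observe this transparency (output sketched, not captured from a live run): the byte size reported through the mount matches the dd output above, while on the bricks the aggregated size is recorded in the shard xattrs of the base file (xattr names as of the 3.7 series; hex values elided here):

# stat -c '%s %n' /mnt/glusterfs/testfile
88080384 /mnt/glusterfs/testfile
# getfattr -d -m . -e hex /bricks/3/testfile | grep shard
trusted.glusterfs.shard.block-size=0x...
trusted.glusterfs.shard.file-size=0x...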

Advantages of sharding:

The advantages of sharding a file over striping it across a finite set of bricks are:

  • Data blocks are distributed by DHT in the “normal” way.
  • Servers can be added in any number (even one at a time), and DHT’s rebalance will spread the piece files out evenly.
  • Sharding provides better utilization of disk space. It is no longer necessary to have at least one brick of size X in order to accommodate a file of size X, where X is really large. Consider this example: a distribute volume is made up of 3 bricks of size 10GB, 20GB and 30GB. With this configuration, it is impossible to store a file greater than 30GB in size on the volume. Sharding eliminates this limitation: a file of up to 60GB can be stored on the same volume.
  • Self-heal of a large file is now distributed as smaller files across more servers, leading to better heal performance and lower CPU usage, which is particularly a pain point for large-file workloads.
  • The piece-file naming scheme is immune to renames and hard links.
  • When geo-replicating a large file to a remote volume, only the shards that changed need to be synced to the slave, considerably reducing the sync time.
  • When sharding is used in conjunction with tiering, only the shards that change are promoted/demoted, reducing the amount of data that needs to be migrated between the hot and cold tiers.
  • When sharding is used in conjunction with the bit-rot detection feature of GlusterFS, checksums are computed on the smaller shards as opposed to one large file.

Yes, sharding in its current form is not compatible with directory quota. This is something we are going to focus on in the coming days: making sharding compatible with other Gluster features, including directory quota and user/group quota (a feature currently in the design phase).

Thanks,
Krutika
