Introducing shard translator

by Krutika Dhananjay on December 23, 2015

GlusterFS-3.7.0 saw the release of the sharding feature, among several others. The feature was tagged as “experimental” as it was still in the initial stages of development back then. Here is an introduction to the feature:

Why shard translator?

GlusterFS’ answer to very large files (those which can grow beyond a single brick) had never been clear. There is a stripe translator which allows you to do that, but it comes at the cost of flexibility: servers can be added only in multiples of stripe-count x replica-count (for example, with stripe-count 4 and replica-count 2, bricks must be added 8 at a time), and mixing striped and unstriped files is not possible in an “elegant” way. This also happens to be a big limiting factor for the big data/Hadoop use case, where super large files are the norm (and where you want to split a file even if it could fit within a single server). The proposed solution is to replace the current stripe translator with a new “shard” translator.

What?

Unlike stripe, shard is not a cluster translator. It is placed on top of DHT. Initially, all files are created as normal files, up to a certain configurable size. The first block (default 4MB) is stored like a normal file under its parent directory. Further blocks are each stored in a separate namespace, in a file named by the GFID of the original file and the block index (like /.shard/GFID1.1, /.shard/GFID1.2 … /.shard/GFID1.N). File I/O to a particular offset goes to the appropriate “piece file”, creating it if necessary. The aggregated file size and block count are stored in the xattrs of the original file.
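
To make the layout concrete, here is an illustrative back-of-the-envelope check (placeholder GFID; this is not actual translator code) of where a write at a 10MB offset lands with the default 4MB block size:

# GFID=GFID1    # placeholder for the file’s actual GFID
# BLOCK=$((10 * 1024 * 1024 / (4 * 1024 * 1024)))    # write offset / block size
# echo "a write at offset 10MB lands in /.shard/$GFID.$BLOCK"
a write at offset 10MB lands in /.shard/GFID1.2

Block index 0 is the base file itself; every other index N maps to /.shard/GFID.N.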

Usage:

Here I have a 2×2 distributed-replicated volume.

# gluster volume info
Volume Name: dis-rep
Type: Distributed-Replicate
Volume ID: 96001645-a020-467b-8153-2589e3a0dee3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/1
Brick2: server2:/bricks/2
Brick3: server3:/bricks/3
Brick4: server4:/bricks/4
Options Reconfigured:
performance.readdir-ahead: on

To enable sharding on it, this is what I do:

# gluster volume set dis-rep features.shard on
volume set: success

Now, to configure the shard block size to 16MB, this is what I do:

# gluster volume set dis-rep features.shard-block-size 16MB
volume set: success
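
The reconfigured options should now also show up in the volume info (output sketched here, not captured verbatim):

# gluster volume info dis-rep | grep shard
features.shard-block-size: 16MB
features.shard: on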

How files are sharded:

Now I write 84MB of data into a file named ‘testfile’.

# dd if=/dev/urandom of=/mnt/glusterfs/testfile bs=1M count=84
84+0 records in
84+0 records out
88080384 bytes (88 MB) copied, 13.2243 s, 6.7 MB/s

Let’s check the backend to see how the file was sharded into pieces and how those pieces got distributed across the bricks:

# ls /bricks/* -lh
/bricks/1:
total 0

/bricks/2:
total 0

/bricks/3:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

/bricks/4:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

So the base file hashed to the second replica set (brick3 and brick4, which form a replica pair), and it is only 16M in size. Where did the remaining 68MB worth of data go? To find out, let’s check the contents of the hidden directory .shard on all bricks:

# ls /bricks/*/.shard -lh
/bricks/1/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/2/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/3/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

/bricks/4/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

So the file was split into 6 pieces in all: 5 of them residing in the hidden directory “/.shard”, distributed across the replica sets based on disk space availability and the piece-file name hash, and the first block residing in its native parent directory. Notice how blocks 1 through 4 are all 16M in size while the last block (block 5) is 4M.

Now let’s do some math to see how ‘testfile’ was “sharded”:

The total size of the write was 84MB, and the configured block size in this case is 16MB. 84MB divided by 16MB gives 5 full blocks, with a remainder of 4MB.

So the file was broken into 6 pieces in all: 5 pieces of 16MB each (the base file plus shards .1 through .4), and a sixth piece (shard .5) holding the remaining 4MB.
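
The same arithmetic as a quick shell sanity check (illustrative only, using bash integer arithmetic):

# FILE_MB=84; BLOCK_MB=16
# echo "full blocks: $((FILE_MB / BLOCK_MB)), remainder: $((FILE_MB % BLOCK_MB))MB"
full blocks: 5, remainder: 4MB
# echo "pieces in all: $((FILE_MB / BLOCK_MB + (FILE_MB % BLOCK_MB > 0 ? 1 : 0)))"
pieces in all: 6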

Now, when we view the file from the mount point, it appears as one single file:

# ls -lh /mnt/glusterfs/
total 85M
-rw-r--r--. 1 root root 84M Dec 24 12:36 testfile

Notice how the file is shown to be 84MB in size at the mount point. Similarly, when the file is read by an application, the different pieces or ‘shards’ are stitched together and presented to the application as if no chunking had been done at all.
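
One way to observe this transparency (output sketched, not captured from a live run): the byte size reported through the mount matches the dd output above, while on the bricks the aggregated size is recorded in the shard xattrs of the base file (xattr names as of the 3.7 series; hex values elided here):

# stat -c '%s %n' /mnt/glusterfs/testfile
88080384 /mnt/glusterfs/testfile
# getfattr -d -m . -e hex /bricks/3/testfile | grep shard
trusted.glusterfs.shard.block-size=0x...
trusted.glusterfs.shard.file-size=0x...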

Advantages of sharding:

The advantages of sharding a file over striping it across a finite set of bricks are:

  • Data blocks are distributed by DHT in the “normal” way.
  • Servers can be added in any number (even one at a time), and DHT’s rebalance will spread the piece files out evenly.
  • Sharding provides better utilization of disk space. It is no longer necessary to have at least one brick of size X in order to accommodate a file of size X, where X is really large. Consider this example: a distribute volume is made up of 3 bricks of size 10GB, 20GB and 30GB. With this configuration, it is impossible to store a file greater than 30GB in size on the volume. Sharding eliminates this limitation: a file of up to 60GB can be stored on the same volume.
  • Self-heal of a large file is now distributed as smaller files across more servers, leading to better heal performance and lower CPU usage, which is particularly a pain point for large-file workloads.
  • The piece-file naming scheme is immune to renames and hard links.
  • When geo-replicating a large file to a remote volume, only the shards that changed need to be synced to the slave, considerably reducing the sync time.
  • When sharding is used in conjunction with tiering, only the shards that change are promoted/demoted, reducing the amount of data that needs to be migrated between the hot and cold tiers.
  • When sharding is used in conjunction with the bit-rot detection feature of GlusterFS, checksums are computed on the smaller shards as opposed to one large file.

Yes, sharding in its current form is not compatible with directory quota. This is something we are going to focus on in the coming days: making sharding compatible with other Gluster features, including directory quota and user/group quota (a feature currently in the design phase).

Thanks,
Krutika
