1. I am Charles Humble and I am here at QCon London with Eva Andreasson from Cloudera. Eva, can you introduce yourself to the InfoQ community?

Who am I? I am Eva Andreasson and I am a product manager working for Cloudera at the moment. I also have a past as a JVM developer and I love Java.

2. So Cloudera is the main distributor for Apache Hadoop. Can you tell us a bit about what Hadoop is?

For someone who does not know what Hadoop is, it is a new way of processing data. Many people might be familiar with a database or a enterprise data warehouse and how you structure data up front to serve different queries, business insight queries you want to ask your data to serve your business.

Hadoop has come in as a very new way of thinking about how to process data. It kind of turns the world upside down. It actually allows you to store your data as is, you do not have to structure it at point of storage. You can decide later, as use cases come up, on how and what structure to apply to your data to serve various questions; meaning you do not have to scale down or lose any of your data to fit into traditional systems, you can actually store everything and decide later what you want to do with it. That is kind of the power of Hadoop. In addition, it is open source and it is linearly scalable and it is very, very cost efficient. So, it is this flexibility on the very cost efficient storage and processing framework has revolutionized the world in the data processing space.

3. Let's drill into that a little bit. So you have a couple of components; you have got a distributed file system [EVA: Yes] and then you've got HBase as well, which is kind of like a key value store, as I understand it, on top of HDFS. Is that right?

It is a column oriented key value store where you can have sparsely populated data, meaning many people use HBase, for instance, to save all your clicks on a web site. You might not click all the links, but for each visitor you save the data that visitor A clicked on all of these 10 links or visitor B clicked on 50 links, allowing you to match behavior. So you can look up a customer’s behavior on your web site and match out who else is similar, as they are visiting your web site, and maybe dynamically change the experience - present similar items that the previous customer with a similar behavior actually purchased. So, you can adapt your customer experience and do a better user experience, shorter sales cycle, and more efficient “get to the information you want on a web page”, as the visitors are there. So, it is very interesting use cases you can do with HBase.

4. If I am executing a MapReduce query on a set of files on HDFS where does that actually happen? Is the data being bought in, and is it being executed centrally, or is it being executed on all the individual nodes and then assembled afterwards? How does it work?

MapReduce is a Java framework and it follows a MapReduce algorithm, where in the mapping phase you write some code on how or what data is of relevance to solve a specific use case, a question you want to ask. So basically return all the rows from all my files that are stored in a distributed way on HDFS that contain “red sweater” or something like that. Then you get all the data that contains “red sweater”. It just maps it out, extracts that data, puts it into temporary work files that you then can apply logic to. You do the reducing step. You actually apply the structure in the mapping phase, while applying the logic in the reducing phase. So now you can count all those entries containing “red sweater” and you can do this over a petabyte of data or 100 terabytes or 1 terabyte. It does not matter because it is parallel processing and linearly scalable with as many nodes as you add to it. So the aggregation and logic in the reducing step returns the end result to you.

So it is a two-step algorithm, executed in parallel on a distributed file system and then aggregated back to you, with no need of structure up front. You decided when you designed your MapReduce job. But that is the simple way of using Hadoop. There are more advanced ways today.

5. That is quite a big shift, I think, in terms of how we think about working with data. So, if I am a large traditional enterprise, perhaps I have a big BI team and a bunch of DBAs who are very used to working with relational data, how easy is it for me to make the switch to working with something like Hadoop?

That is a very good question and I get it a lot. Working for Cloudera, we help customers do that all the time. There is a lot that has happened in the recent years. It is not just MapReduce, the batch oriented Java framework. You have a number of tools, a number of workloads that you can actually execute on the same integrated file storage. So, you have Hive which is a batch oriented query engine, you have Impala, which is a real time query engine, both supporting standard SQL and JDBC and ODBC. So what you can query, the application layer, many of the BI tools can already work on top of Hadoop and its ecosystem today.

Then you also have search, for instance free text search; we have recently integrated Solr to Cloudera’s distribution of Hadoop. Then you have other frameworks and machine learning and even running SasS workloads natively in this platform, allows you to do many of the workloads you previously did in your enterprise data warehouse, for instance, in a much more scalable and cost efficient way on Hadoop.

I would though recommend to start with something very well defined, something you know is repetitive, you have known results in your enterprise data warehouse, where you struggle from the business side perhaps with scale, you have to finish more data processing within the same fixed time window – that is a very common scenario. Take a well scoped out project and work with someone like Cloudera or some other vendor that has experience, some SI that has experience to do this kind of migration, and then learn from that. After that, you will be an expert and will see for sure the value of having Hadoop in your enterprise and you can grow incrementally. That is how our customers have been successful.

6. That is interesting. So I have got very different workloads there if I've got Impala and MapReduce and Solr all running on the same platform. How do you deal with things like resource management issues that I'm presuming that would throw up?

Yes. That is a very good question. We also get that a lot. With the new version coming out in March, we have a productionized YARN, which is a new way of resource scheduling. It is now production ready. It is going to be the default scheduling capacity within the entire platform. We have also added ways to control the resources through our production tool – Cloudera Manager – to make sure you can isolate various workloads from starving each other. We are starting with a static resource allocation for workloads that are more RAM focused, in-memory focused, such as, for instance Solr and HBase; so you can assure they can serve those real time use cases securely with having your data scientists or developer team explore with MapReduce and modeling and machine learning in the background, without affecting those running services.

But the whole idea through Cloudera Manager to control based on what user group they are and what priority they are and what resources should be firmly allocated or dynamically allocated to various use cases; that is key. That is what we provided through our production tool.

7. Do you have tools for fail over as well if say an HBase server goes down?

That is actually handled by ZooKeeper, which is the ... All these funny animal names in the Hadoop ecosystem. Apache Hadoop ecosystem, has a ZooKeeper - handling the fail over and the process management; a very key component that is seldom highlighted but it definitely deserves the spotlight. It handles HBase process management, it handles SolrCloud process management and it handles a lot of those fail over that you just expect from Cloudera’s distribution in production.

8. If I am, say, a Java developer and I want to get to grips with Hadoop, where would you suggest I start?

Come to our online free training on our website; that is a good place. Or download the quick-start VM to experiment, to wrap your head around this, because it opens the door to so many possibilities. You need that learning, you need that experimentation. You can also go and download HUE – Hadoop User Experience. It is a project launched by Cloudera, also open-source. There are a lot of “How to” videos that show different applications on how you can interact with the back-end; the various workloads on the enterprise data hub today. And ask us questions in our forums or in the open-source. There are many, many people involved; many engineers happy to help.

9. So of course you have a background with working on GC for Java and I am interested in ... We have talked a little bit about RAM intensive workloads with the search engine, with Solr and so on. Are there specific unsolved GC problems that have a significant impact on the work you are doing with Hadoop?

There are multiple questions in there. Let me try to think and dissect them. I think as we brought more real time workloads to Hadoop, to the enterprise data hub, there are more needs of in-memory workloads such as Solr and HBase and Impala. For Solr and HBase, they are very RAM intensive and they are both based on Java, which means they will rely on good garbage collection algorithms. And, as you might have heard in my talk, the biggest problem today that I see with garbage collection algorithms is the “stop-the-world” operations; especially that is evident when it comes to compaction and the way to handle fragmented heaps; and fragmented heaps are the result of a long time running Java process such as, for instance, Solr processes or HBase processes. So I definitely see GC problems coming down the road, if not already there, with these kinds of workloads.

If you look at the other workloads, like spinning-up sudden processes, applying structure to data, returning the result – it is a more dynamic, up and down spin; that is a different profile. There I do not see much garbage collection innovation helping.

But for those profiles where there are long-time running processes, we need to solve how we get all the threads to safe point, which I think is the biggest bottle neck for stop-the-world operations in garbage collectors today.

10. Obviously we've got Java 8 coming out any minute. Is there anything in Java 8 that is sort of exciting for you?

That is a trick question. There is a lot of good stuff in Java 8. I have not been able to catch up on everything yet, but I am excited for other reasons, not related to my work, around the Nashorn project that will help JavaScript a lot. And also I am excited about G1, that many people talk about, but it hasn’t really solved the problem of stop-the-world operations, so I am yet to see and get very excited about the garbage collection part of Java forward.

Charles: Great. OK. Thank you very much.

Thank you.

infoq的更多相关文章

  1. infoq 微信后台存储架构

    infoq 上微信后台存储架构 视频很是值得认真一听,大概内容摘要如下: 主要内容:同城分布式强一致,园区级容灾KV存储系统 - sync 序列号发生器      移动互联网场景下,频繁掉线重连,使用 ...

  2. infoq - neo4j graph db

    My name is Charles Humble and I am here at QCon New York 2014 with Ian Robinson. Ian, can you introd ...

  3. JavaScript MVC框架和语言总结[infoq]

    infoq关于javascript的语言和框架的总结,非常全面,值得一读. http://www.infoq.com/minibooks/emag-javascript Contents of the ...

  4. Selenium实战脚本集(3)--抓取infoq里的测试新闻

    描述 打开infoq页面,抓取最新的一些测试文章 需要抓取文章的标题和内容 如果你有个人blog的话,可以将这些文章转载到自己的blog 要求 不要在新窗口打开文章 自行了解最新的测试思潮与实践

  5. 转: 学习开源项目的若干建议(infoq)

    转: http://www.infoq.com/cn/news/2014/04/learn-open-source 学习开源项目的若干建议 作者 崔康 发布于 2014年4月11日 | 注意:GTLC ...

  6. 转:自建CDN防御DDoS(1, 2, 3)infoq

    本文中提到的要点: 1.  针对恶意流的应对方法与策略.(基本上,中级的,顶级的) 2.  IP分类的脚本 3.  前端proxy工具的选择与使用. 4.  开源日志系统的选择与比较. (http:/ ...

  7. 转: 微博的多机房部署的实践(from infoq)

    转:  http://www.infoq.com/cn/articles/weibo-multi-idc-architecture 在国内网络环境下,单机房的可靠性无法满足大型互联网服务的要求,如机房 ...

  8. 打造自己的程序员品牌(摘自Infoq)

    John Sonmez是Simple Programmer的创始人.作者与程序员,关注于如何让复杂的事情变得简单.他是一位专业的软件开发者.架构师与讲师,感兴趣的领域包括测试驱动开发.如何编写整洁的代 ...

  9. 如何使代码审查更高效【摘自InfoQ】

      代码审查者在审查代码时有非常多的东西需要关注.一个团队需要明确对于自己的项目哪些点是重要的,并不断在审查中就这些点进行检查. 人工审查代码是十分昂贵的,因此尽可能地使用自动化方式进行审查,如:代码 ...

随机推荐

  1. 如何在CentOS/RHEL & Fedora上安装MongoDB 3.2

    MongoDB(名称取自"huMONGOus")是一个有着全面灵活的索引支持和丰富的查询的数据库.MongoDB通过GridFS提供强大的媒体存储.点击这里获取MongoDB的更多 ...

  2. 【python+mysql】在python中调用mysql出问题 ImportError: No module named MySQLdb.constants

    遇到如下异常: File "C:\Users\Neil\PycharmProjects\ScrapyDouban\book\book\database.py", line 4, i ...

  3. Slave作为其它Slave的Master时使用

    主从配置需要注意的点 (1)主从服务器操作系统版本和位数一致: (2) Master和Slave数据库的版本要一致: (3) Master和Slave数据库中的数据要一致: (4) Master开启二 ...

  4. 2016.9.20 java上课作业

    此程序从命令行接收多个数字,求和之后输出

  5. 【emWin】例程七:绘制基本图形

    简介:emWin 包含完整的2-D 图形库,可供大多数应用程序充分使用 本例程介绍如何使用图形API绘制基本的2-D图形 实验指导书及代码包下载: 链接:http://pan.baidu.com/s/ ...

  6. 配合 APP 调用 JS 的一次尝试

    项目初衷 最初的场景是用户在对购物车的操作中,由于用户对购物车的每次操作(包括选择,调整数量)都需要计算商品的促销和分组的情况,而这段逻辑的计算都需要调用后端的接口,那么瓶颈来了: 请求时间长--一次 ...

  7. Ionic + AngularJS

    Ionic Framework Ionic framework is the youngest in our top 5 stack, as the alpha was released in lat ...

  8. linux-shell笔记

    1.当从windows拖到shell中无法传递文件时,多半可能没有权限,可用sudo rz来进行手动选择传递 2.连接虚拟机时,ssh 用户名@ip地址,然后会提示输入该虚拟机密码,输入密码即可连接 ...

  9. C/C++语言,自学资源,滚动更新中……

    首先要说<一本通>是一个很好的学习C/C++语言的自学教材. 以下教学视频中,缺少对"字符串"技术的讨论,大家注意看书.         一维数组,及其举例:(第四版) ...

  10. 临时存存储页面上的数据---Web存储

    HTML5 Web存储的两种方法使用 localStorage和sessionStorage 参考: http://www.cnblogs.com/taoweiji/archive/2012/12/0 ...