Overview

  • Apache Impala (incubating) is the open source, native analytic database for apache Hadoop.

Features

  • Do BI-style Queries on Hadoop:

    • low latency and high concurrency for BI/analytic queries on Hadoop(not delivered by batch frameworks such as Apache Hive).
    • scales linearly, even in multitenant environments.
  • Unify ur Infrasturecture: Utilize the same file and data formats and metadata, security, and resource management frameworks as your Hadoop deployment—no redundant infrastructure or data conversion/duplication.
  • Implement Quickly: supports SQL
  • Count on Enterprise-class Security
  • Retain Freedom from Lock-in: open-source
  • Expand the Hadoop User-verse

Architecuture

  • Circumvents MapReduce to avoid latency, directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs.
  • Some advantages:
    • Thx to local processing on data nodes, network bottlenecks are avoided.
    • A signle, open, and unified metadata store can be utilized.
    • Costly data format conversion is unnecessary and thus no overhead is incurred.
    • All data is immediately query-able, with no delays for ETL.
    • All hardware is utilized for Impala queries as well as for MR.
    • Only a single machine pool is needed to scale.

Documentation

... skip

Impala User-Defined Functions(UDFs)

  • UDF let you code ur own application logic for processing column values during an Impala query.

UDFS Concepts

  • U can code either scalar functions for producing results one row at a time.
  • Or more complex aggregate functions for doing analysis across.

UDFs and UDAFs

  • The most general kind of udf takes single input value and produces a single output value. When used in a query, it is called once for each row in the result set. eg:

    select customer_name, is_frequent_customer(customer_id) from customers;
    select obfuscate(sensitive_column) from sensitive_data;
  • A user-defined aggergate function(UDAF) accepts a group of values and returns a single value. U can use UDAFs to summarize and condense sets of rows, in the same style as the built-in COUNT, MAX(), SUM(), and AVG() functions. When called in a query that uses the GROUP BY clause, the function is called once for each combination of GROUP BY values. eg:
    -- Evaluates multiple rows but returns a single value
    select closest_restaurant(latitude, longitude) from places; -- Evaluates batches of rows and returns a separate value for each batch.
    select most_profitable_locartion(store_id, sales, expenses, tax_rate, depreciation) from franchise_data group by year;
  • Currently, Impala does not support other categories of udf, such as user-defined table functions(UDTFs) or window functions.

Native Impala UDFs

  • Impala supports UDFs written in C++, in addition to supporting existing Hive UDFs written in Java.
  • Where practical, use C++ UDFs because the compiled native code can yield higher performance, with UDF execution time often 10x faster for a C++ UDF than the equivalent Java UDF.

Using Hive UDFs with Impala

  • Impala can run Java-based user-defined functions (UDFs), originally written for Hive, with no changes, subject to the following conditions:

    • The parameter and return value must all use scalar data types supported by Impala. That's to say, complex or nested types are not supported.
    • Currently, Hive UDFs that accept or return the TIMESTAMP type are not supported.
    • Hive UDAFs and UDTFs are not supported.
    • Typically, a Java UDF will execute several times slower in Impala than the equivalent native UDF written in C++.
  • What to do next?
    • write ur udf
    • upload the jar to a hdfs path(where impala can read)
    • for each Java-based UDF that u want to call through Impala, issue a CREATE FUNCTION statement, with a LOCATION clause containing the full HDFS path or the JAR file, and a SYMBOL clause with the fully qualified name of the class, using dots as separators and without the .class extension. eg:
      create function my_neg(bigint)
      returns bigint location '/user/hive/udfs/hive.jar'
      symbol = 'org.apache.hadoop.hive.ql.udf.UDFOPNegative';
    • call the function from ur queries, passing arguments of the correct type to match the function signature.

FYI

<Impala><Overview><UDF>的更多相关文章

  1. 简单物联网:外网访问内网路由器下树莓派Flask服务器

    最近做一个小东西,大概过程就是想在教室,宿舍控制实验室的一些设备. 已经在树莓上搭了一个轻量的flask服务器,在实验室的路由器下,任何设备都是可以访问的:但是有一些限制条件,比如我想在宿舍控制我种花 ...

  2. 利用ssh反向代理以及autossh实现从外网连接内网服务器

    前言 最近遇到这样一个问题,我在实验室架设了一台服务器,给师弟或者小伙伴练习Linux用,然后平时在实验室这边直接连接是没有问题的,都是内网嘛.但是回到宿舍问题出来了,使用校园网的童鞋还是能连接上,使 ...

  3. 外网访问内网Docker容器

    外网访问内网Docker容器 本地安装了Docker容器,只能在局域网内访问,怎样从外网也能访问本地Docker容器? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动Docker容器 ...

  4. 外网访问内网SpringBoot

    外网访问内网SpringBoot 本地安装了SpringBoot,只能在局域网内访问,怎样从外网也能访问本地SpringBoot? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装Java 1 ...

  5. 外网访问内网Elasticsearch WEB

    外网访问内网Elasticsearch WEB 本地安装了Elasticsearch,只能在局域网内访问其WEB,怎样从外网也能访问本地Elasticsearch? 本文将介绍具体的实现步骤. 1. ...

  6. 怎样从外网访问内网Rails

    外网访问内网Rails 本地安装了Rails,只能在局域网内访问,怎样从外网也能访问本地Rails? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动Rails 默认安装的Rails端口 ...

  7. 怎样从外网访问内网Memcached数据库

    外网访问内网Memcached数据库 本地安装了Memcached数据库,只能在局域网内访问,怎样从外网也能访问本地Memcached数据库? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装 ...

  8. 怎样从外网访问内网CouchDB数据库

    外网访问内网CouchDB数据库 本地安装了CouchDB数据库,只能在局域网内访问,怎样从外网也能访问本地CouchDB数据库? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动Cou ...

  9. 怎样从外网访问内网DB2数据库

    外网访问内网DB2数据库 本地安装了DB2数据库,只能在局域网内访问,怎样从外网也能访问本地DB2数据库? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动DB2数据库 默认安装的DB2 ...

  10. 怎样从外网访问内网OpenLDAP数据库

    外网访问内网OpenLDAP数据库 本地安装了OpenLDAP数据库,只能在局域网内访问,怎样从外网也能访问本地OpenLDAP数据库? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动 ...

随机推荐

  1. 搭建智能合约开发环境Remix IDE及使用

    目前开发智能的IDE, 首推还是Remix, 而Remix官网, 总是由于各种各样的(网络)原因无法使用,本文就来介绍一下如何在本地搭建智能合约开发环境remix-ide并介绍Remix的使用. 写在 ...

  2. ASP.Net MVC多语言

    .NET MVC 多语言网站 通过浏览器语言首选项改变MVC的语言,通过浏览器语言选项,修改脚本语言. 一.添加资源文件 1.添加App_GlobalResources文件夹. 2.添加默认的资源文件 ...

  3. hdu-3001 三进制状态压缩+dp

    用dp来求最短路,虽然效率低,但是状态的概念方便解决最短路问题中的很多限制,也便于压缩以保存更多信息. 本题要求访问全图,且每个节点不能访问两次以上.所以用一个三进制数保存全图的访问状态(3^10,空 ...

  4. 【Java】【3】BeanUtils.copyProperties();将一个实体类的值复制到另外一个实体类

    正文: a,b为对象 BeanUtils.copyProperties(a,  b); 1,BeanUtils是org.springframework.beans.BeanUtils, a拷贝到b 2 ...

  5. NABCD框架(作业和事件的定期提醒)及第八周学习进度条

    NABCD框架(作业和事件的定期提醒): N(need,需求): 你的创意解决了用户的什么需求? 我们的创意能够一定程度上督促我们的用户(学生)尽快完成自己近期的任务或者是作业.我们认为如果增设定时提 ...

  6. TCP如何保证可靠性

    如何保证可靠性? 1.校验和.在TCP的首部中有一个占据16为的空间用来放置校验和的结果. 这是一个端到端的检验和,目的是检测数据在传输过程中的任何变化.如果收到段的检验和有差错,TCP将丢弃这个报文 ...

  7. SVN 多分支管理

    SVN 新建时可以选择性的建立三个文件夹 trunk            一般作为主开发的地方 branches       一般作为从trunk Copy过去的代码,形成分支 tags       ...

  8. Visual Studio编译时报错“函数名:重定义;不同的基类型”

    错误原因: 方法在还未声明的地方就使用了.由于使用的地方与定义的地方都是在同一个.c文件中,所以没有报未声明的错误. 解决方法: 把实现放到使用的前面,或者在include语句和宏定义后面加上函数声明 ...

  9. mysql索引简单分析

    索引对查询的速度有着至关重要的影响,理解索引也是进行数据库性能调优的起点.考虑如下情况,假设数据库中一个表有10^6条记录,DBMS的页面大小为4K,并存储100条记录.如果没有索引,查询将对整个表进 ...

  10. laravel注册行为的方法和逻辑

    public function register() { //验证: $this->validate(\request(), [ 'name' => 'required|min:3|uni ...