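Pentaho Mondrian's Support for Hive

Requirements

We would like to run MDX queries directly against Big Data back ends such as Hive or Impala, so this article investigates how well Mondrian supports Hive.

Environment Setup

Hive: hive-0.10-cdh4.2.1. Libraries used by the client program: mondrian-3.6.0 and olap4j-1.2.0-SNAPSHOT.

Data Preparation

The data comes from a sample data set found online and consists of four tables:

Customer - customer dimension table

Product - product dimension table

ProductType - product type dimension table

Sale - sales fact table

To make it easy to verify both the test data and the MDX, a copy of the data is also loaded into MySQL so that its query results can be compared with Hive's.

Creating the Tables and Data in MySQL

The SQL statements: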

create database hive_test;

use hive_test;

/** Customer table */

create table Customer (

       cusId int not null,

       gender char(1) null,

       constraint PK_CUSTOMER primary key(cusId)

);

/** Product table */

create table Product (

       proId int not null,

       proTypeId int null,

       proName varchar(32) null,

       constraint PK_PRODUCT primary key(proId)

);

/** Product type table */

create table ProductType (

       proTypeId int not null,

       proTypeName varchar(32) null,

       constraint PK_PRODUCTTYPE primary key (proTypeId)

);

/** Sales record table */

create table Sale (

       saleId int not null,

       proId int null,

       cusId int null,

       unitPrice float null,

       number int null,

       constraint PK_SALE primary key(saleId)

);

insert into Customer(cusId,gender) values(1,'F');

insert into Customer(cusId,gender) values(2,'M');

insert into Customer(cusId,gender) values(3,'M');

insert into Customer(cusId,gender) values(4,'F');

insert into ProductType(proTypeId,proTypeName) values(1,'electrical');

insert into ProductType(proTypeId,proTypeName) values(2,'digital');

insert into ProductType(proTypeId,proTypeName) values(3,'furniture');

insert into Product(proId,proTypeId,proName) values(1,1,'washing machine');

insert into Product(proId,proTypeId,proName) values(2,1,'television');

insert into Product(proId,proTypeId,proName) values(3,2,'mp3');

insert into Product(proId,proTypeId,proName) values(4,2,'mp4');

insert into Product(proId,proTypeId,proName) values(5,2,'camera');

insert into Product(proId,proTypeId,proName) values(6,3,'chair');

insert into Product(proId,proTypeId,proName) values(7,3,'desk');

insert into Sale(saleId,proId,cusId,unitPrice,number) values(1,1,1,340.34,2);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(2,1,2,140.34,1);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(3,2,3,240.34,3);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(4,3,4,540.34,4);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(5,4,1,80.34,5);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(6,5,2,90.34,26);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(7,6,3,140.34,7);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(8,7,4,640.34,28);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(9,6,1,140.34,29);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(10,7,2,740.34,29);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(11,5,3,30.34,28);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(12,4,4,1240.34,72);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(13,3,1,314.34,27);

insert into Sale(saleId,proId,cusId,unitPrice,number) values(14,3,2,45.34,27);

Preparing the Test Data in Hive

The Hive test environment runs in a virtual machine and uses hive-0.10-cdh4.2.1. The statements are:

create database mondrian;

use mondrian;

create table Sale (saleId INT, proId INT, cusId INT, unitPrice FLOAT, number INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

create table Product (proId INT, proTypeId INT, proName STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

create table ProductType (proTypeId INT, proTypeName STRING)   ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

create table Customer (cusId INT, gender STRING)  ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

# Customer data file

1,F

2,M

3,M

4,F

load data local inpath "/home/hzwangxx/cdh4/hive/myTmp/Customer" OVERWRITE into table Customer;

# ProductType data file

1,electrical

2,digital

3,furniture

load data local inpath "/home/hzwangxx/cdh4/hive/myTmp/ProductType" into table ProductType;

# Product data file

1,1,washing machine

2,1,television

3,2,mp3

4,2,mp4

5,2,camera

6,3,chair

7,3,desk

load data local inpath "/home/hzwangxx/cdh4/hive/myTmp/Product" into table Product;

# Sale data file

1,1,1,340.34,2

2,1,2,140.34,1

3,2,3,240.34,3

4,3,4,540.34,4

5,4,1,80.34,5

6,5,2,90.34,26

7,6,3,140.34,7

8,7,4,640.34,28

9,6,1,140.34,29

10,7,2,740.34,29

11,5,3,30.34,28

12,4,4,1240.34,72

13,3,1,314.34,27

14,3,2,45.34,27

load data local inpath "/home/hzwangxx/cdh4/hive/myTmp/Sale" into table Sale;

Metadata Definition

The Cube, Measure, and other metadata are defined as follows:

<Schema name="hello">

<Cube name="Sales">

<!--  Fact table  -->

<Table name="Sale"/>

<!--  Customer dimension  -->

<Dimension name="cusGender" foreignKey="cusId">

<Hierarchy hasAll="true" allMemberName="allGender" primaryKey="cusId">

<Table name="Customer"/>

<Level name="gender" column="gender"/>

</Hierarchy>

</Dimension>

<!--  Product type dimension  -->

<Dimension name="proType" foreignKey="proId">

<Hierarchy hasAll="true" allMemberName="allPro" primaryKey="proId" primaryKeyTable="Product">

<Join leftKey="proTypeId" rightKey="proTypeId">

<Table name="Product"/>

<Table name="ProductType"/>

</Join>

<Level name="proTypeId" column="proTypeId" nameColumn="proTypeName" uniqueMembers="true" table="ProductType"/>

<Level name="proId" column="proId" nameColumn="proName" uniqueMembers="true" table="Product"/>

</Hierarchy>

</Dimension>

<Measure name="numb" column="number" aggregator="sum" datatype="Numeric"/>

<Measure name="totalSale" aggregator="sum" formatString="$ #,##0.00">

<!--  Column whose value is unitPrice*number  -->

<MeasureExpression>

<SQL dialect="generic">unitPrice*number</SQL>

</MeasureExpression>

</Measure>

<CalculatedMember name="averPri" dimension="Measures">

<Formula>[Measures].[totalSale] / [Measures].[numb]</Formula>

<CalculatedMemberProperty name="FORMAT_STRING" value="$ #,##0.00"/>

</CalculatedMember>

</Cube>

</Schema>

Test MDX

1. Query the total number of units sold, the average price, and the total sales amount across all product types

"select " + "{[Measures].[numb],[Measures].[averPri],[Measures].[totalSale]} on columns," + "{([proType].[allPro],[cusGender].[allGender])} " + "on rows " + "from [Sales]"
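When comparing the Hive and MySQL results for this query, it helps to know the values to expect. The following is a minimal sketch that recomputes the three measures directly from the 14 Sale rows listed earlier; the class name ExpectedSalesTotals is made up for illustration. It uses BigDecimal, whereas the tables store unitPrice as FLOAT, so the database results may differ slightly in the last digits.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper: recomputes numb, totalSale and averPri from the sample Sale rows.
final class ExpectedSalesTotals {
    // unitPrice and number columns, copied row by row from the Sale data above.
    private static final String[] UNIT_PRICES = {
        "340.34", "140.34", "240.34", "540.34", "80.34", "90.34", "140.34",
        "640.34", "140.34", "740.34", "30.34", "1240.34", "314.34", "45.34"
    };
    private static final int[] NUMBERS = {2, 1, 3, 4, 5, 26, 7, 28, 29, 29, 28, 72, 27, 27};

    private ExpectedSalesTotals() {}

    // [Measures].[numb]: sum(number)
    static int totalNumber() {
        int sum = 0;
        for (int n : NUMBERS) {
            sum += n;
        }
        return sum;
    }

    // [Measures].[totalSale]: sum(unitPrice * number)
    static BigDecimal totalSale() {
        BigDecimal sum = BigDecimal.ZERO;
        for (int i = 0; i < NUMBERS.length; i++) {
            sum = sum.add(new BigDecimal(UNIT_PRICES[i]).multiply(BigDecimal.valueOf(NUMBERS[i])));
        }
        return sum;
    }

    // [Measures].[averPri]: totalSale / numb, rounded to two decimals like the format string
    static BigDecimal averagePrice() {
        return totalSale().divide(BigDecimal.valueOf(totalNumber()), 2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println("numb      = " + totalNumber());
        System.out.println("totalSale = " + totalSale());
        System.out.println("averPri   = " + averagePrice());
    }
}
```

For the sample data this gives numb = 288, totalSale = 150770.92 and averPri = 523.51.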

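Support for Hive

Establishing a Connection

There are two ways to establish a connection:

Obtaining a Connection instance with Mondrian's own DriverManager

This approach uses the API that ships with Mondrian: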

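        // Connection, DriverManager, Query, and Result here are all API types provided by Mondrian:
        // import mondrian.olap.Connection;
        // import mondrian.olap.DriverManager;
        // import mondrian.olap.Query;
        // import mondrian.olap.Result;
        // import java.io.PrintWriter;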

        Connection connection = DriverManager.getConnection(

                "Provider=mondrian;" +

                "Jdbc=jdbc:hive2://node02:10000/mondrian;" +

                "JdbcUser=;JdbcPassword=;" +

                "Catalog=/Users/apple/IdeaProjects/hbase-manage/src/main/resources/MiniMart.xml;" +

                "JdbcDrivers=org.apache.hive.jdbc.HiveDriver", null);

        Query query = connection.parseQuery(

                "select \n" +

                        "{[Measures].[numb],[Measures].[averPri],[Measures].[totalSale]} on columns,\n" +

                        "{([proType].[allPro],[cusGender].[allGender])} \n" +

                        "on rows\n" +

                        "from [Sales]\n");

        @SuppressWarnings("deprecation")

        Result result = connection.execute(query);

        PrintWriter pw = new PrintWriter(System.out);

        result.print(pw);

        pw.flush();

To connect to MySQL instead, simply replace the connect string passed to getConnection with the following:

Connection connection =  DriverManager.getConnection(

                "Provider=mondrian;" +

                        "Jdbc=jdbc:mysql://localhost:3306/hive_test; JdbcUser=root;" +

                        "JdbcPassword=123;" +

                        "Catalog=/Users/apple/IdeaProjects/hbase-manage/src/main/resources/MiniMart.xml;" +

                        "JdbcDrivers=com.mysql.jdbc.Driver", null);
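The Hive and MySQL connect strings differ only in the JDBC URL, the credentials, and the driver class. As a small sketch, the string can be assembled from those parts; MondrianConnectStrings and its build method are hypothetical helpers written for this article, not part of Mondrian, and the sketch assumes that none of the values contains a semicolon.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Mondrian class): assembles the key=value;... connect string.
final class MondrianConnectStrings {
    private MondrianConnectStrings() {}

    static String build(String jdbcUrl, String user, String password,
                        String catalog, String driver) {
        // LinkedHashMap keeps the properties in the same order as the examples above.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Provider", "mondrian");
        props.put("Jdbc", jdbcUrl);
        props.put("JdbcUser", user);
        props.put("JdbcPassword", password);
        props.put("Catalog", catalog);
        props.put("JdbcDrivers", driver);

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String catalog = "/Users/apple/IdeaProjects/hbase-manage/src/main/resources/MiniMart.xml";
        System.out.println(build("jdbc:hive2://node02:10000/mondrian", "", "",
                catalog, "org.apache.hive.jdbc.HiveDriver"));
        System.out.println(build("jdbc:mysql://localhost:3306/hive_test", "root", "123",
                catalog, "com.mysql.jdbc.Driver"));
    }
}
```

The resulting string is what gets passed as the first argument of getConnection in the examples above.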

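Connecting to MySQL during testing worked without problems, but using the same API against Hive did not. Reading the downloaded source code shows what happens: Mondrian first tries to take a Connection instance from its connection pool; if none is available, it creates one through a Factory and puts it into the pool. When Mondrian creates the Factory it sets two properties, autoCommit and readOnly. RDBMS drivers handle both without trouble, but the HiveConnection class in Hive's JDBC driver implements the two setters in an odd way: setReadOnly always throws an exception, and setAutoCommit throws when asked to enable autocommit: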

  public void setReadOnly(boolean readOnly) throws SQLException {

    // TODO Auto-generated method stub

    throw new SQLException("Method not supported");

  }

  public void setAutoCommit(boolean autoCommit) throws SQLException {

    if (autoCommit) {

      throw new SQLException("enabling autocommit is not supported");

    }

  }

After commenting out the two throw statements and rebuilding the jar, the MDX executes successfully.
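An alternative to patching the Hive jar, not tried in this article, would be to wrap the Hive connection so that the two failing setters are ignored. The sketch below uses only the JDK's java.lang.reflect.Proxy; TolerantConnections, hiveLikeStub, and accepts are hypothetical names, and the stub only imitates the HiveConnection behaviour quoted above. How the wrapped connection would be handed to Mondrian (for example through a custom DataSource) is an assumption and is not shown.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical helper: wraps a Connection and swallows SQLExceptions from two setters.
final class TolerantConnections {
    private TolerantConnections() {}

    // Returns a proxy that forwards every call to delegate, but ignores a
    // SQLException thrown by setReadOnly or setAutoCommit.
    static Connection tolerant(Connection delegate) {
        return newProxy((proxy, method, args) -> {
            try {
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                Throwable cause = e.getCause();
                String name = method.getName();
                boolean ignorable = name.equals("setReadOnly") || name.equals("setAutoCommit");
                if (ignorable && cause instanceof SQLException) {
                    return null; // both setters return void
                }
                throw cause; // everything else propagates unchanged
            }
        });
    }

    // Stand-in that imitates the HiveConnection setters quoted above (assumption:
    // only these two methods matter for the demonstration).
    static Connection hiveLikeStub() {
        return newProxy((proxy, method, args) -> {
            String name = method.getName();
            if (name.equals("setReadOnly")) {
                throw new SQLException("Method not supported");
            }
            if (name.equals("setAutoCommit")) {
                if (Boolean.TRUE.equals(args[0])) {
                    throw new SQLException("enabling autocommit is not supported");
                }
                return null;
            }
            throw new UnsupportedOperationException(name);
        });
    }

    // True when both setters can be called without a SQLException.
    static boolean accepts(Connection conn) {
        try {
            conn.setReadOnly(true);
            conn.setAutoCommit(true);
            return true;
        } catch (SQLException e) {
            return false;
        }
    }

    private static Connection newProxy(InvocationHandler handler) {
        return (Connection) Proxy.newProxyInstance(
                TolerantConnections.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                handler);
    }

    public static void main(String[] args) {
        System.out.println("raw stub accepts setters:     " + accepts(hiveLikeStub()));
        System.out.println("wrapped stub accepts setters: " + accepts(tolerant(hiveLikeStub())));
    }
}
```

Note that ignoring setReadOnly means the connection is not actually marked read-only; for MDX queries, which only read data, that is probably acceptable.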

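Obtaining a Connection instance with the JDK's native DriverManager

Alternatively, obtain a Connection through the JDK's native DriverManager, unwrap it into an olap4j OlapConnection, and then execute the MDX. A connection example: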

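 // Types used below:
 // import java.io.PrintWriter;
 // import java.sql.Connection;
 // import java.sql.DriverManager;
 // import org.olap4j.CellSet;
 // import org.olap4j.OlapConnection;
 // import org.olap4j.OlapStatement;
 // import org.olap4j.layout.RectangularCellSetFormatter;

 Class.forName("mondrian.olap4j.MondrianOlap4jDriver");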

 Connection nativeConn = DriverManager.getConnection("jdbc:mondrian:Jdbc=jdbc:hive2://node02:10000/mondrian; JdbcUser=;" +

         "JdbcPassword=;" +

         "Catalog=/Users/apple/IdeaProjects/hbase-manage/src/main/resources/MiniMart.xml;" +

         "JdbcDrivers=org.apache.hive.jdbc.HiveDriver");

 OlapConnection olapConn = nativeConn.unwrap(OlapConnection.class);

 if (olapConn == null) {

     throw new IllegalStateException("Connection is null");

 }

 OlapStatement statement = olapConn.createStatement();

 CellSet cellSet = statement.executeOlapQuery("select " +

         "{[Measures].[numb],[Measures].[averPri],[Measures].[totalSale]} on columns," +

         "{([proType].[allPro],[cusGender].[allGender])} " +

         "on rows " +

         "from [Sales]") ;

 //formatter.

 RectangularCellSetFormatter formatter =

         new RectangularCellSetFormatter(false);

 // Print out.

 PrintWriter writer = new PrintWriter(System.out);

 formatter.format(cellSet, writer);

 writer.flush();

 statement.close();

 olapConn.close();

 nativeConn.close();

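Specifying the Database

Like an RDBMS, Hive has the concept of a database. With Hive's ordinary Java API, however, even when a database is specified in the connection string, the database actually used is not the specified one but the one last used by the current client or thread (note: not default). A Hive client therefore generally has to execute use database first. Neither OlapConnection nor the Connection provided by Mondrian supports a "use database" operation. As a temporary workaround, before each MDX query, execute use database through the ordinary Java API to switch to the database that is to be queried.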
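The workaround can be sketched as follows. HiveDatabaseSelector is a hypothetical helper name; it assumes an already opened plain Hive JDBC connection and only accepts simple identifiers, so that the database name can be safely concatenated into the statement.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.regex.Pattern;

// Hypothetical helper: issues "use <database>" over a plain JDBC connection.
final class HiveDatabaseSelector {
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private HiveDatabaseSelector() {}

    // True when the name consists only of letters, digits and underscores.
    static boolean isPlainIdentifier(String database) {
        return database != null && IDENTIFIER.matcher(database).matches();
    }

    // Builds the statement text, rejecting anything that is not a plain identifier.
    static String useStatement(String database) {
        if (!isPlainIdentifier(database)) {
            throw new IllegalArgumentException("Not a plain database identifier: " + database);
        }
        return "use " + database;
    }

    // Switches the current database on the given plain JDBC connection.
    static void useDatabase(Connection hiveConnection, String database) throws SQLException {
        try (Statement stmt = hiveConnection.createStatement()) {
            stmt.execute(useStatement(database));
        }
    }

    public static void main(String[] args) {
        // Only prints the statement; running it requires a live Hive connection.
        System.out.println(useStatement("mondrian"));
    }
}
```

Calling useDatabase(conn, "mondrian") just before each MDX query corresponds to the workaround described above, where conn is a connection opened with the ordinary Hive JDBC driver.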