Environment
  VM: VMware 10
  Linux: CentOS-6.5-x86_64
  Client: Xshell4
  FTP: Xftp4
  JDK 8
  hadoop-3.1.1
  apache-hive-3.1.1

1. Execution Plans
Core idea: treat the Hive SQL as the MapReduce program it compiles to, and optimize it as such.
The following kinds of SQL are not converted into MapReduce jobs at all (see the sketch below):
  - SELECT statements that only read columns of the table itself
  - WHERE clauses that only filter on columns of the table itself
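A minimal sketch against the psn2 table used later in this section (column names taken from the table properties shown below); both statements can be answered by a plain fetch task, with no MapReduce job launched:

  -- Projects only columns of the table itself: served by a fetch task.
  select id, name from psn2;
  -- Filters only on columns of the table itself: still no MapReduce.
  select id, name from psn2 where name = 'xiaoming1';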

EXPLAIN displays the execution plan: EXPLAIN [EXTENDED] query

hive> explain select count(*) from psn2;
OK
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: psn2
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
Select Operator
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 2.7 seconds, Fetched: row(s)
hive>
hive> explain extended select count(*) from psn2;
OK
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: psn2
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Select Operator
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
null sort order:
sort order:
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
tag: -
value expressions: _col0 (type: bigint)
auto parallelism: false
Execution mode: vectorized
Path -> Alias:
hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=10 [psn2]
hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=20 [psn2]
Path -> Partition:
hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=10
Partition
base file name: age=10
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
partition values:
age
properties:
bucket_count -
collection.delim -
column.name.delimiter ,
columns id,name,likes,address
columns.comments
columns.types int:string:array<string>:map<string,string>
field.delim ,
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
line.delim
location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=10
mapkey.delim :
name default.psn2
numFiles 1
numRows 0
partition_columns age
partition_columns.types int
rawDataSize 0
serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
serialization.format ,
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 372
transient_lastDdlTime 1548986286
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
bucket_count -
bucketing_version
collection.delim -
column.name.delimiter ,
columns id,name,likes,address
columns.comments
columns.types int:string:array<string>:map<string,string>
field.delim ,
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
line.delim
location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2
mapkey.delim :
name default.psn2
partition_columns age
partition_columns.types int
serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
serialization.format ,
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
transient_lastDdlTime
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.psn2
name: default.psn2
hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=20
Partition
base file name: age=20
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
partition values:
age
properties:
bucket_count -
collection.delim -
column.name.delimiter ,
columns id,name,likes,address
columns.comments
columns.types int:string:array<string>:map<string,string>
field.delim ,
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
line.delim
location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=20
mapkey.delim :
name default.psn2
numFiles
numRows
partition_columns age
partition_columns.types int
rawDataSize
serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
serialization.format ,
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize
transient_lastDdlTime
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
bucket_count -
bucketing_version
collection.delim -
column.name.delimiter ,
columns id,name,likes,address
columns.comments
columns.types int:string:array<string>:map<string,string>
field.delim ,
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
line.delim
location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2
mapkey.delim :
name default.psn2
partition_columns age
partition_columns.types int
serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
serialization.format ,
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
transient_lastDdlTime
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.psn2
name: default.psn2
Truncated Path -> Alias:
/psn2/age= [psn2]
/psn2/age= [psn2]
Needs Tagging: false
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
GlobalTableId:
directory: hdfs://PCS102:9820/tmp/hive/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_10-58-42_159_2637812497308639143-1/-mr-10001/.hive-staging_hive_2019-02-19_10-58-42_159_2637812497308639143-1/-ext-10002
NumFilesPerFileSink:
Statistics: Num rows: Data size: Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix: hdfs://PCS102:9820/tmp/hive/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_10-58-42_159_2637812497308639143-1/-mr-10001/.hive-staging_hive_2019-02-19_10-58-42_159_2637812497308639143-1/-ext-10002/
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
columns _col0
columns.types bigint
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.escape.crlf true
serialization.format
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles:
GatherStats: false
MultiFileSpray: false
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.142 seconds, Fetched: row(s)
hive>

2. Run Modes
(1) Hive queries run in either local mode or cluster mode.
(2) Enable local mode (for tables with small amounts of data):
  set hive.exec.mode.local.auto=true;
Note:
  hive.exec.mode.local.auto.inputbytes.max defaults to 128M.
  It is the maximum input size local mode will accept; if the input is larger than this value, the query still runs in cluster mode!
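A minimal settings sketch; the third property and the default values in the comments are quoted from the Hive docs as I recall them, so verify them on your own install:

  set hive.exec.mode.local.auto=true;                      -- let Hive choose local mode automatically
  set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- input size ceiling for local mode (128M, the default)
  set hive.exec.mode.local.auto.input.files.max=4;         -- input file count ceiling for local mode (default 4)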

hive> set hive.exec.mode.local.auto=true;
hive> select count(*) from psn21;
Automatically selecting local only mode for query
Query ID = root_20190219144810_0bafff9e-1c40-45f6-b687-60c5d13c9f0c
Total jobs =
Launching Job out of
Number of reduce tasks determined at compile time:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
-- ::, Stage- map = %, reduce = %
Ended Job = job_local1827024396_0002
MapReduce Jobs Launched:
Stage-Stage-: HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: msec
OK
_c0
Time taken: 1.376 seconds, Fetched: row(s)
hive> set hive.exec.mode.local.auto=false;
hive> select count(*) from psn21;
Query ID = root_20190219144841_6fd11106-5db1--8b0b-884697b558df
Total jobs =
Launching Job out of
Number of reduce tasks determined at compile time:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1548397153910_0013, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0013/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job -kill job_1548397153910_0013
Hadoop job information for Stage-: number of mappers: ; number of reducers:
-- ::, Stage- map = %, reduce = %
-- ::, Stage- map = %, reduce = %, Cumulative CPU 2.87 sec
-- ::, Stage- map = %, reduce = %, Cumulative CPU 6.28 sec
MapReduce Total cumulative CPU time: seconds msec
Ended Job = job_1548397153910_0013
MapReduce Jobs Launched:
Stage-Stage-: Map: Reduce: Cumulative CPU: 6.28 sec HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: seconds msec
OK
_c0
Time taken: 18.923 seconds, Fetched: row(s)
hive>

3. Parallel Execution
Enable parallel mode with the following setting (local mode must be turned off):
set hive.exec.parallel=true;

Related: hive.exec.parallel.thread.number is the maximum number of jobs allowed to run in parallel within one SQL statement.
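A minimal sketch; the default of 8 for the thread number is quoted from the Hive docs as I recall them:

  set hive.exec.parallel=true;             -- run independent stages of one query concurrently
  set hive.exec.parallel.thread.number=8;  -- cap on jobs executed in parallel per query (default 8)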

hive> set hive.exec.parallel;
hive.exec.parallel=false
hive> select t1.cnt1,t2.cnt2 from
> (select count(id) cnt1 from psn21) t1,
> (select count(name) cnt2 from psn21)t2;
Warning: Map Join MAPJOIN[][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
Warning: Map Join MAPJOIN[][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
Warning: Shuffle Join JOIN[][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
Query ID = root_20190219145608_b4f3d4e9-b858-41be-9ddc-6eccff0ec9d9
Total jobs =
Launching Job 1 out of 5
Number of reduce tasks determined at compile time:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1548397153910_0014, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0014/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job -kill job_1548397153910_0014
Hadoop job information for Stage-: number of mappers: ; number of reducers:
-- ::, Stage- map = %, reduce = %
-- ::, Stage- map = %, reduce = %, Cumulative CPU 2.85 sec
-- ::, Stage- map = %, reduce = %, Cumulative CPU 6.04 sec
MapReduce Total cumulative CPU time: seconds msec
Ended Job = job_1548397153910_0014
Launching Job 2 out of 5
Number of reduce tasks determined at compile time:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1548397153910_0015, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0015/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job -kill job_1548397153910_0015
Hadoop job information for Stage-: number of mappers: ; number of reducers:
-- ::, Stage- map = %, reduce = %
-- ::, Stage- map = %, reduce = %, Cumulative CPU 2.8 sec
-- ::, Stage- map = %, reduce = %, Cumulative CPU 5.92 sec
MapReduce Total cumulative CPU time: seconds msec
Ended Job = job_1548397153910_0015
Stage- is selected by condition resolver.
Stage- is filtered out by condition resolver.
Stage- is filtered out by condition resolver.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
-- :: Starting to launch local task to process map join; maximum memory =
-- :: Dump the side-table for tag: with group count: into file: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019--19_14--08_997_6748376838876035123-/-local-/HashTable-Stage-/MapJoin-mapfile01--.hashtable
-- :: Uploaded File to: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019--19_14--08_997_6748376838876035123-/-local-/HashTable-Stage-/MapJoin-mapfile01--.hashtable ( bytes)
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to since there's no reduce operator
Starting Job = job_1548397153910_0016, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0016/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job -kill job_1548397153910_0016
Hadoop job information for Stage-: number of mappers: ; number of reducers:
-- ::, Stage- map = %, reduce = %
-- ::, Stage- map = %, reduce = %, Cumulative CPU 3.07 sec
MapReduce Total cumulative CPU time: seconds msec
Ended Job = job_1548397153910_0016
MapReduce Jobs Launched:
Stage-Stage-: Map: Reduce: Cumulative CPU: 6.04 sec HDFS Read: HDFS Write: SUCCESS
Stage-Stage-: Map: Reduce: Cumulative CPU: 5.92 sec HDFS Read: HDFS Write: SUCCESS
Stage-Stage-: Map: Cumulative CPU: 3.07 sec HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: seconds msec
OK
t1.cnt1 t2.cnt2
Time taken: 59.527 seconds, Fetched: row(s)
hive> set hive.exec.parallel=true;
hive> (select count(name) cnt2 from psn21)t2;
FAILED: ParseException line : extraneous input 't2' expecting EOF near '<EOF>'
hive> select t1.cnt1,t2.cnt2 from
> (select count(id) cnt1 from psn21) t1,
> (select count(name) cnt2 from psn21)t2;
Warning: Map Join MAPJOIN[][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
Warning: Map Join MAPJOIN[][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
Warning: Shuffle Join JOIN[][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
Query ID = root_20190219145918_2f98437b--41a4-905b-4c6a3a160d46
Total jobs =
Launching Job 1 out of 5
Launching Job 2 out of 5
Number of reduce tasks determined at compile time:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Number of reduce tasks determined at compile time:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1548397153910_0018, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0018/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job -kill job_1548397153910_0018
Starting Job = job_1548397153910_0017, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0017/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job -kill job_1548397153910_0017
Hadoop job information for Stage-: number of mappers: ; number of reducers:
-- ::, Stage- map = %, reduce = %
-- ::, Stage- map = %, reduce = %, Cumulative CPU 3.0 sec
-- ::, Stage- map = %, reduce = %, Cumulative CPU 6.25 sec
MapReduce Total cumulative CPU time: seconds msec
Ended Job = job_1548397153910_0018
Hadoop job information for Stage-: number of mappers: ; number of reducers:
-- ::, Stage- map = %, reduce = %
-- ::, Stage- map = %, reduce = %, Cumulative CPU 2.74 sec
-- ::, Stage- map = %, reduce = %, Cumulative CPU 5.9 sec
MapReduce Total cumulative CPU time: seconds msec
Ended Job = job_1548397153910_0017
Stage- is selected by condition resolver.
Stage- is filtered out by condition resolver.
Stage- is filtered out by condition resolver.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
-- :: Dump the side-table for tag: with group count: into file: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019--19_14--18_586_8660726948780795909-/-local-/HashTable-Stage-/MapJoin-mapfile21--.hashtable
-- :: Uploaded File to: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019--19_14--18_586_8660726948780795909-/-local-/HashTable-Stage-/MapJoin-mapfile21--.hashtable ( bytes)
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to since there's no reduce operator
Starting Job = job_1548397153910_0019, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0019/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job -kill job_1548397153910_0019
Hadoop job information for Stage-: number of mappers: ; number of reducers:
-- ::, Stage- map = %, reduce = %
-- ::, Stage- map = %, reduce = %, Cumulative CPU 2.95 sec
MapReduce Total cumulative CPU time: seconds msec
Ended Job = job_1548397153910_0019
MapReduce Jobs Launched:
Stage-Stage-: Map: Reduce: Cumulative CPU: 6.25 sec HDFS Read: HDFS Write: SUCCESS
Stage-Stage-: Map: Reduce: Cumulative CPU: 5.9 sec HDFS Read: HDFS Write: SUCCESS
Stage-Stage-: Map: Cumulative CPU: 2.95 sec HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: seconds msec
OK
t1.cnt1 t2.cnt2
Time taken: 64.206 seconds, Fetched: row(s)
hive>

4. Strict Mode
Enable strict mode with the following setting:
set hive.mapred.mode=strict;
(the default is nonstrict)

Query restrictions:
1. For a partitioned table, the WHERE clause must filter on the partition column(s);

hive> set hive.mapred.mode=nonstrict;
hive> select * from psn22;
OK
psn22.id psn22.name psn22.likes psn22.address psn22.age psn22.sex
小明1 ["lol","book","movie"] {"beijing":"shangxuetang","shanghai":"pudong"} boy
小明2 ["lol","book","movie"] {"beijing":"shangxuetang","shanghai":"pudong"} man
小明5 ["lol","book","movie"] {"beijing":"shangxuetang","shanghai":"pudong"} boy
小明3 ["lol","book","movie"] {"beijing":"shangxuetang","shanghai":"pudong"} boy
小明6 ["lol","book","movie"] {"beijing":"shangxuetang","shanghai":"pudong"} man
小明4 ["lol","book","movie"] {"beijing":"shangxuetang","shanghai":"pudong"} man
Time taken: 0.186 seconds, Fetched: row(s)
hive> set hive.mapred.mode=strict;
hive> select * from psn22;
FAILED: SemanticException [Error ]: Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.no.partition.filter to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features. No partition predicate for Alias "psn22" Table "psn22"
hive> select * from psn22 where age= and sex='boy';
OK
psn22.id psn22.name psn22.likes psn22.address psn22.age psn22.sex
小明1 ["lol","book","movie"] {"beijing":"shangxuetang","shanghai":"pudong"} boy
Time taken: 0.282 seconds, Fetched: row(s)
hive>

2. An ORDER BY statement must include a LIMIT clause;

hive> set hive.mapred.mode=strict;
hive> select * from psn21 order by id;
FAILED: SemanticException : Order by-s without limit are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'id'
hive> select * from psn21 order by id limit ;
Automatically selecting local only mode for query
Query ID = root_20190219143842_b465a76f-a890-4bdc-aa76-b713c3ea13c0
Total jobs =
Launching Job out of
Number of reduce tasks determined at compile time:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
-- ::, Stage- map = %, reduce = %
Ended Job = job_local1585589360_0001
MapReduce Jobs Launched:
Stage-Stage-: HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: msec
OK
psn21.id psn21.name psn21.age psn21.sex psn21.likes psn21.address
小明1 boy ["lol","book","movie"] {"beijing":"shangxuetang","shanghai":"pudong"}
小明2 man ["lol","book","movie"] {"beijing":"shangxuetang","shanghai":"pudong"}
Time taken: 1.89 seconds, Fetched: row(s)
hive>

3. Queries that would execute a Cartesian product are blocked.
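For illustration, a sketch using the psn21 table from above; a join with no ON condition is a Cartesian product, so strict mode rejects the first statement and accepts the second:

  -- Rejected in strict mode: every row pairs with every row.
  select * from psn21 a join psn21 b;
  -- Accepted: the equi-join condition bounds the result.
  select * from psn21 a join psn21 b on a.id = b.id;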

5. Sorting in Hive
- Order By: fully orders the query result; only one reducer is allowed
  (use with care on large data sets; in strict mode it must be combined with LIMIT)
- Sort By: sorts the data within each individual reducer
- Distribute By: distributes rows among reducers; often combined with Sort By
- Cluster By: equivalent to Distribute By + Sort By on the same column
  (Cluster By cannot take an asc/desc sort direction; to control direction, use
  distribute by column sort by column asc|desc instead; see the sketch below)
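A sketch of the four forms against the psn21 table used earlier (the id and sex columns appear in its output above):

  select * from psn21 order by id;                       -- global order, forced through one reducer
  select * from psn21 sort by id;                        -- ordered within each reducer only
  select * from psn21 distribute by sex sort by id;      -- route rows by sex, then sort per reducer
  select * from psn21 cluster by id;                     -- same as distribute by id sort by id
  select * from psn21 distribute by id sort by id desc;  -- "cluster by" with an explicit direction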

6. Hive Join (avoiding the MapReduce shuffle)
When computing a join, put the small table (the driving table) on the left side of the join.
Map Join: the join is completed entirely on the map side.
Two ways to get one:
1. SQL hint: add a MapJoin marker (mapjoin hint) to the statement
Syntax:
SELECT /*+ MAPJOIN(smallTable) */ smallTable.key, bigTable.value
FROM smallTable JOIN bigTable ON smallTable.key = bigTable.key;
2. Automatic map join
Enable it with the following setting:
set hive.auto.convert.join=true;
(when true, Hive gathers statistics on the left-hand table; if it qualifies as a small table, it is loaded into memory and a map join is used)

Related configuration parameters (see the settings sketch below):
hive.mapjoin.smalltable.filesize
(threshold separating small tables from large ones; a table smaller than this value is loaded into memory)
hive.ignore.mapjoin.hint
(default: true; whether to ignore mapjoin hints, i.e. the MAPJOIN marker)
hive.auto.convert.join.noconditionaltask
(default: true; when converting ordinary joins to map joins, whether to merge multiple map joins into a single one)
hive.auto.convert.join.noconditionaltask.size
(maximum combined table size when merging multiple map joins into one)
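A minimal settings sketch; the default values in the comments are quoted from the Hive docs as I recall them, so treat them as assumptions to verify:

  set hive.auto.convert.join=true;                             -- convert qualifying joins to map joins
  set hive.mapjoin.smalltable.filesize=25000000;               -- "small table" threshold (~25MB is the usual default)
  set hive.auto.convert.join.noconditionaltask=true;           -- allow merging several map joins into one
  set hive.auto.convert.join.noconditionaltask.size=10000000;  -- combined size cap for that merge (~10MB default)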

7. Map-Side Aggregation (the equivalent of a combiner in MapReduce)
Enable aggregation on the map side with the following setting:
set hive.map.aggr=true;

Related configuration parameters (see the sketch below):
hive.groupby.mapaggr.checkinterval:
  number of rows a map-side GROUP BY aggregates per batch (default: 100000)
hive.map.aggr.hash.min.reduction:
  minimum reduction ratio required to keep aggregating (Hive pre-aggregates 100000 rows; if (rows after aggregation) / 100000 exceeds this value, default 0.5, map-side aggregation is abandoned)
hive.map.aggr.hash.percentmemory:
  maximum memory the map-side aggregation hash table may use
hive.map.aggr.hash.force.flush.memory.threshold:
  maximum memory available to the hash table during map-side aggregation; exceeding it triggers a flush
hive.groupby.skewindata:
  whether to optimize for data skew produced by GROUP BY (default: false)
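A minimal settings sketch; the values shown are the defaults as I recall them from the Hive docs, except skewindata, which is flipped on here for illustration:

  set hive.map.aggr=true;                         -- combiner-style aggregation on the map side
  set hive.groupby.mapaggr.checkinterval=100000;  -- rows per pre-aggregation batch
  set hive.map.aggr.hash.min.reduction=0.5;       -- abandon map-side aggregation below this reduction
  set hive.groupby.skewindata=true;               -- extra job to even out group-by skew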

8. Controlling the Number of Maps and Reduces in Hive
Parameters related to the number of maps (see the sketch below):
mapred.max.split.size
  maximum size of one split, i.e. the most data a single map task processes
mapred.min.split.size.per.node
  minimum split size on a single node
mapred.min.split.size.per.rack
  minimum split size within a single rack

Parameters related to the number of reduces:
mapred.reduce.tasks
  force a fixed number of reduce tasks
hive.exec.reducers.bytes.per.reducer
  amount of data each reduce task processes
hive.exec.reducers.max
  maximum number of reduce tasks per job
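A sketch with illustrative values (examples, not recommendations; tune them to your cluster):

  set mapred.max.split.size=256000000;                 -- ~256MB cap per split, so per map task
  set mapred.min.split.size.per.node=128000000;        -- combine small files on one node up to ~128MB
  set mapred.min.split.size.per.rack=128000000;        -- combine small files within one rack up to ~128MB
  set hive.exec.reducers.bytes.per.reducer=256000000;  -- aim for one reducer per ~256MB of input
  set hive.exec.reducers.max=100;                      -- never launch more than 100 reducers
  set mapred.reduce.tasks=10;                          -- or force an exact count (overrides the estimate)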

9. Hive JVM Reuse
Applicable scenarios:
1. a very large number of small files
2. a very large number of tasks

Enable it with set mapred.job.reuse.jvm.num.tasks=n; (n is the number of task slots).
Drawback: once enabled, the task slots hold on to their resources whether or not a task is running,
and nothing is released until every task, i.e. the whole job, has finished!
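A sketch; on Hadoop 2+ the property is usually spelled mapreduce.job.jvm.numtasks, with mapred.job.reuse.jvm.num.tasks as the older alias (treat that mapping as an assumption to verify on your cluster):

  set mapreduce.job.jvm.numtasks=10;  -- each JVM runs up to 10 tasks before exiting (-1 = unlimited reuse)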
