----------------------------------------
博文作者:迦壹
转载声明:可以转载, 但必须以超链接形式标 明文章原始出处和作者信息及版权声明,谢谢合作!
---------------------------------------
 
目录:
  一、Nginx的相关配置,以及测试数据
  二、Hadoop中创建使用目录
  三、Hive的日志表创建,并同步到Hbase中
    1)创建自增长ID的类,并添加到Hive环境中
    2)在Hive中创建日志表
    3)将Nginx的日志数据导入到Hive日志表中
      方法一:将本地目录中的文件导入到hive中
      方法二:从HDFS导入数据
      方法三:通过如Flume、Kafka一类的工具监控日志数据,自动导入
    4)在Hive中创建Hbase可以识别的表
    5)将Hive的日志表,使用自增长Id做为rowkey导入到Hbase可识别的表中
    6)到Hbase中验证效果使用get、scan、filter获取指定数据
  四、FAQ
  五、扩展阅读
 
  一、Nginx的相关配置,以及测试数据
  #Nginx的Conf中log格式化部分的内容
1
2
3
4
5
log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                                '$status $body_bytes_sent "$http_referer" '
                                '"$http_user_agent" "$http_x_forwarded_for" '
                                '"$gzip_ratio" $request_time $bytes_sent $request_length '
                                '"$upstream_addr" $upstream_status $upstream_response_time';
  #Nginx生成后部分格式如下(测试数据)
1
2
3
4
5
6
7
8
9
10
11
12
13
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET / HTTP/1.1" 200 2373 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "2.78" 0.004 2683 369 "unix:/var/run/php5-fpm.sock" 200 0.004
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/css/base.css HTTP/1.1" 200 940 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 1247 373 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/a3e2e507/jquery.min.js HTTP/1.1" 200 93636 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.152 93976 359 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/logo.png HTTP/1.1" 200 6059 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 6369 377 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p02.jpg HTTP/1.1" 200 22177 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 22489 376 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p03.png HTTP/1.1" 200 3012 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 3321 376 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/two-dimension-code1.png HTTP/1.1" 200 761 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 1069 392 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/bg.png HTTP/1.1" 200 11474 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 11785 375 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p04.png HTTP/1.1" 200 2860 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.000 3169 376 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p06.png HTTP/1.1" 200 74097 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.062 74409 376 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:45 +0800] "GET 8.8.8.8/b519d8ca/image/p05.png HTTP/1.1" 200 132072 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.256 132385 376 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:46 +0800] "GET 8.8.8.8/b519d8ca/image/p07.png HTTP/1.1" 200 207987 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.592 208300 376 "-" - -
8.8.8.8 - - [22/Aug/2014:20:23:46 +0800] "GET 8.8.8.8/b519d8ca/image/p01.png HTTP/1.1" 200 310418 "http://xxx.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36" "-" "-" 0.918 310731 376 "-" - -
 
  二、Hadoop中创建使用目录
  #在Hadoop2.2.0中创建Hive外部表中要用到的目录
1
2
3
4
5
root@m1:/home/hadoop# /home/hadoop/hadoop-2.2.0/bin/hadoop fs -mkdir /user/hive/warehouse/nginxlog
root@m1:/home/hadoop# /home/hadoop/hadoop-2.2.0/bin/hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x   - root supergroup          0 2014-01-22 23:13 /user/hive/warehouse/nginxlog
root@m1:/home/hadoop#
  三、Hive的日志表创建,并同步到Hbase中
    1)创建自增长ID的java类UDFRowSequence.java,并添加到Hive环境中
  #在Eclipse中创建Maven项目,然后使用Maven将项目打包成Jar文件,过程中需要引入hadoop-common-2.2.0.jar、hive-exec-0.13.1.jar这两个文件,可以在Hadoop2.2.0目录以及Hive0.13.1目录中找到。如果不会使用Maven将项目打包,可以参考这篇文章《Golang、Php、Python、Java基于Thrift0.9.1实现跨语言调用》中实现Java客户端部分,有如何打包的方法。
  UDFRowSequence.java代码如下,使用Maven会打包成一个idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar文件,上传到Hive的lib目录下。本文后面会提供代码下载:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
package idoall.org.hive;
 
 
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;
 
/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  private LongWritable result = new LongWritable();
 
  public UDFRowSequence() {
    result.set(0);
  }
 
  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}
    2)在Hive中创建日志表
  #然后将我们自定义的idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar,添加到Hive运行环境中
1
2
3
4
hive> ADD JAR /home/hadoop/hive-0.13.1/lib/idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar;
Added /home/hadoop/hive-0.13.1/lib/idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar to class path
Added resource: /home/hadoop/hive-0.13.1/lib/idoall.org-0.0.1-SNAPSHOT-jar-with-dependencies.jar
hive>
  #在Hive中创建自定义函数对数据进行处理,如果不想每次都执行这个命令,可以在hive-site.xml里面的<name>hive.aux.jars.path</name> 节点下面,把jar文件加到<value>节点中。
1
2
3
hive> CREATE TEMPORARY FUNCTION rowSequence AS 'idoall.org.hive.UDFRowSequence';
OK
Time taken: 0.048 seconds
  #在Hive中创建收集Nginx的日志表(nginx_accesslog)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
create external table nginx_accesslog(
host string,
hostuser string,
times string,
requestmethond string,
requesturl string,
requesthttp string,
status string,
body_bytes_sent string,
referer string,
useragent string,
http_x_forwarded_for string,
gzip_ratio string,
request_time string,
bytes_sent string,
request_length string,
upstream_addr string,
upstream_status string,
upstream_response_time string)
PARTITIONED BY(YEAR STRING, MONTH STRING, DAY STRING)
row format SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*)\\s+-\\s+(.+?|-)\\s+\\[(.*)\\]\\s+\"([^ ]*)\\s+([^ ]*)\\s+([^ |\"]*)\"\\s+(-|[0-9]*)\\s+(-|[0-9]*)\\s+\"(.+?|-)\"\\s+\"(.+?|-)\"\\s+\"(.+?|-)\"\\s+\"(.+?|-)\"\\s+(.+?|-)\\s+(.+?|-)\\s+(.+?|-)\\s+\"(.+?|-)\"\\s+(.+?|-)\\s+(.*)","output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s %18$s")
STORED AS TEXTFILE
location '/user/hive/warehouse/nginxlog';
    3)将Nginx的日志数据导入到Hive日志表中
关于导入数据过程中overwrite的特性:
•目标表(或者分区)中的内容(如果有)会被删除,然后再将 filepath 指向的文件/目录中的内容添加到表/分区中。
•如果目标表(分区)已经有一个文件,并且文件名和 filepath 中的文件名冲突,那么现有的文件会被新文件所替代
如果不使用overwrite,改用into,只会增加记录。
 
      方法一:将本地目录中的文件导入到Hive中
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
hive> LOAD DATA LOCAL INPATH '/home/hadoop/hive-0.13.1/a.com.access.20140821.log' OVERWRITE INTO TABLE nginx_accesslog partition (YEAR='2014', MONTH='08',DAY='21');
Copying data from file:/home/hadoop/hive-0.13.1/a.com.access.20140821.log
Copying file: file:/home/hadoop/hive-0.13.1/a.com.access.20140821.log
Loading data to table default.nginx_accesslog partition (year=2014, month=08, day=21)
Partition default.nginx_accesslog{year=2014, month=08, day=21} stats: [numFiles=1, numRows=0, totalSize=3483, rawDataSize=0]
OK
Time taken: 1.046 seconds
hive> select count(0) from nginx_accesslog;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1408550631561_0005, Tracking URL = http://m1:8088/proxy/application_1408550631561_0005/
Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1408550631561_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-08-22 23:19:55,322 Stage-1 map = 0%,  reduce = 0%
2014-08-22 23:20:01,669 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.74 sec
2014-08-22 23:20:08,926 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.59 sec
MapReduce Total cumulative CPU time: 1 seconds 590 msec
Ended Job = job_1408550631561_0005
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.59 sec   HDFS Read: 3734 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 590 msec
OK
13
Time taken: 24.762 seconds, Fetched: 1 row(s)
hive>
 
      方法二:从HDFS导入数据,我们先将表删除,重新创建一次,然后尝试从HDFS导入数据
  #先用Hadoop命令,将文件上传到HDFS中
1
2
3
4
5
root@m1:/home/hadoop# /home/hadoop/hadoop-2.2.0/bin/hadoop fs -copyFromLocal /home/hadoop/hive-0.13.1/a.com.access.20140821.log /user/hive/warehouse/nginxlog/
root@m1:/home/hadoop# /home/hadoop/hadoop-2.2.0/bin/hadoop fs -ls /user/hive/warehouse/nginxlog
Found 1 items
-rw-r--r--   3 root supergroup       3483 2014-08-22 23:18 /user/hive/warehouse/nginxlog/a.com.access.20140821.log
root@m1:/home/hadoop#
 
  #执行Hive命令,从HDFS导入数据,然后使用SELECT语句,可以看到Nginx的日志数据已经成功导入到Hive表中
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#先删除表
hive> drop table nginx_accesslog;
OK
Time taken: 0.363 seconds
hive>
#再创建表
....此处省略,参考上文命令重新创建一次
#从HDFS导入数据(如果文件存在,要先删除),从下图可以看到,数据导入成功
hive> LOAD DATA inpath '/user/hive/warehouse/nginxlog/a.com.access.20140821.log' overwrite INTO TABLE nginx_accesslog partition (YEAR='2014', MONTH='08',DAY='21');
Loading data to table default.nginx_accesslog partition (year=2014, month=08, day=21)
Partition default.nginx_accesslog{year=2014, month=08, day=21} stats: [numFiles=1, numRows=0, totalSize=3483, rawDataSize=0]
OK
Time taken: 0.373 seconds
hive> select * from nginx_accesslog limit 100;
OK
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     /       HTTP/1.1        200     2373    -       Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       2.78    0.004   2683    369     unix:/var/run/php5-fpm.sock     200     0.004   2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/css/base.css   HTTP/1.1        200     940     http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   1247    373     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/a3e2e507/jquery.min.js  HTTP/1.1        200     93636   http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.152   93976   359     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/logo.png HTTP/1.1        200     6059    http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   6369    377     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p02.jpg  HTTP/1.1        200     22177   http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   22489   376     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p03.png  HTTP/1.1        200     3012    http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   3321    376     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/two-dimension-code1.png  HTTP/1.1        200     761     http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   1069    392     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/bg.png   HTTP/1.1        200     11474   http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   11785   375     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p04.png  HTTP/1.1        200     2860    http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.000   3169    376     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p06.png  HTTP/1.1        200     74097   http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.062   74409   376     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:45 +0800      GET     8.8.8.8/b519d8ca/image/p05.png  HTTP/1.1        200     132072  http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.256   132385  376     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:46 +0800      GET     8.8.8.8/b519d8ca/image/p07.png  HTTP/1.1        200     207987  http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.592   208300  376     -       -       -       2014    08      21
8.8.8.8 -       22/Aug/2014:20:23:46 +0800      GET     8.8.8.8/b519d8ca/image/p01.png  HTTP/1.1        200     310418  http://xxx.com/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36 -       -       0.918   310731  376     -       -       -       2014    08      21
Time taken: 0.056 seconds, Fetched: 13 row(s)
hive>
 
      方法三:通过如Flume、Kafka一类的工具监控日志数据,自动导入
   就不在这里细说,可以参考这两篇文章里面的实例,根据自己的业务需求修改代码:《Flume1.5.0的安装、部署、简单应用(含分布式、与hadoop2.2.0、hbase0.96的案例)》、《Flume+Kafka+Strom基于分布式环境的结合使用
 
    4)在Hive中创建Hbase可以识别的表
  #在Hive中创建Hbase可以识别的表,相当于一张中间表了。同时将之前的测试数据导入到这张中间表,会自动同步到Hbase。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
CREATE TABLE h2b_nginx_accesslog(
key int,
host string,
hostuser string,
times string,
requestmethond string,
requesturl string,
requesthttp string,
status string,
body_bytes_sent string,
referer string,
useragent string,
http_x_forwarded_for string,
gzip_ratio string,
request_time string,
bytes_sent string,
request_length string,
upstream_addr string,
upstream_status string,
upstream_response_time string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,log:host,log:hostuser,log:times,log:requestmethond,log:requesturl,log:requesthttp,log:status,log:body_bytes_sent,log:referer,log:useragent,log:http_x_forwarded_for,log:gzip_ratio,log:request_time,log:bytes_sent,log:request_length,log:upstream_addr,log:upstream_status,log:upstream_response_time") TBLPROPERTIES ("hbase.table.name" = "h2b_nginx_accesslog");
  #这时在Hbase中,可以看到h2b_nginx_accesslog表自动创建了;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
hbase(main):002:0> list
TABLE                                                                                                                                                                                                                                                   
h2b_nginx_accesslog                                                                                                                                                                                                         
1 row(s) in 0.1220 seconds
 
=> ["h2b_nginx_accesslog"]
/*  查看表结构时,只会显示列族,而不会显示列。Hbase表中的每个列,都归属与某个列族。列族是表的chema的一部分(而列不是)。*/
hbase(main):003:0> describe "h2b_nginx_accesslog"
DESCRIPTION                                                                                                                                ENABLED                                                                    
 'h2b_nginx_accesslog', {NAME => 'log', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', CO true                                                                       
 MPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false',                                                                            
  BLOCKCACHE => 'true'}                                                                                                                                                                                               
1 row(s) in 0.5890 seconds
 
hbase(main):004:0>
    5)将Hive的日志表,使用自增长Id做为rowkey导入到Hbase可识别的表中
  #将Hive的日志表(nginx_accesslog)导入到表(h2b_nginx_accesslog)中,会自动同步到hbase。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
insert overwrite table h2b_nginx_accesslog
select a.*
from (select
rowSequence(),
host,
hostuser,
times,
requestmethond,
requesturl,
requesthttp,
status,
body_bytes_sent,
referer,
useragent,
http_x_forwarded_for,
gzip_ratio,
request_time,
bytes_sent,
request_length,
upstream_addr,
upstream_status,
upstream_response_time
from nginx_accesslog) a;
  #导入到中间表时运行的结果,如果你也能够看到和下面类似的日志,恭喜你成功了。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
hive> insert overwrite table h2b_nginx_accesslog
    > select a.*
    > from (select
    > rowSequence(),
    > host,
    > hostuser,
    > times,
    > requestmethond,
    > requesturl,
    > requesthttp,
    > status,
    > body_bytes_sent,
    > referer,
    > useragent,
    > http_x_forwarded_for,
    > gzip_ratio,
    > request_time,
    > bytes_sent,
    > request_length,
    > upstream_addr,
    > upstream_status,
    > upstream_response_time
    > from nginx_accesslog) a;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1408550631561_0017, Tracking URL = http://m1:8088/proxy/application_1408550631561_0017/
Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1408550631561_0017
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2014-08-24 11:57:24,051 Stage-0 map = 0%,  reduce = 0%
2014-08-24 11:57:32,403 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 1.96 sec
MapReduce Total cumulative CPU time: 1 seconds 960 msec
Ended Job = job_1408550631561_0017
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.96 sec   HDFS Read: 3734 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 960 msec
OK
Time taken: 20.378 seconds
hive>
    6)到Hbase中验证效果使用get、scan、filter获取指定数据
  #我们再登录到Hbase中看数据是否已经导入
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
hbase(main):013:0> get "h2b_nginx_accesslog",1
COLUMN                                                 CELL                                                                                                                                                           
 log:body_bytes_sent                                   timestamp=1408852652522, value=2373                                                                                                                            
 log:bytes_sent                                        timestamp=1408852652522, value=2683                                                                                                                            
 log:gzip_ratio                                        timestamp=1408852652522, value=2.78                                                                                                                            
 log:host                                              timestamp=1408852652522, value=8.8.8.8                                                                                                                         
 log:hostuser                                          timestamp=1408852652522, value=-                                                                                                                               
 log:http_x_forwarded_for                              timestamp=1408852652522, value=-                                                                                                                               
 log:referer                                           timestamp=1408852652522, value=-                                                                                                                               
 log:request_length                                    timestamp=1408852652522, value=369                                                                                                                             
 log:request_time                                      timestamp=1408852652522, value=0.004                                                                                                                           
 log:requesthttp                                       timestamp=1408852652522, value=HTTP/1.1                                                                                                                        
 log:requestmethond                                    timestamp=1408852652522, value=GET                                                                                                                             
 log:requesturl                                        timestamp=1408852652522, value=/                                                                                                                               
 log:status                                            timestamp=1408852652522, value=200                                                                                                                             
 log:times                                             timestamp=1408852652522, value=22/Aug/2014:20:23:45 +0800                                                                                                      
 log:upstream_addr                                     timestamp=1408852652522, value=unix:/var/run/php5-fpm.sock                                                                                                     
 log:upstream_response_time                            timestamp=1408852652522, value=0.004                                                                                                                           
 log:upstream_status                                   timestamp=1408852652522, value=200                                                                                                                             
 log:useragent                                         timestamp=1408852652522, value=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36          
18 row(s) in 0.0440 seconds
hbase(main):015:0> get "h2b_nginx_accesslog",1,{COLUMN => 'log:useragent'}
COLUMN                                                 CELL                                                                                                                                                           
 log:useragent                                         timestamp=1408852652522, value=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Safari/537.36          
1 row(s) in 0.0080 seconds
  #使用Filter过滤requesturl列中包含p04.png的记录,使用之前要先导入子串
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
hbase(main):031:0> import org.apache.hadoop.hbase.filter.CompareFilter
hbase(main):032:0> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
hbase(main):033:0> import org.apache.hadoop.hbase.filter.SubstringComparator
hbase(main):034:0> import org.apache.hadoop.hbase.util.Bytes
hbase(main):035:0> scan "h2b_nginx_accesslog",{FILTER => SingleColumnValueFilter.new(Bytes.toBytes('log'),Bytes.toBytes('requesturl'),CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new('p04.png'))}
ROW                                                    COLUMN+CELL                                                                                                                                                    
 9                                                     column=log:body_bytes_sent, timestamp=1408852652522, value=2860                                                                                                
 9                                                     column=log:bytes_sent, timestamp=1408852652522, value=3169                                                                                                     
 9                                                     column=log:gzip_ratio, timestamp=1408852652522, value=-                                                                                                        
 9                                                     column=log:host, timestamp=1408852652522, value=8.8.8.8                                                                                                        
 9                                                     column=log:hostuser, timestamp=1408852652522, value=-                                                                                                          
 9                                                     column=log:http_x_forwarded_for, timestamp=1408852652522, value=-                                                                                              
 9                                                     column=log:referer, timestamp=1408852652522, value=http://xxx.com/                                                                                             
 9                                                     column=log:request_length, timestamp=1408852652522, value=376                                                                                                  
 9                                                     column=log:request_time, timestamp=1408852652522, value=0.000                                                                                                  
 9                                                     column=log:requesthttp, timestamp=1408852652522, value=HTTP/1.1                                                                                                
 9                                                     column=log:requestmethond, timestamp=1408852652522, value=GET                                                                                                  
 9                                                     column=log:requesturl, timestamp=1408852652522, value=8.8.8.8/b519d8ca/image/p04.png                                                                           
 9                                                     column=log:status, timestamp=1408852652522, value=200                                                                                                          
 9                                                     column=log:times, timestamp=1408852652522, value=22/Aug/2014:20:23:45 +0800                                                                                    
 9                                                     column=log:upstream_addr, timestamp=1408852652522, value=-                                                                                                     
 9                                                     column=log:upstream_response_time, timestamp=1408852652522, value=-                                                                                            
 9                                                     column=log:upstream_status, timestamp=1408852652522, value=-                                                                                                   
 9                                                     column=log:useragent, timestamp=1408852652522, value=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.0 Sa
                                                       fari/537.36                                                                                                                                                    
1 row(s) in 0.0320 seconds
 
hbase(main):036:0>
  四、FAQ
  1)如果导入的内容使用select查看时,发现是null,那可能是你的正则写的有问题,可以使用下面这个工具验证你的正则是否有问题。
  #http://regex101.com,一个在线的正则表达式工具,很好用,如下图。
  #在线测试通过后,将正则表达式,复制到一个文本文件中,进行两次转义字符的替换就可以使用了,第1次,将一个右斜杠去你的成两个右斜杠:'\'替换成'\\';第二次将双引号替换成斜框双引号'"'替换成'\"'

 
  2)如果提示“org.apache.hadoop.hive.contrib.serde2.RegexSerDe”相关的错误,请在hive中执行以下命令:
1
hive> add jar /home/hjl/hive/lib/hive_contrib.jar;
 
  五、扩展阅读
 
 
  本文中提到的Java代码下载地址:http://pan.baidu.com/s/1c0hD4n2
 
---------------------------------------
强烈谴责不尊重版权,盗用本文,不写原文出处的网站:
http://www.aboutyun.com/thread-8914-1-1.html
---------------------------------
 

Nginx日志导入到Hive0.13.1,同步Hbase0.96.2,设置RowKey为autoincrement(ID自增长)的更多相关文章

  1. ubuntu12.04+hadoop2.2.0+zookeeper3.4.5+hbase0.96.2+hive0.13.1伪分布式环境部署

    目录: 一.hadoop2.2.0.zookeeper3.4.5.hbase0.96.2.hive0.13.1都是什么? 二.这些软件在哪里下载? 三.如何安装 1.安装JDK 2.用parallel ...

  2. Nginx 日志分享

    Nginx 日志对于大部分人来说是个未被发掘的宝藏,总结之前做某日志分析系统的经验,和大家分享一下 Nginx 日志的纯手工分析方式. Nginx 日志相关配置有 2 个地方:access_log 和 ...

  3. Nginx日志文件配置与切割

    Nginx日志的指令主要有两条: log_format,设置日志的格式 access_log,指定日志文件的存放路径.格式和缓存大小 两条指令在Nginx配置文件中的位置可以在http{……..}之间 ...

  4. 【转】纯手工玩转 Nginx 日志

    Nginx 日志对于大部分人来说是个未被发掘的宝藏,总结之前做某日志分析系统的经验,和大家分享一下 Nginx 日志的纯手工分析方式. Nginx 日志相关配置有 2 个地方:access_log 和 ...

  5. rsync同步Nginx日志遇到问题总结

    一.目的 将nginx 日志通过普通用户利用rsync公钥认证的方式实时同步到本地服务器上,之后使用elk程序进行处理. 二.遇到问题及解决方法思路 问题1.文件权限:nginx 的日志默认权限如下: ...

  6. 使用 Heka 导入自定义的nginx日志到Elasticsearch

    重置Heka执行进度 heka的进度配置文件存在配置项 base_dir 设置的目录,只需要删除这个文件夹下面的内容,就可以完全重置heka的进度. base_dir 配置项默认是在下面目录: '/v ...

  7. Nginx日志通过Flume导入到HDFS中

    关注公众号:分享电脑学习回复"百度云盘" 可以免费获取所有学习文档的代码(不定期更新) flume上传到hdfs: 当我们的数据量比较大时,比如每天的日志文件达到5G以上 使用ha ...

  8. Hadoop-2.2.0 + Hbase-0.96.2 + Hive-0.13.1(转)

    From:http://www.itnose.net/detail/6065872.html # 需要软件 Hadoop-2.2.0(目前Apache官网最新的Stable版本) Hbase-0.96 ...

  9. _00018 Hadoop-2.2.0 + Hbase-0.96.2 + Hive-0.13.1 分布式环境整合,Hadoop-2.X使用HA方式

    博文作者:妳那伊抹微笑 itdog8 地址链接 : http://www.itdog8.com(个人链接) 博客地址:http://blog.csdn.net/u012185296 个性签名:世界上最 ...

随机推荐

  1. static代码块与{}代码块的比较

    第一个例子: public class StaticDemo { { System.out.println("{} 代码块"); } static{ System.out.prin ...

  2. google protobuf ios开发使用

    简介: protobuf 即 google protocol buffer 是一种数据封装格式协议: 比如其他经常用的xml,json等格式:protobuf的优势是效率高,同样的一份数据使用prot ...

  3. redis 持久化 如果 AOF 文件出错了,怎么办?

    服务器可能在程序正在对 AOF 文件进行写入时停机, 如果停机造成了 AOF 文件出错(corrupt), 那么 Redis 在重启时会拒绝载入这个 AOF 文件, 从而确保数据的一致性不会被破坏. ...

  4. Java第一个程序

    public class HelloWorld{ public static void main(String[] args){ System.out.println("七期项目二部&quo ...

  5. Unity NGUI 2D场景添加按钮

    比如说先添加一个sprite 在sprite加上NGUI的 UI Button 然后重点来了  加上Box Collider 2D(重点:2D 千万不要加 Box Collider) 将Box Col ...

  6. 亲测!阿里云公共DNS,感觉不错!

    最近阿里推出了公共DNS,这对于普通的网友来说估计没什么用处,但对于我们建站人来说,确实是一个不错的消息.一听说阿里出公共DNS,博主就立马换电信的DNS换下了.经过这几天的测试,相当满意! 个人感觉 ...

  7. SQLServer更新语句要注意

    在SQLServer中 update语句中对于表不能使用别名 eg:update table a set a.column="" where ... 这样在SQLServer中是不 ...

  8. asp.net页面跳转sessionid会变

    今天发现在一个Asp.net站点中, 同一次登录, 不停刷新页面或者页面跳转, 此时后台Session的SessionID总是变化的. 创建一个页面,添加一个button,后台代码非常简单,如下   ...

  9. 用MySQL实现分页查询

    MySQL中实现分页查询语句: //定义分页需要的变量 int pageNow=2;//当前页 int pageSize=3;//指定每页显示3条记录 int pageCount=1;//该值是计算出 ...

  10. Android Spinner控件数据绑定

    Java代码