1. In the root of the HDFS file system, recursively create the directory "1daoyun/file", upload the attached BigDataSkills.txt file into the 1daoyun/file directory, and use the appropriate command to list the files in the 1daoyun/file directory.

Answer:

[root@master MapReduce]# hadoop fs -mkdir -p /1daoyun/file

[root@master MapReduce]# hadoop fs -put BigDataSkills.txt /1daoyun/file

[root@master MapReduce]# hadoop fs -ls /1daoyun/file

Found 1 items

-rw-r--r--   3 root hdfs       1175 2018-02-12 08:01 /1daoyun/file/BigDataSkills.txt

2. In the root of the HDFS file system, recursively create the directory "1daoyun/file" and upload the attached BigDataSkills.txt file into the 1daoyun/file directory, specifying a replication factor of 2 for BigDataSkills.txt during the upload. Then use the fsck tool to check the number of replicas of its blocks.

Answer:

[root@master MapReduce]# hadoop fs -mkdir -p /1daoyun/file

[root@master MapReduce]# hadoop fs -D dfs.replication=2 -put BigDataSkills.txt /1daoyun/file

[root@master MapReduce]# hadoop fsck /1daoyun/file/BigDataSkills.txt

DEPRECATED: Use of this script to execute hdfs command is deprecated.

Instead use the hdfs command for it.

Connecting to namenode via http://master.hadoop:50070/fsck?ugi=root&path=%2F1daoyun%2Ffile%2FBigDataSkills.txt

FSCK started by root (auth:SIMPLE) from /10.0.6.123 for path /1daoyun/file/BigDataSkills.txt at Mon Feb 12 08:11:47 UTC 2018

.

/1daoyun/file/BigDataSkills.txt:  Under replicated BP-297530755-10.0.6.123-1518056860260:blk_1073746590_5766. Target Replicas is 2 but found 1 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).

Status: HEALTHY

Total size: 1175 B

Total dirs: 0

Total files: 1

Total symlinks: 0

Total blocks (validated): 1 (avg. block size 1175 B)

Minimally replicated blocks: 1 (100.0 %)

Over-replicated blocks: 0 (0.0 %)

Under-replicated blocks: 1 (100.0 %)

Mis-replicated blocks: 0 (0.0 %)

Default replication factor: 3

Average block replication: 1.0

Corrupt blocks: 0

Missing replicas: 1 (50.0 %)

Number of data-nodes: 1

Number of racks: 1

FSCK ended at Mon Feb 12 08:11:47 UTC 2018 in 1 milliseconds

The filesystem under path '/1daoyun/file/BigDataSkills.txt' is HEALTHY
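
For reference, a replication factor can also be changed after a file is already in HDFS; a minimal sketch using the file from this task (a block stays under-replicated whenever the requested factor exceeds the number of live DataNodes, which is why fsck flags it on this single-DataNode cluster):

# change the replication factor of the existing file to 2
hadoop fs -setrep 2 /1daoyun/file/BigDataSkills.txt
# re-check the block report, including block locations
hdfs fsck /1daoyun/file/BigDataSkills.txt -files -blocks -locations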

3. The root of the HDFS file system contains a directory /apps. Enable the snapshot feature on this directory, create a snapshot of it named apps_1daoyun, and use the appropriate command to list the snapshot.

Answer:

[hdfs@master ~]# hadoop dfsadmin -allowSnapshot /apps

Allowing snaphot on /apps succeeded

[hdfs@master ~]# hadoop fs -createSnapshot /apps apps_1daoyun

Created snapshot /apps/.snapshot/apps_1daoyun

[hdfs@master ~]# hadoop fs -ls /apps/.snapshot

Found 1 items

drwxrwxrwx   - hdfs hdfs          0 2017-05-07 09:48 /apps/.snapshot/apps_1daoyun
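
As a hedged supplement to the task, the related snapshot commands below list every snapshottable directory and remove the snapshot again; snapshots must be deleted before snapshotting can be disallowed on the directory:

# list directories on which snapshots are currently allowed
hdfs lsSnapshottableDir
# delete the snapshot created above, then disallow snapshots on /apps
hadoop fs -deleteSnapshot /apps apps_1daoyun
hdfs dfsadmin -disallowSnapshot /apps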

4. To prevent operators from deleting files by mistake, HDFS provides a trash (recycle bin) feature, but too many trash files consume a large amount of storage space. In the Linux shell, use the vi command to edit the relevant configuration file and parameter so that the trash feature is disabled. When done, restart the affected services.

Answer:

[root@master ~]# vi /etc/hadoop/2.6.1.0-129/0/hdfs-site.xml

<property>

<name>fs.trash.interval</name>

<value>0</value>

</property>

[root@master ~]# su - hdfs

Last login: Mon May  8 09:31:52 UTC 2017

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop namenode

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start namenode

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop datanode

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start datanode
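
To confirm that the trash is really off after the restart, the effective value of fs.trash.interval can be read back from the client configuration; a minimal sketch (a value of 0 means the trash feature is disabled, and /tmp/testfile.txt is a hypothetical file used only to show the immediate delete):

# print the value the HDFS client resolves for fs.trash.interval
hdfs getconf -confKey fs.trash.interval
# with the trash disabled, -rm deletes immediately instead of moving the file to .Trash
hadoop fs -rm /tmp/testfile.txt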

5. Use a command to report the number of directories, the number of files, and the total file size under the /tmp directory of the HDFS file system.

Answer:

[root@master ~]# hadoop fs -count  /tmp

21            6               4336 /tmp
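
The three numbers printed by -count are, in order, the directory count, the file count, and the total content size in bytes (here 21 directories, 6 files, 4336 bytes). A hedged sketch of closely related forms of the same check:

# the same counts with human-readable sizes
hadoop fs -count -h /tmp
# total space used under /tmp, summarized and human readable
hadoop fs -du -s -h /tmp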

6. On the cluster nodes, the directory /usr/hdp/2.6.1.0-129/hadoop-mapreduce/ contains an example JAR package, hadoop-mapreduce-examples.jar. Run the wordcount program in this JAR to count the words in /1daoyun/file/BigDataSkills.txt, write the result to the /1daoyun/output directory, and use the appropriate command to query the word-count result.

Answer:

[root@master ~]# hadoop jar /usr/hdp/2.6.1.0-129/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.1.0-129.jar wordcount /1daoyun/file/BigDataSkills.txt /1daoyun/output

[root@master ~]# hadoop fs -cat /1daoyun/output/part-r-00000

"duiya  1

hello   1

nisibusisha     1

wosha"  1

zsh     1
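
Two practical notes on the wordcount run, written as a hedged sketch: the job fails if the output directory already exists, and the result may be split across several part-r-* files, so removing a stale output directory and listing the new one are safe habits:

# remove a stale output directory before re-running the job
hadoop fs -rm -r -skipTrash /1daoyun/output
# after the job, list the output directory and print every reducer output file
hadoop fs -ls /1daoyun/output
hadoop fs -cat /1daoyun/output/part-r-*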

7. On the cluster nodes, the directory /usr/hdp/2.6.1.0-129/hadoop-mapreduce/ contains an example JAR package, hadoop-mapreduce-examples.jar. Run the sudoku program in this JAR to solve the Sudoku puzzle shown in the grid below (blank cells are marked with "?").

8 ? ? ? ? ? ? ? ?
? ? 3 6 ? ? ? ? ?
? 7 ? ? 9 ? 2 ? ?
? 5 ? ? ? 7 ? ? ?
? ? ? ? 4 5 7 ? ?
? ? ? 1 ? ? ? 3 ?
? ? 1 ? ? ? ? 6 8
? ? 8 5 ? ? ? 1 ?
? 9 ? ? ? ? 4 ? ?

Answer:

[root@master ~]# cat puzzle1.dta

8 ? ? ? ? ? ? ? ?

? ? 3 6 ? ? ? ? ?

? 7 ? ? 9 ? 2 ? ?

? 5 ? ? ? 7 ? ? ?

? ? ? ? 4 5 7 ? ?

? ? ? 1 ? ? ? 3 ?

? ? 1 ? ? ? ? 6 8

? ? 8 5 ? ? ? 1 ?

? 9 ? ? ? ? 4 ? ?

[root@master hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar sudoku /root/puzzle1.dta

WARNING: Use "yarn jar" to launch YARN applications.

Solving /root/puzzle1.dta

8 1 2 7 5 3 6 4 9

9 4 3 6 8 2 1 7 5

6 7 5 4 9 1 2 8 3

1 5 4 2 3 7 8 9 6

3 6 9 8 4 5 7 2 1

2 8 7 1 6 9 5 3 4

5 2 1 9 7 4 3 6 8

4 3 8 5 2 6 9 1 7

7 9 6 3 1 8 4 5 2

Found 1 solutions

8. On the cluster nodes, the directory /usr/hdp/2.6.1.0-129/hadoop-mapreduce/ contains an example JAR package, hadoop-mapreduce-examples.jar. Run the grep program in this JAR to count how many times "Hadoop" appears in /1daoyun/file/BigDataSkills.txt, and query the result once the job completes.

Answer:

[root@master hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar grep /1daoyun/file/BigDataSkills.txt /output hadoop

[root@master hadoop-mapreduce]# hadoop fs -cat /output/part-r-00000

2       hadoop
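
The pattern argument of the grep example is a Java regular expression and is case-sensitive, so the command above counts the literal string "hadoop" rather than the "Hadoop" named in the task. A hedged sketch that matches both spellings (the output directory must not exist before the run):

hadoop fs -rm -r -skipTrash /output
hadoop jar hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar grep /1daoyun/file/BigDataSkills.txt /output '[Hh]adoop'
hadoop fs -cat /output/part-r-00000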

9. Start the HBase database of the XianDian big data platform, using the RegionServer on the master node. Start the HBase shell from the Linux shell and check which system user is currently logged in to the HBase shell. (Use lowercase for all database commands.)

Answer:

hbase(main):003:0> whoami

root (auth:SIMPLE)

groups: root

10. Enable HBase security authorization. In the HBase shell, grant the root user read, write, and execute permissions on the table xiandian_user, then use the appropriate command to view its permission information.

Answer:

Parameter: Enable Authorization

Parameter value: native

hbase(main):002:0> grant 'root','RWX','xiandian_user'

0 row(s) in 0.4800 seconds

hbase(main):003:0> user_permission 'xiandian_user'

User                                             Namespace,Table,Family,Qualifier:Permission

root                                            default,xiandian_user,,: [Permission: actions=READ,WRITE,EXEC]

1 row(s) in 0.1180 seconds
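
For completeness, a hedged sketch of how the same grant could be revoked and re-checked non-interactively, relying on the HBase shell reading statements from standard input:

hbase shell <<'EOF'
revoke 'root', 'xiandian_user'
user_permission 'xiandian_user'
EOF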

11. Log in to the HBase database and create a table named member with the column families 'address' and 'info'. After creating the table, insert the following data:

'xiandianA','info:age','24'

'xiandianA','info:birthday','1990-07-17'

'xiandianA','info:company','alibaba'

'xiandianA','address:contry','china'

'xiandianA','address:province','zhejiang'

'xiandianA','address:city','hangzhou'

After the inserts, query all info columns of xiandianA in the member table, then change xiandianA's age to 99 and query only info:age.

Answer:

hbase(main):001:0> create 'member','address','info'

0 row(s) in 1.5730 seconds

=> Hbase::Table - member

hbase(main):002:0> list

TABLE

emp

member

2 row(s) in 0.0240 seconds

hbase(main):007:0> put'member','xiandianA','info:age','24'

0 row(s) in 0.1000 seconds

hbase(main):008:0> put'member','xiandianA','info:birthday','1990-07-17'

0 row(s) in 0.0130 seconds

hbase(main):010:0> put'member','xiandianA','info:company','alibaba'

0 row(s) in 0.0080 seconds

hbase(main):011:0> put'member','xiandianA','address:contry','china'

0 row(s) in 0.0080 seconds

hbase(main):012:0> put'member','xiandianA','address:province','zhejiang'

0 row(s) in 0.0070 seconds

hbase(main):013:0> put'member','xiandianA','address:city','hangzhou'

0 row(s) in 0.0090 seconds

hbase(main):014:0> get 'member','xiandianA','info'

COLUMN                  CELL

info:age               timestamp=1522140592336, value=24

info:birthday          timestamp=1522140643072, value=1990-07-17

info:company           timestamp=1522140745172, value=alibaba

3 row(s) in 0.0170 seconds

hbase(main):015:0>

hbase(main):016:0* put 'member','xiandianA','info:age','99'

0 row(s) in 0.0080 seconds

hbase(main):018:0> get 'member','xiandianA','info:age'

COLUMN                  CELL

info:age               timestamp=1522141564423, value=99

1 row(s) in 0.0140 seconds

12. In a relational database system, a namespace is a logical grouping of tables; tables in the same group serve a similar purpose. Log in to the HBase database, create a namespace named newspace and verify it with list, then create a table member in this namespace with the column families 'address' and 'info'. After creating the table, insert the following data:

'xiandianA','info:age','24'

'xiandianA','info:birthday','1990-07-17'

'xiandianA','info:company','alibaba'

'xiandianA','address:contry','china'

'xiandianA','address:province','zhejiang'

'xiandianA','address:city','hangzhou'

After the inserts, use the scan command to query only info:age from the table, specifying xiandianA as the STARTROW.

Answer:

hbase(main):022:0> create_namespace 'newspace'

0 row(s) in 0.1130 seconds

hbase(main):024:0> list

TABLE

emp

member

newspace:member

3 row(s) in 0.0100 seconds

=> ["emp", "member", "newspace:member"]

hbase(main):023:0> create 'newspace:member','address','info'

0 row(s) in 1.5270 seconds

hbase(main):033:0> put 'newspace:member','xiandianA','info:age','24'

0 row(s) in 0.0620 seconds

hbase(main):037:0> put 'newspace:member','xiandianA','info:birthday','1990-07-17'

0 row(s) in 0.0110 seconds

hbase(main):038:0> put 'newspace:member','xiandianA','info:company','alibaba'

0 row(s) in 0.0130 seconds

hbase(main):039:0> put 'newspace:member','xiandianA','address:contry','china'

0 row(s) in 0.0070 seconds

hbase(main):040:0> put 'newspace:member','xiandianA','address:province','zhejiang'

0 row(s) in 0.0070 seconds

hbase(main):041:0> put 'newspace:member','xiandianA','address:city','hangzhou'

0 row(s) in 0.0070 seconds

hbase(main):044:0> scan 'newspace:member', {COLUMNS => ['info:age'],STARTROW => 'xiandianA'}

ROW                                              COLUMN+CELL

xiandianA                                       column=info:age, timestamp=1522214952401, value=24

1 row(s) in 0.0160 seconds

13. Log in to the master node and create a local file named hbasetest.txt whose contents create a table 'test' with the column family 'cf' and batch-insert the following data into it:

'row1', 'cf:a', 'value1'

'row2', 'cf:b', 'value2'

'row3', 'cf:c', 'value3'

'row4', 'cf:d', 'value4'

After the inserts, use the scan command to query the table contents, then use the get command to query only row1, and finally exit the HBase shell. Run hbasetest.txt with the appropriate command, and submit both the contents of hbasetest.txt and the output returned by the command.

Answer:

[root@exam1 ~]# cat hbasetest.txt

create 'test', 'cf'

list 'test'

put 'test', 'row1', 'cf:a', 'value1'

put 'test', 'row2', 'cf:b', 'value2'

put 'test', 'row3', 'cf:c', 'value3'

put 'test', 'row4', 'cf:d', 'value4'

scan 'test'

get 'test', 'row1'

exit

[root@exam1 ~]# hbase shell hbasetest.txt

0 row(s) in 1.5010 seconds

TABLE

test

1 row(s) in 0.0120 seconds

0 row(s) in 0.1380 seconds

0 row(s) in 0.0090 seconds

0 row(s) in 0.0050 seconds

0 row(s) in 0.0050 seconds

ROW                     COLUMN+CELL

row1                   column=cf:a, timestamp=1522314428726, value=value1

row2                   column=cf:b, timestamp=1522314428746, value=value2

row3                   column=cf:c, timestamp=1522314428752, value=value3

row4                   column=cf:d, timestamp=1522314428758, value=value4

4 row(s) in 0.0350 seconds

COLUMN                  CELL

cf:a                   timestamp=1522314428726, value=value1

1 row(s) in 0.0190 seconds

14. Use Hive to create the table xd_phy_course, defined as an external table with the external storage location /1daoyun/data/hive, and load phy_course_xd.txt into it. The structure of xd_phy_course is shown in the table below. After the import, query the schema of the xd_phy_course table in Hive. (Use lowercase for all database commands.)

stname(string)

stID(int)

class(string)

opt_cour(string)

Answer:

hive> create external table xd_phy_course (stname string,stID int,class string,opt_cour string) row format delimited fields terminated by '\t' lines terminated by '\n' location '/1daoyun/data/hive';

OK

Time taken: 1.197 seconds

hive> load data local inpath '/root/phy_course_xd.txt' into table xd_phy_course;

Loading data to table default.xd_phy_course

Table default.xd_phy_course stats: [numFiles=1, totalSize=89444]

OK

Time taken: 0.96 seconds

hive> desc xd_phy_course;

OK

stname                  string

stid                    int

class                   string

opt_cour                string

Time taken: 0.588 seconds, Fetched: 4 row(s)
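
A hedged way to confirm that the table was really created as an external table at the requested location is to ask Hive for the formatted description, whose output includes Table Type and Location; a minimal sketch using the non-interactive hive -e form:

# EXTERNAL_TABLE and .../1daoyun/data/hive should appear in the output
hive -e "desc formatted xd_phy_course;"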

15. Use Hive to count, from phy_course_xd.txt, the total number of students at a university who signed up for each elective physical-education course. The structure of phy_course_xd.txt is shown in the table below, with the elective course in the opt_cour field. Load the counts into the table phy_opt_count and query its contents with a SELECT statement. (Use lowercase for all database commands.)

stname(string)

stID(int)

class(string)

opt_cour(string)

Answer:

hive> create table xd_phy_course (stname string,stID int,class string,opt_cour string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 4.067 seconds

hive> load data local inpath '/root/phy_course_xd.txt' into table xd_phy_course;

Loading data to table default.xd_phy_course

Table default.xd_phy_course stats: [numFiles=1, totalSize=89444]

OK

Time taken: 1.422 seconds

hive> create table phy_opt_count (opt_cour string,cour_count int) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 1.625 seconds

hive> insert overwrite table phy_opt_count select xd_phy_course.opt_cour,count(distinct xd_phy_course.stID) from xd_phy_course group by xd_phy_course.opt_cour;

Query ID = root_20170507125642_6af22d21-ae88-4daf-a346-4b1cbcd7d9fe

Total jobs = 1

Launching Job 1 out of 1

Tez session was closed. Reopening...

Session re-established.

Status: Running (Executing on YARN cluster with App id application_1494149668396_0004)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 4.51 s

--------------------------------------------------------------------------------

Loading data to table default.phy_opt_count

Table default.phy_opt_count stats: [numFiles=1, numRows=10, totalSize=138, rawDataSize=128]

OK

Time taken: 13.634 seconds

hive> select * from phy_opt_count;

OK

badminton       234

basketball      224

football        206

gymnastics      220

opt_cour        0

swimming        234

table tennis    277

taekwondo       222

tennis  223

volleyball      209

Time taken: 0.065 seconds, Fetched: 10 row(s)
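
The same per-course totals can also be read with an ad-hoc aggregate, without materializing phy_opt_count; a hedged sketch that mirrors the count(distinct stID) used in the insert statement above:

hive -e "select opt_cour, count(distinct stID) from xd_phy_course group by opt_cour;"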

16. Use Hive to compute, from phy_course_score_xd.txt, the average physical-education score of each class at a university, keeping two decimal places with the round function. The structure of phy_course_score_xd.txt is shown in the table below, with the class in the class field and the score in the score field. (Use lowercase for all database commands.)

stname(string)

stID(int)

class(string)

opt_cour(string)

score(float)

Answer:

hive> create table phy_course_score_xd (stname string,stID int,class string,opt_cour string,score float) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.339 seconds

hive> load data local inpath '/root/phy_course_score_xd.txt' into table phy_course_score_xd;

Loading data to table default.phy_course_score_xd

Table default.phy_course_score_xd stats: [numFiles=1, totalSize=1910]

OK

Time taken: 1.061 seconds

hive> select class,round(avg(score)) from phy_course_score_xd group by class;

Query ID = root_20170507131823_0bfb1faf-3bfb-42a5-b7eb-3a6a284081ae

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1494149668396_0005)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 26.68 s

--------------------------------------------------------------------------------

OK

Network_1401    73.0

Software_1403   72.0

class   NULL

Time taken: 27.553 seconds, Fetched: 3 row(s)
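
Note that round(avg(score)) as recorded above rounds to a whole number, while the task asks for two decimal places; a hedged sketch of the form that matches the wording of the task:

hive -e "select class, round(avg(score), 2) from phy_course_score_xd group by class;"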

17. Use Hive to compute, from phy_course_score_xd.txt, the highest physical-education score of each class at a university. The structure of phy_course_score_xd.txt is shown in the table below, with the class in the class field and the score in the score field. (Use lowercase for all database commands.)

stname(string)

stID(int)

class(string)

opt_cour(string)

score(float)

Answer:

hive> create table phy_course_score_xd (stname string,stID int,class string,opt_cour string,score float) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.339 seconds

hive> load data local inpath '/root/phy_course_score_xd.txt' into table phy_course_score_xd;

Loading data to table default.phy_course_score_xd

Table default.phy_course_score_xd stats: [numFiles=1, totalSize=1910]

OK

Time taken: 1.061 seconds

hive> select class,max(score) from phy_course_score_xd group by class;

Query ID = root_20170507131942_86a2bf55-49ac-4c2e-b18b-8f63191ce349

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1494149668396_0005)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.08 s

--------------------------------------------------------------------------------

OK

Network_1401    95.0

Software_1403   100.0

class   NULL

Time taken: 144.035 seconds, Fetched: 3 row(s)

18. In the Hive data warehouse, merge the separate request_date and request_time fields of the web log weblog_entries.txt into a single value joined by an underscore "_". The structure of weblog_entries.txt is shown in the table below. (Use lowercase for all database commands.)

md5(STRING)

url(STRING)

request_date (STRING)

request_time (STRING)

ip(STRING)

Answer:

hive> create table weblog_entries (md5 string,url string,request_date string,request_time string,ip string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.502 seconds

hive> load data local inpath '/root/weblog_entries.txt' into table weblog_entries;

Loading data to table default.weblog_entries

Table default.weblog_entries stats: [numFiles=1, totalSize=251130]

OK

Time taken: 1.203 seconds

hive> select concat_ws('_', request_date, request_time) from weblog_entries;

2012-05-10_21:29:01

2012-05-10_21:13:47

2012-05-10_21:12:37

2012-05-10_21:34:20

2012-05-10_21:27:00

2012-05-10_21:33:53

2012-05-10_21:10:19

2012-05-10_21:12:05

2012-05-10_21:25:58

2012-05-10_21:34:28

Time taken: 0.265 seconds, Fetched: 3000 row(s)

19. Use Hive to create a table dynamically from the results of a query over the web log weblog_entries.txt. Create a new table named weblog_entries_url_length that defines three fields from the web log: url, request_date, and request_time, plus a new field named "url_length" that holds the length of the url string. The structure of weblog_entries.txt is shown in the table below. When done, query the contents of the weblog_entries_url_length table. (Use lowercase for all database commands.)

md5(STRING)

url(STRING)

request_date (STRING)

request_time (STRING)

ip(STRING)

Answer:

hive> create table weblog_entries (md5 string,url string,request_date string,request_time string,ip string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.502 seconds

hive> load data local inpath '/root/weblog_entries.txt' into table weblog_entries;

Loading data to table default.weblog_entries

Table default.weblog_entries stats: [numFiles=1, totalSize=251130]

OK

Time taken: 1.203 seconds

hive> create table weblog_entries_url_length as select url, request_date, request_time, length(url) as url_length from weblog_entries;

Query ID = root_20170507065123_e3105d8b-84b6-417f-ab58-21ea15723e0a

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1494136863427_0002)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 4.10 s

--------------------------------------------------------------------------------

Moving data to: hdfs://master:8020/apps/hive/warehouse/weblog_entries_url_length

Table default.weblog_entries_url_length stats: [numFiles=1, numRows=3000, totalSize=121379, rawDataSize=118379]

OK

Time taken: 5.874 seconds

hive> select * from weblog_entries_url_length;

/qnrxlxqacgiudbtfggcg.html      2012-05-10      21:29:01        26

/sbbiuot.html   2012-05-10      21:13:47        13

/ofxi.html      2012-05-10      21:12:37        10

/hjmdhaoogwqhp.html     2012-05-10      21:34:20        19

/angjbmea.html  2012-05-10      21:27:00        14

/mmdttqsnjfifkihcvqu.html       2012-05-10      21:33:53        25

/eorxuryjadhkiwsf.html  2012-05-10      21:10:19        22

/e.html 2012-05-10      21:12:05        7

/khvc.html      2012-05-10      21:25:58        10

/c.html 2012-05-10      21:34:28        7

Time taken: 0.08 seconds, Fetched: 3000 row(s)

20. Install Sqoop Clients on the master and slaver nodes. When done, check the Sqoop version information on the master node.

Answer:

[root@master ~]# sqoop version

Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

17/05/07 06:56:25 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227

Sqoop 1.4.6.2.4.3.0-227

git commit id d296ad374bd38a1c594ef0f5a2d565d71e798aa6

Compiled by jenkins on Sat Sep 10 00:58:52 UTC 2016

21. Use Sqoop to list all tables in the ambari database in MySQL on the master node.

Answer:

[root@master ~]# sqoop list-tables --connect jdbc:mysql://localhost/ambari --username root --password bigdata

Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

17/05/07 07:07:01 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227

17/05/07 07:07:01 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

17/05/07 07:07:02 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

ClusterHostMapping

QRTZ_BLOB_TRIGGERS

QRTZ_CALENDARS

QRTZ_CRON_TRIGGERS

QRTZ_FIRED_TRIGGERS

QRTZ_JOB_DETAILS

QRTZ_LOCKS

QRTZ_PAUSED_TRIGGER_GRPS

QRTZ_SCHEDULER_STATE

QRTZ_SIMPLE_TRIGGERS

QRTZ_SIMPROP_TRIGGERS

QRTZ_TRIGGERS

adminpermission

adminprincipal

adminprincipaltype

adminprivilege

adminresource

adminresourcetype

alert_current

alert_definition

alert_group

alert_group_target

alert_grouping

alert_history

alert_notice

alert_target

alert_target_states

ambari_sequences

artifact

blueprint

blueprint_configuration

clusterEvent

cluster_version

clusterconfig

clusterconfigmapping

clusters

clusterservices

clusterstate

confgroupclusterconfigmapping

configgroup

configgrouphostmapping

execution_command

groups

hdfsEvent

host_role_command

host_version

hostcomponentdesiredstate

hostcomponentstate

hostconfigmapping

hostgroup

hostgroup_component

hostgroup_configuration

hosts

hoststate

job

kerberos_descriptor

kerberos_principal

kerberos_principal_host

key_value_store

mapreduceEvent

members

metainfo

repo_version

request

requestoperationlevel

requestresourcefilter

requestschedule

requestschedulebatchrequest

role_success_criteria

servicecomponentdesiredstate

serviceconfig

serviceconfighosts

serviceconfigmapping

servicedesiredstate

stack

stage

task

taskAttempt

topology_host_info

topology_host_request

topology_host_task

topology_hostgroup

topology_logical_request

topology_logical_task

topology_request

upgrade

upgrade_group

upgrade_item

users

viewentity

viewinstance

viewinstancedata

viewinstanceproperty

viewmain

viewparameter

viewresource

widget

widget_layout

widget_layout_user_widget

workflow

22. Create a database named xiandian in MySQL, and in it create the table xd_phy_course with the structure shown in Table 1. Use Hive to create the table xd_phy_course and load phy_course_xd.txt into it; its structure is shown in Table 2. Then use Sqoop to export the xd_phy_course table from the Hive data warehouse to the xd_phy_course table of the xiandian database in MySQL on the master node.

Table 1

stname VARCHAR(20)

stID INT(1)

class VARCHAR(20)

opt_cour VARCHAR(20)

Table 2

stname(string)

stID(int)

class(string)

opt_cour(string)

Answer:

[root@master ~]# mysql -uroot -pbigdata

Welcome to the MariaDB monitor.  Commands end with ; or \g.

Your MariaDB connection id is 37

Server version: 5.5.44-MariaDB MariaDB Server

Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> create database xiandian;

Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use xiandian;

Database changed

MariaDB [xiandian]> create table xd_phy_course(stname varchar(20),stID int(1),class varchar(20),opt_cour varchar(20));

Query OK, 0 rows affected (0.20 sec)

hive> create table xd_phy_course (stname string,stID int,class string,opt_cour string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 3.136 seconds

hive> load data local inpath '/root/phy_course_xd.txt' into table xd_phy_course;

Loading data to table default.xd_phy_course

Table default.xd_phy_course stats: [numFiles=1, totalSize=89444]

OK

Time taken: 1.129 seconds

[root@master ~]# sqoop export --connect jdbc:mysql://localhost:3306/xiandian --username root --password bigdata --table xd_phy_course  --hcatalog-table xd_phy_course

Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

17/05/07 07:29:48 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227

17/05/07 07:29:48 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

17/05/07 07:29:48 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

17/05/07 07:29:48 INFO tool.CodeGenTool: Beginning code generation

17/05/07 07:29:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `xd_phy_course` AS t LIMIT 1

17/05/07 07:29:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `xd_phy_course` AS t LIMIT 1

17/05/07 07:29:48 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.3.0-227/hadoop-mapreduce

Note: /tmp/sqoop-root/compile/35d4b31b4d93274ba6bde54b3e56a821/xd_phy_course.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

17/05/07 07:29:50 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/35d4b31b4d93274ba6bde54b3e56a821/xd_phy_course.jar

17/05/07 07:29:50 INFO mapreduce.ExportJobBase: Beginning export of xd_phy_course

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
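
A hedged sketch of a quick sanity check after the export, counting the rows that arrived in MySQL (this assumes the same root/bigdata credentials used above):

mysql -uroot -pbigdata -e "select count(*) from xiandian.xd_phy_course;"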

23. Use Pig in local mode to count the clicks per IP address in the system log access-log.txt. Group by IP with a GROUP BY statement, iterate over the columns of the relation with the FOREACH operator to count the rows in each group, and finally query the result with a DUMP statement.

Answer:

grunt> copyFromLocal /root/Pig/access-log.txt /user/root/input/log1.txt

grunt> A =LOAD '/user/root/input/log1.txt' USING PigStorage (' ') AS (ip,others);

grunt> group_ip =group A by ip;

grunt> result =foreach group_ip generate group,COUNT(A);

grunt> dump result;

2018-02-13 08:13:36,520 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.7.3.2.6.1.0-129 0.16.0.2.6.1.0-129 root 2018-02-13 08:13:37 2018-02-13 08:13:41 GROUP_BY

Success!

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs

job_local963723433_0001 1 1 n/a n/a n/a n/a n/a n/a n/a n/a A,group_ip,result GROUP_BY,COMBINER file:/tmp/temp-1479363025/tmp133834330,

Input(s):

Successfully read 62991 records from: "/user/root/input/log1.txt"

Output(s):

Successfully stored 182 records in: "file:/tmp/temp-1479363025/tmp133834330"

(220.181.108.186,1)

(222.171.234.225,142)

(http://www.1daoyun.com/course/toregeister",1)
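
The same statements can also be run as a batch script instead of interactively in grunt; a hedged sketch, assuming the statements are saved to a hypothetical file ipcount.pig and that -x local runs Pig in local mode as the task requires (in local mode the LOAD path refers to the local filesystem):

cat > ipcount.pig <<'EOF'
A = LOAD '/root/Pig/access-log.txt' USING PigStorage(' ') AS (ip, others);
group_ip = GROUP A BY ip;
result = FOREACH group_ip GENERATE group, COUNT(A);
DUMP result;
EOF
pig -x local ipcount.pig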

24. Use Pig to compute the annual maximum temperature in the weather data set temperature.txt. Group by year with a GROUP BY statement, iterate over the columns of the relation with the FOREACH operator to obtain the maximum of each group, and finally query the result with a DUMP statement.

Answer:

grunt> copyFromLocal /root/Pig/temperature.txt /user/root/temp.txt

grunt> A = LOAD '/user/root/temp.txt' USING PigStorage(' ')AS (year:int,temperature:int);

grunt> B = GROUP A BY year;

grunt> C = FOREACH B GENERATE group,MAX(A.temperature);

grunt> dump C;

2018-02-13 08:18:52,107 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY

(2012,40)

(2013,36)

(2014,37)

(2015,39)

25. Use Pig to count the number of IP addresses per country in the data set ip_to_country. Group by country with a GROUP BY statement, iterate over the columns of the relation with the FOREACH operator to count the IP addresses in each group, save the result to the /data/pig/output directory, and view the resulting data.

Answer:

grunt> copyFromLocal /root/Pig/ip_to_country.txt /user/root/ip_to_country.txt

grunt> ip_countries = LOAD '/user/root/ip_to_country.txt' AS (ip: chararray, country:chararray);

grunt> country_grpd = GROUP ip_countries BY country;

grunt> country_counts = FOREACH country_grpd GENERATE FLATTEN(group),COUNT(ip_countries) as counts;

grunt> STORE country_counts INTO '/data/pig/output';

2018-02-13 08:23:35,621 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY

Moldova, Republic of 1

Syrian Arab Republic 1

United Arab Emirates 2

Bosnia and Herzegovina 1

Iran, Islamic Republic of 2

Tanzania, United Republic of 1
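
Because STORE writes part files rather than printing to the console, the saved result can be inspected afterwards; a hedged sketch for the case where the job ran against HDFS (inside grunt, cat /data/pig/output would show the same content):

hadoop fs -cat /data/pig/output/part-r-00000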

26. Install the Mahout Client on the master node, open a Linux shell, and run the mahout command to view the example programs bundled with Mahout.

Answer:

[root@master ~]# mahout

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.

Running on hadoop, using /usr/hdp/2.6.1.0-129/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.6.1.0-129/hadoop/conf

MAHOUT-JOB: /usr/hdp/2.6.1.0-129/mahout/mahout-examples-0.9.0.2.6.1.0-129-job.jar

An example program must be given as the first argument.

Valid program names are:

arff.vector: : Generate Vectors from an ARFF file or directory

baumwelch: : Baum-Welch algorithm for unsupervised HMM training

buildforest: : Build the random forest classifier

canopy: : Canopy clustering

cat: : Print a file or resource as the logistic regression models would see it

cleansvd: : Cleanup and verification of SVD output

clusterdump: : Dump cluster output to text

clusterpp: : Groups Clustering Output In Clusters

cmdump: : Dump confusion matrix in HTML or text formats

concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix

cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)

cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.

describe: : Describe the fields and target variable in a data set

evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes

fkmeans: : Fuzzy K-means clustering

hmmpredict: : Generate random sequence of observations by given HMM

itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering

kmeans: : K-means clustering

lucene.vector: : Generate Vectors from a Lucene index

lucene2seq: : Generate Text SequenceFiles from a Lucene index

matrixdump: : Dump matrix in CSV format

matrixmult: : Take the product of two matrices

parallelALS: : ALS-WR factorization of a rating matrix

qualcluster: : Runs clustering experiments and summarizes results in a CSV

recommendfactorized: : Compute recommendations using the factorization of a rating matrix

recommenditembased: : Compute recommendations using item-based collaborative filtering

regexconverter: : Convert text files on a per line basis based on regular expressions

resplit: : Splits a set of SequenceFiles into a number of equal splits

rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}

rowsimilarity: : Compute the pairwise similarities of the rows of a matrix

runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model

runlogistic: : Run a logistic regression model against CSV data

seq2encoded: : Encoded Sparse Vector generation from Text sequence files

seq2sparse: : Sparse Vector generation from Text sequence files

seqdirectory: : Generate sequence files (of Text) from a directory

seqdumper: : Generic Sequence File dumper

seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives

seqwiki: : Wikipedia xml dump to sequence file

spectralkmeans: : Spectral k-means clustering

split: : Split Input data into test and train sets

splitDataset: : split a rating dataset into training and probe parts

ssvd: : Stochastic SVD

streamingkmeans: : Streaming k-means clustering

svd: : Lanczos Singular Value Decomposition

testforest: : Test the random forest classifier

testnb: : Test the Vector-based Bayes classifier

trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model

trainlogistic: : Train a logistic regression using stochastic gradient descent

trainnb: : Train the Vector-based Bayes classifier

transpose: : Take the transpose of a matrix

validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set

vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors

vectordump: : Dump vectors from a sequence file to text

viterbi: : Viterbi decoding of hidden states from given output states sequence

27. Use the Mahout mining tool to make item recommendations from the data set user-item-score.txt (user, item, score). Use the item-based collaborative filtering algorithm with the Euclidean distance similarity, recommend 3 items per user, use non-Boolean data with a maximum preference value of 4 and a minimum preference value of 1, save the recommendation output to the output directory, and query the contents of part-r-00000 with the -cat command.

Answer:

[hdfs@master ~]$ hadoop fs -mkdir -p /data/mahout/project

[hdfs@master ~]$ hadoop fs -put user-item-score.txt /data/mahout/project

[hdfs@master ~]$ mahout recommenditembased -i /data/mahout/project/user-item-score.txt -o /data/mahout/project/output -n 3 -b false -s SIMILARITY_EUCLIDEAN_DISTANCE --maxPrefsPerUser 4 --minPrefsPerUser 1 --maxPrefsInItemSimilarity 4 --tempDir /data/mahout/project/temp

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.

Running on hadoop, using /usr/hdp/2.4.3.0-227/hadoop/bin/hadoop and

17/05/15 19:37:25 INFO driver.MahoutDriver: Program took 259068 ms (Minutes: 4.3178)

[hdfs@master ~]$ hadoop fs -cat /data/mahout/project/output/part-r-00000

1       [105:3.5941463,104:3.4639049]

2       [106:3.5,105:2.714964,107:2.0]

3       [103:3.59246,102:3.458911]

4       [107:4.7381864,105:4.2794304,102:4.170158]

5       [103:3.8962872,102:3.8564017,107:3.7692602]

28. Install and start the Flume component on the master node, open a Linux shell, and run the flume-ng help command to view the usage information of flume-ng.

Answer:

[root@master ~]# flume-ng help

Usage: /usr/hdp/2.6.1.0-129/flume/bin/flume-ng.distro <command> [options]...

commands:

help                  display this help text

agent                 run a Flume agent

avro-client           run an avro Flume client

password              create a password file for use in flume config

version               show Flume version info

global options:

--conf,-c <conf>      use configs in <conf> directory

--classpath,-C <cp>   append to the classpath

--dryrun,-d           do not actually start Flume, just print the command

--plugins-path <dirs> colon-separated list of plugins.d directories. See the

plugins.d section in the user guide for more details.

Default: $FLUME_HOME/plugins.d

-Dproperty=value      sets a Java system property value

-Xproperty=value      sets a Java -X option

agent options:

--conf-file,-f <file> specify a config file (required)

--name,-n <name>      the name of this agent (required)

--help,-h             display help text

avro-client options:

--rpcProps,-P <file>   RPC client properties file with server connection params

--host,-H <host>       hostname to which events will be sent

--port,-p <port>       port of the avro source

--dirname <dir>        directory to stream to avro source

--filename,-F <file>   text file to stream to avro source (default: std input)

--headerFile,-R <file> File containing event headers as key/value pairs on each new line

--help,-h              display help text

Either --rpcProps or both --host and --port must be specified.

password options:

--outfile              The file in which encoded password is stored

Note that if <conf> directory is specified, then it is always included first

in the classpath.

29. Based on the provided hdfs-example.conf template, use Flume NG to watch the system path /opt/xiandian/ on the master node and upload new files to the HDFS file system in real time, storing them under /data/flume/ with the original file names kept and the file type set to DataStream, then start the flume-ng agent.

Answer:

[root@master ~]# flume-ng agent --conf-file hdfs-example.conf --name master -Dflume.root.logger=INFO,console

Warning: No configuration directory set! Use --conf <dir> to override.

Info: Including Hadoop libraries found via (/bin/hadoop) for HDFS access

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-api-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/tez/lib/slf4j-api-1.7.5.jar from classpath

Info: Including HBASE libraries found via (/bin/hbase) for HBASE access

Info: Excluding /usr/hdp/2.4.3.0-227/hbase/lib/slf4j-api-1.7.7.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-api-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/tez/lib/slf4j-api-1.7.5.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-api-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-api-1.6.1.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-log4j12-1.6.1.jar from classpath

Info: Including Hive libraries found via () for Hive access

[root@master ~]# cat hdfs-example.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent

master.sources = webmagic

master.sinks = k1

master.channels = c1

# Describe/configure the source

master.sources.webmagic.type = spooldir

master.sources.webmagic.fileHeader = true

master.sources.webmagic.fileHeaderKey = fileName

master.sources.webmagic.fileSuffix = .COMPLETED

master.sources.webmagic.deletePolicy = never

master.sources.webmagic.spoolDir = /opt/xiandian/

master.sources.webmagic.ignorePattern = ^$

master.sources.webmagic.consumeOrder = oldest

master.sources.webmagic.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

master.sources.webmagic.batchsize = 5

master.sources.webmagic.channels = c1

# Use a channel which buffers events in memory

master.channels.c1.type = memory

# Describe the sink

master.sinks.k1.type = hdfs

master.sinks.k1.channel = c1

master.sinks.k1.hdfs.path = hdfs://master:8020/data/flume/%{dicName}

master.sinks.k1.hdfs.filePrefix = %{fileName}

master.sinks.k1.hdfs.fileType = DataStream
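
A hedged sketch of how the running agent can be exercised: dropping a file into the spool directory should cause Flume to rename it locally with the .COMPLETED suffix and to write a copy under /data/flume/ in HDFS (test.txt is a hypothetical file used only for the test):

# feed the spooling-directory source
cp /root/test.txt /opt/xiandian/
# the local file gains a .COMPLETED suffix once it has been consumed
ls /opt/xiandian/
# verify the upload on the HDFS side
hadoop fs -ls -R /data/flume/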

30. Deploy the Spark service component on the XianDian big data platform, open a Linux shell, start the spark-shell terminal, and submit the process information of the started program.

Answer:

[root@master ~]# spark-shell

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Spark context Web UI available at http://172.24.2.110:4040

Spark context available as 'sc' (master = local[*], app id = local-1519375873795).

Spark session available as 'spark'.

Welcome to

____              __

/ __/__  ___ _____/ /__

_\ \/ _ \/ _ `/ __/  '_/

/___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.1.0-129

/_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)

Type in expressions to have them evaluated.

Type :help for more information.

scala>

31. Log in to spark-shell, define i as 1 and sum as 0, use a while loop to add the numbers from 1 to 100, and finally print sum with Scala's standard output function.

Answer:

scala> var i=1

i: Int = 1

scala> var sum=0

sum: Int = 0

scala> while(i<=100){

| sum+=i

| i=i+1

| }

scala> println(sum)

5050

32. Log in to spark-shell, define a list (1,2,3,4,5,6,7,8,9), and use the map function to multiply every element of the list by 2.

Answer:

scala> import scala.math._

import scala.math._

scala> val nums=List(1,2,3,4,5,6,7,8,9)

nums: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> nums.map(x=>x*2)

res18: List[Int] = List(2, 4, 6, 8, 10, 12, 14, 16, 18)

33. Log in to spark-shell, define a list ("Hadoop","Java","Spark"), and use the flatMap function to break the list into individual letters converted to uppercase.

Answer:

scala> val data = List("Hadoop","Java","Spark")

data: List[String] = List(Hadoop, Java, Spark)

scala> println(data.flatMap(_.toUpperCase))

List(H, A, D, O, O, P, J, A, V, A, S, P, A, R, K)

34. Log in to the master node of the big data cloud host and create a new file abc.txt under the root directory with the following contents:

hadoop  hive

solr    redis

kafka   hadoop

storm   flume

sqoop   docker

spark   spark

hadoop  spark

elasticsearch   hbase

hadoop  hive

spark   hive

hadoop  spark

Then log in to spark-shell. First count the number of lines in abc.txt, then count the words in the file, sorting them in ascending order by their first letters, and finally count the number of rows in the result.

Answer:

scala> val words=sc.textFile("file:///root/abc.txt").count

words: Long = 11

scala> val words=sc.textFile("file:///root/abc.txt").flatMap(_.split("\\W+")).map(x=>(x,1)).reduceByKey(_+_).sortByKey().collect

words: Array[(String, Int)] = Array((docker,1), (elasticsearch,1), (flume,1), (hadoop,5), (hbase,1), (hive,3), (kafka,1), (redis,1), (solr,1), (spark,5), (sqoop,1), (storm,1))

scala> val words=sc.textFile("file:///root/abc.txt").flatMap(_.split("\\W+")).map(x=>(x,1)).reduceByKey(_+_).count

words: Long = 12
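
The same line count and word count can be run non-interactively by feeding the statements into spark-shell from a script; a hedged sketch, assuming a hypothetical file wordcount.scala whose contents are redirected into the shell's standard input:

cat > wordcount.scala <<'EOF'
// line count, then word counts sorted by key, then the number of distinct words
val lines = sc.textFile("file:///root/abc.txt")
println(lines.count)
val counts = lines.flatMap(_.split("\\W+")).map(w => (w, 1)).reduceByKey(_ + _).sortByKey()
counts.collect.foreach(println)
println(counts.count)
EOF
spark-shell < wordcount.scala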

35. Log in to spark-shell, define a List(1,2,3,3,4,4,5,5,6,6,6,8,9), and use the built-in function to remove duplicates from this list.

Answer:

scala> val l = List(1,2,3,3,4,4,5,5,6,6,6,8,9)

l: List[Int] = List(1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 8, 9)

scala> l.distinct

res1: List[Int] = List(1, 2, 3, 4, 5, 6, 8, 9)
