Apache Ranger Series, Part 7: Configuring the File Paths Used During Hive and Spark Execution

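Background: when Ranger is used for authorization, impersonation must be enabled, that is, a job runs as the user who submitted it rather than as a shared hive/spark proxy user. During execution, however, temporary files have to be stored on HDFS, and this is where permission errors tend to appear. To deal with them, we need to understand how these paths are generated and used.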

Path analysis

The exception below is raised from DagUtils.java; the stack trace is not pasted in full.

ERROR : Failed to execute tez graph.
Caused by: org.apache.hadoop.ipc.RemoteException: Permission denied: user=your_name, access=WRITE, inode="/user":hdfs:hdfsadmingroup:drwxr-xr-x
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:399)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:255)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:193)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1879)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1863)

Source code analysis

UserGroupInformation ugi = Utils.getUGI();
String userName = ugi.getShortUserName();
String userPathStr = HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_USER_INSTALL_DIR);
Path userPath = new Path(userPathStr);
FileSystem fs = userPath.getFileSystem(conf);
String jarPathStr = userPathStr + "/" + userName;

Because HiveConf.ConfVars.HIVE_USER_INSTALL_DIR is not set explicitly, the default defined in HiveConf applies:

HIVE_JAR_DIRECTORY("hive.jar.directory", null,
        "This is the location hive in tez mode will look for to find a site wide \n" +
        "installed hive instance."),
HIVE_USER_INSTALL_DIR("hive.user.install.directory", "/user/",
        "If hive (in tez mode only) cannot find a usable hive jar in \"hive.jar.directory\", \n" +
        "it will upload the hive jar to \"hive.user.install.directory/user.name\"\n" +
        "and use it to run queries."),

So with the default value, jarPathStr is effectively "/user/<user.name>", which is why the write under /user is denied.

This matches the official documentation:
https://cwiki.apache.org/confluence/display/hive/configuration+properties#ConfigurationProperties-hive.user.install.directory

hive.user.install.directory
- Default Value: hdfs:///user/
- Added In: Hive 0.13.0 with HIVE-5003 and HIVE-6098
If Hive (in Tez mode only) cannot find a usable Hive jar in hive.jar.directory, it will upload the Hive jar to <hive.user.install.directory>/<user_name> and use it to run queries.
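
To make the rule concrete, here is a minimal shell sketch that mirrors the string concatenation in the DagUtils excerpt. The function name hive_jar_dir is hypothetical and not part of Hive; it assumes a scheme-less path such as /user/ (not hdfs:///user/), and collapsing repeated slashes with sed only approximates how the path ends up looking on HDFS.

```shell
# Hypothetical helper (not part of Hive): mirrors
#   jarPathStr = userPathStr + "/" + userName
# and collapses repeated slashes so the result is easy to read.
hive_jar_dir() {
  install_dir=$1   # value of hive.user.install.directory (scheme-less)
  user_name=$2     # short name of the impersonated user
  printf '%s/%s\n' "$install_dir" "$user_name" | sed 's#//*#/#g'
}

hive_jar_dir /user/ your_name              # default value
hive_jar_dir /user/ranger/hive/ your_name  # directory configured later in this post
```

With the default value the target lies directly under /user, which the impersonated user cannot write to in the error shown earlier; pointing the parameter at a directory the user can write to avoids that.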

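Similarly, when Spark runs, it uploads files to a .sparkStaging/<applicationId> directory on HDFS, which by default sits under the submitting user's home directory: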

hdfs://yourcluster/user/your_username

This matches the official documentation: https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties

spark.yarn.stagingDir
- Default Value: Current user's home directory in the filesystem
Staging directory used while submitting applications.
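
As a rough illustration of the default case, the shell sketch below assembles the per-application staging path from a base directory and an application ID. The function name spark_staging_path is hypothetical and not part of Spark, the home directory /user/your_username is taken from the example above, and the application ID is a made-up placeholder.

```shell
# Hypothetical helper (not part of Spark): builds
#   <base_dir>/.sparkStaging/<applicationId>
# for the default case where base_dir is the user's home directory.
spark_staging_path() {
  base_dir=$1   # e.g. the submitting user's HDFS home directory
  app_id=$2     # YARN application ID (placeholder value below)
  # Strip one trailing slash so the result has no doubled separator.
  printf '%s/.sparkStaging/%s\n' "${base_dir%/}" "$app_id"
}

spark_staging_path /user/your_username application_1700000000000_0001
```

If the submitting user has no writable home directory, creating this path fails in the same way as the Hive case.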

Path configuration

hadoop fs -mkdir -p /user/ranger/hive/
hadoop fs -chmod -R 777 /user/ranger/hive/
hadoop fs -mkdir -p /user/ranger/spark/staging/
hadoop fs -chmod -R 777 /user/ranger/spark/staging/

The hive.user.install.directory parameter in hive-site.xml defines the HDFS path.

Run sudo vi /etc/hive/conf/hive-site.xml and add the following:

<property>
  <name>hive.user.install.directory</name>
  <value>/user/ranger/hive/</value>
</property>

After saving, restart the service:

sudo systemctl restart hive-server2.service

sudo vi /etc/spark/conf/spark-defaults.conf

Add:
spark.yarn.stagingDir /user/ranger/spark/staging

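No restart is needed: the setting is read each time an application is submitted to YARN, so it takes effect on the next submission.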
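
To double-check what a submission will pick up, a small shell sketch like the following can print the configured value. The function name spark_conf_value is hypothetical, and it assumes the whitespace-separated "key value" layout used in the line added above; entries written as key=value would not be matched.

```shell
# Hypothetical helper: print the value of a key from a
# spark-defaults.conf-style file with whitespace-separated entries.
spark_conf_value() {
  key=$1
  file=$2
  # If the key appears more than once, the last entry is reported.
  awk -v k="$key" '$1 == k { v = $2 } END { if (v != "") print v }' "$file"
}

# Usage (file path taken from the step above):
# spark_conf_value spark.yarn.stagingDir /etc/spark/conf/spark-defaults.conf
```

An empty result means the key is not set in that file, in which case the default (the user's home directory) applies.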