Documentation for Built-In User-Defined Functions Related To XPath

UDFs

xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string, xpath_boolean

  • Functions for parsing XML data using XPath expressions.
  • Since version: 0.6.0

Overview

The xpath family of UDFs are wrappers around the Java XPath library javax.xml.xpath provided by the JDK. The library is based on the XPath 1.0 specification. Please refer to http://java.sun.com/javase/6/docs/api/javax/xml/xpath/package-summary.html for detailed information on the Java XPath library.

All functions follow the form xpath_*(xml_string, xpath_expression_string). The XPath expression string is compiled and cached, and the compiled form is reused when the expression in the next input row matches the previous one; otherwise it is recompiled. The XML string is therefore parsed for every input row, while the XPath expression is precompiled and reused in the vast majority of use cases.
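The compile-and-cache behavior described above can be sketched directly against javax.xml.xpath. This is a minimal illustration, not Hive's actual implementation: the class name CachedXPathEvaluator, the method evalString and the compileCount field are hypothetical names introduced here.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not Hive's actual implementation.
class CachedXPathEvaluator {
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private String lastPath;           // expression text from the previous row
    private XPathExpression compiled;  // compiled form of lastPath
    int compileCount = 0;              // exposed only to make the caching observable

    String evalString(String xml, String path) {
        try {
            if (!path.equals(lastPath)) {  // recompile only when the expression changes
                compiled = xpath.compile(path);
                lastPath = path;
                compileCount++;
            }
            // The XML string is parsed again on every call.
            InputSource source = new InputSource(new StringReader(xml));
            return (String) compiled.evaluate(source, XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Calling evalString twice with the same expression compiles it once; a different expression on a later call triggers a recompile.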

Backward axes are supported. For example:

> select xpath ('<a><b id="1"><c/></b><b id="2"><c/></b></a>','/descendant::c/ancestor::b/@id') from t1 limit 1 ;
["1","2"]

Each function returns a specific Hive type given the XPath expression:

  • xpath returns a Hive array of strings.
  • xpath_string returns a string.
  • xpath_boolean returns a boolean.
  • xpath_short returns a short integer.
  • xpath_int returns an integer.
  • xpath_long returns a long integer.
  • xpath_float returns a floating point number.
  • xpath_double and xpath_number return a double-precision floating point number (xpath_number is an alias for xpath_double).

The UDFs are schema agnostic: no XML validation is performed. However, malformed XML (e.g., <a><b>1</b></aa>) results in a runtime exception.
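The difference between unvalidated and malformed input can be seen in the underlying Java library. The sketch below assumes plain javax.xml.xpath rather than Hive itself; MalformedXmlDemo and evaluates are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Hive.
class MalformedXmlDemo {
    // Returns true if the XML parses and the expression evaluates, false otherwise.
    static boolean evaluates(String xml, String path) {
        try {
            XPathFactory.newInstance().newXPath()
                .evaluate(path, new InputSource(new StringReader(xml)), XPathConstants.STRING);
            return true;
        } catch (XPathExpressionException e) {
            return false;  // XML parse errors surface wrapped in XPathExpressionException
        }
    }
}
```

Any well-formed document is accepted regardless of its element names, while the mismatched closing tag in <a><b>1</b></aa> makes evaluation fail.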

Following are specifics on each xpath UDF variant.

xpath

The xpath() function always returns a Hive array of strings. If the expression results in a non-text value (e.g., another XML node), the function returns an empty array. There are two primary uses for this function: to get a list of node text values, or to get a list of attribute values.

Examples:

Expression that yields only element nodes (non-text values):

> select xpath('<a><b>b1</b><b>b2</b></a>','a/*') from src limit 1 ;
[]

Get a list of node text values:

> select xpath('<a><b>b1</b><b>b2</b></a>','a/*/text()') from src limit 1 ;
["b1","b2"]

Get a list of values for attribute 'id':

> select xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>','//@id') from src limit 1 ;
["foo","bar"]

Get a list of node texts for nodes where the 'class' attribute equals 'bb':

> SELECT xpath ('<a><b class="bb">b1</b><b>b2</b><b>b3</b><c class="bb">c1</c><c>c2</c></a>', 'a/*[@class="bb"]/text()') FROM src LIMIT 1 ;
["b1","c1"]
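The results above can be reproduced against javax.xml.xpath by evaluating the expression as a node set and keeping only nodes that carry a value. This is a sketch under the assumption that element nodes (whose node value is null) are skipped, which matches the empty array shown for 'a/*'; XPathNodeValues and eval are hypothetical names, not Hive code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not Hive's actual implementation.
class XPathNodeValues {
    static List<String> eval(String xml, String path) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(path, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Text and attribute nodes have a value; element nodes return null.
                String value = nodes.item(i).getNodeValue();
                if (value != null) {
                    out.add(value);
                }
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With the input <a><b>b1</b><b>b2</b></a>, the expression 'a/*' gives an empty list and 'a/*/text()' gives b1 and b2.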

xpath_string

The xpath_string() function returns the text of the first matching node.

Get the text for node 'a/b':

> SELECT xpath_string ('<a><b>bb</b><c>cc</c></a>', 'a/b') FROM src LIMIT 1 ;
bb

Get the text for node 'a'. Because 'a' has child nodes with text, the result is a composite of the text from the children.

> SELECT xpath_string ('<a><b>bb</b><c>cc</c></a>', 'a') FROM src LIMIT 1 ;
bbcc

Non-matching expression returns an empty string:

> SELECT xpath_string ('<a><b>bb</b><c>cc</c></a>', 'a/d') FROM src LIMIT 1 ;

Gets the text of the first node that matches '//b':

> SELECT xpath_string ('<a><b>b1</b><b>b2</b></a>', '//b') FROM src LIMIT 1 ;
b1

Gets the second matching node:

> SELECT xpath_string ('<a><b>b1</b><b>b2</b></a>', 'a/b[2]') FROM src LIMIT 1 ;
b2

Gets the text from the first node that has an attribute 'id' with value 'b_2':

> SELECT xpath_string ('<a><b>b1</b><b id="b_2">b2</b></a>', 'a/b[@id="b_2"]') FROM src LIMIT 1 ;
b2
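The string results above follow from XPath 1.0 string conversion: a node set converts to the string value of its first node in document order, and an element's string value is the concatenation of its descendant text. A minimal sketch with plain javax.xml.xpath (XPathStringDemo and eval are hypothetical names):

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not Hive's actual implementation.
class XPathStringDemo {
    static String eval(String xml, String path) {
        try {
            // STRING asks the library to apply the XPath string() conversion to the result.
            return (String) XPathFactory.newInstance().newXPath()
                .evaluate(path, new InputSource(new StringReader(xml)), XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For <a><b>bb</b><c>cc</c></a>, 'a/b' yields bb, 'a' yields bbcc, and the non-matching 'a/d' yields the empty string.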

xpath_boolean

Returns true if the XPath expression evaluates to true, or if a matching node is found.

Match found:

> SELECT xpath_boolean ('<a><b>b</b></a>', 'a/b') FROM src LIMIT 1 ;
true

No match found:

> SELECT xpath_boolean ('<a><b>b</b></a>', 'a/c') FROM src LIMIT 1 ;
false

Comparison evaluates to true:

> SELECT xpath_boolean ('<a><b>b</b></a>', 'a/b = "b"') FROM src LIMIT 1 ;
true

Comparison evaluates to false:

> SELECT xpath_boolean ('<a><b>10</b></a>', 'a/b < 10') FROM src LIMIT 1 ;
false
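Both cases (node existence and comparison) come from the XPath 1.0 boolean conversion, where a non-empty node set is true and an empty one is false. A minimal sketch with plain javax.xml.xpath (XPathBooleanDemo and eval are hypothetical names):

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not Hive's actual implementation.
class XPathBooleanDemo {
    static boolean eval(String xml, String path) {
        try {
            // BOOLEAN asks the library to apply the XPath boolean() conversion to the result.
            return (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate(path, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For <a><b>b</b></a>, 'a/b' is true and 'a/c' is false; for <a><b>10</b></a>, 'a/b < 10' is false because 10 is not less than 10.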

xpath_short, xpath_int, xpath_long

These functions return an integer value. They return zero if no match is found, or if a match is found but the value is non-numeric.
Mathematical operations are supported. If the value overflows the return type, the maximum value for the type is returned.

No match:

> SELECT xpath_int ('<a>b</a>', 'a = 10') FROM src LIMIT 1 ;
0

Non-numeric match:

> SELECT xpath_int ('<a>this is not a number</a>', 'a') FROM src LIMIT 1 ;
0
> SELECT xpath_int ('<a>this 2 is not a number</a>', 'a') FROM src LIMIT 1 ;
0

Adding values:

> SELECT xpath_int ('<a><b class="odd">1</b><b class="even">2</b><b class="odd">4</b><c>8</c></a>', 'sum(a/*)') FROM src LIMIT 1 ;
15
> SELECT xpath_int ('<a><b class="odd">1</b><b class="even">2</b><b class="odd">4</b><c>8</c></a>', 'sum(a/b)') FROM src LIMIT 1 ;
7
> SELECT xpath_int ('<a><b class="odd">1</b><b class="even">2</b><b class="odd">4</b><c>8</c></a>', 'sum(a/b[@class="odd"])') FROM src LIMIT 1 ;
5

Overflow:

> SELECT xpath_int ('<a><b>2000000000</b><c>40000000000</c></a>', 'a/b * a/c') FROM src LIMIT 1 ;
2147483647
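The zero and overflow results shown for xpath_int are consistent with evaluating the expression as an XPath number (a double) and narrowing it with a Java cast, which maps NaN to 0 and clamps out-of-range values. The sketch below assumes that approach and covers only int and long; XPathIntegerDemo and its methods are hypothetical names, not Hive code.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not Hive's actual implementation.
class XPathIntegerDemo {
    static double evalNumber(String xml, String path) {
        try {
            // NUMBER asks the library to apply the XPath number() conversion to the result.
            return (Double) XPathFactory.newInstance().newXPath()
                .evaluate(path, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Java's double-to-int cast turns NaN into 0 and clamps large values to Integer.MAX_VALUE.
    static int evalInt(String xml, String path) {
        return (int) evalNumber(xml, path);
    }

    // The double-to-long cast behaves the same way with Long.MAX_VALUE.
    static long evalLong(String xml, String path) {
        return (long) evalNumber(xml, path);
    }
}
```

With <a><b>2000000000</b><c>40000000000</c></a> and 'a/b * a/c', evalInt returns 2147483647, and a non-numeric value such as <a>this is not a number</a> with 'a' returns 0.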

xpath_float, xpath_double, xpath_number

These functions are similar to xpath_short, xpath_int and xpath_long but use floating point semantics. Non-matches result in zero, whereas non-numeric matches result in NaN. Note that xpath_number() is an alias for xpath_double().

No match:

> SELECT xpath_double ('<a>b</a>', 'a = 10') FROM src LIMIT 1 ;
0.0

Non-numeric match:

> SELECT xpath_double ('<a>this is not a number</a>', 'a') FROM src LIMIT 1 ;
NaN

A very large number:

> SELECT xpath_double ('<a><b>2000000000</b><c>40000000000</c></a>', 'a/b * a/c') FROM src LIMIT 1 ;
8.0E19
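The three results above match the XPath 1.0 number conversion: a false comparison converts to 0, text that is not a number converts to NaN, and arithmetic is carried out in double precision. A minimal sketch with plain javax.xml.xpath (XPathDoubleDemo and eval are hypothetical names):

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not Hive's actual implementation.
class XPathDoubleDemo {
    static double eval(String xml, String path) {
        try {
            // No narrowing here: the XPath number is returned as a double unchanged.
            return (Double) XPathFactory.newInstance().newXPath()
                .evaluate(path, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Unlike the integer variants, NaN is preserved rather than collapsed to zero, and the product 8.0E19 is returned without clamping.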
