Implementing LSH (Locality Sensitive Hash) in Java

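When processing images in bulk, the amount of data grows quickly. Suppose I extract SIFT features from a data set of 100,000 images: each SIFT keypoint is a 128-dimensional vector, and each image yields about 500 keypoints, so we end up working on 50 million 128-dimensional vectors. Processing all of them takes far too long for real production use. We need some way to cut down the amount of computation, for example dimensionality reduction.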

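After reading a number of papers, the method I saw mentioned most often was LSH (Locality Sensitive Hash). With LSH we filter a very small number of relevant feature points out of the 50 million, and then run the actual computation only on that small set to get the result we want.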
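The filter-then-verify idea can be sketched in memory before looking at the full implementation. This is only an illustration under stated assumptions: the class `MiniLsh`, its method names, the fixed seed and the in-memory `HashMap` tables are all hypothetical and not part of the code in this post, which stores its buckets through a DAO. The sketch samples bits of a unary encoding to form bucket keys, collects the candidates that share a bucket with the query, and then compares exact L1 distances only among those candidates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical in-memory illustration; none of these names come from the original code.
final class MiniLsh {
    private final int max;        // upper bound of each vector element
    private final int[][] family; // family[t][b] = sampled bit index for table t
    private final List<Map<String, List<Integer>>> tables = new ArrayList<>();
    private final List<int[]> vectors = new ArrayList<>();

    MiniLsh(int dimension, int max, int hashCount, int bitCount, long seed) {
        this.max = max;
        Random rd = new Random(seed);
        family = new int[hashCount][bitCount];
        for (int t = 0; t < hashCount; t++) {
            for (int b = 0; b < bitCount; b++) {
                family[t][b] = rd.nextInt(dimension * max);
            }
            tables.add(new HashMap<>());
        }
    }

    // Bit p of the unary code belongs to element p / max and is 1 iff (p % max) < value,
    // so the sampled bits can be read without materializing the whole 0/1 array.
    private String key(int[] v, int table) {
        StringBuilder sb = new StringBuilder();
        for (int p : family[table]) {
            sb.append(p % max < v[p / max] ? '1' : '0');
        }
        return sb.toString();
    }

    // Stores the vector in every table and returns its id (its insertion index).
    int add(int[] v) {
        int id = vectors.size();
        vectors.add(v.clone());
        for (int t = 0; t < tables.size(); t++) {
            tables.get(t).computeIfAbsent(key(v, t), k -> new ArrayList<>()).add(id);
        }
        return id;
    }

    // Union of the buckets the query falls into, one bucket per table.
    Set<Integer> candidates(int[] q) {
        Set<Integer> ids = new LinkedHashSet<>();
        for (int t = 0; t < tables.size(); t++) {
            ids.addAll(tables.get(t).getOrDefault(key(q, t), Collections.emptyList()));
        }
        return ids;
    }

    // Exact re-ranking by L1 distance: only the candidates are compared.
    int nearest(int[] q) {
        int best = -1;
        long bestDist = Long.MAX_VALUE;
        for (int id : candidates(q)) {
            int[] v = vectors.get(id);
            long d = 0;
            for (int i = 0; i < q.length; i++) {
                d += Math.abs(v[i] - q[i]);
            }
            if (d < bestDist) {
                bestDist = d;
                best = id;
            }
        }
        return best; // -1 when no bucket matched
    }

    public static void main(String[] args) {
        MiniLsh lsh = new MiniLsh(4, 255, 6, 8, 42L);
        int a = lsh.add(new int[] {10, 200, 30, 40});
        lsh.add(new int[] {250, 5, 240, 3});
        // An identical vector produces identical keys, so it is always a candidate.
        System.out.println(lsh.nearest(new int[] {10, 200, 30, 40}) == a); // prints "true"
    }
}
```

The exact comparison in `nearest` is the "compute on the small set" step: its cost depends on the number of candidates, not on the size of the whole data set.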

 package com.demo.lsh;

 import com.demo.config.Constant;
 import com.demo.dao.HashTableDao;
 import com.demo.entity.HashTable;
 import com.demo.utils.MD5Util;
 import org.springframework.util.StringUtils;

 import java.io.*;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
 import java.util.*;

 public class LSH {

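     // Dimensionality of each vector, e.g. 128 for SIFT features
     private int dimention = Constant.DIMENTION;

     // Upper bound of each vector element, e.g. 255 for RGB values
     private int max = Constant.MAX;

     // Number of hash tables. A stored vector only has to share a bucket with the query
     // in one table, so more tables mean fewer missed neighbors (false negatives),
     // at the cost of more candidates to check
     private int hashCount = Constant.HASHCOUNT;

     // Number of bits LSH samples at random for each table. The smaller this value, the
     // more tolerant the approximate lookup, but the more false positives it returns;
     // as it approaches size, the lookup degenerates from approximate to exact matching
     private int bitCount = Constant.BITCOUNT;

     // Length of the vector once converted to a 0/1 array: dimention * max
     private int size = dimention * max;

     // LSH hash family: for each table, the indices of the randomly sampled bits
     private int[][] hashFamily;

     private HashTableDao hashTableDao;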

     /**
      * Constructor. The parameters are already set from Constant by the field
      * initializers, so only the DAO and the hash family are set up here.
      */
     public LSH(HashTableDao hashTableDao) {
         this.hashTableDao = hashTableDao;
         hashFamily = new int[hashCount][bitCount];
         generataHashFamily();
     }

     /**
      * Generates the random projection points on the first run and loads them on later
      * runs. A projection point is an index into the 0/1 array built by unAray.
      */
     private void generataHashFamily() {
         File file = new File("/home/fanxuan/data/1.txt");
         if (file.exists()) {
             // Reuse the saved family so that bucket keys stay the same across runs
             try (ObjectInputStream oin = new ObjectInputStream(new FileInputStream(file))) {
                 hashFamily = (int[][]) oin.readObject();
             } catch (IOException | ClassNotFoundException e) {
                 // Continuing with an all-zero family would silently produce useless keys
                 throw new IllegalStateException("Cannot load hash family from " + file, e);
             }
         } else {
             Random rd = new Random();
             for (int i = 0; i < hashCount; i++) {
                 for (int j = 0; j < bitCount; j++) {
                     hashFamily[i][j] = rd.nextInt(size);
                 }
             }
             try (ObjectOutputStream oout = new ObjectOutputStream(new FileOutputStream(file))) {
                 oout.writeObject(hashFamily);
             } catch (IOException e) {
                 // An unsaved family would be regenerated differently on the next run
                 throw new IllegalStateException("Cannot save hash family to " + file, e);
             }
         }
     }

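     // Converts the vector to a unary 0/1 array: with an element range of 255,
     // the element 65 becomes 65 ones followed by 190 zeros
     private int[] unAray(int[] data) {
         int[] unArayData = new int[size];
         for (int i = 0; i < data.length; i++) {
             // Clamp so that a value above max cannot spill into the next element's bits
             int ones = Math.min(data[i], max);
             for (int j = 0; j < ones; j++) {
                 unArayData[i * max + j] = 1;
             }
         }
         return unArayData;
     }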

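     /**
      * Maps a vector to its LSH key in hash table number hashNum.
      */
     private String generateHashKey(int[] list, int hashNum) {
         // First convert the vector to its unary 0/1 representation
         int[] tempData = unAray(list);
         // Prefix the table index so that identical bit patterns from different
         // tables do not end up in the same bucket
         StringBuilder sb = new StringBuilder().append(hashNum).append(':');
         // Keep only the bits sampled for this hash table
         for (int i = 0; i < bitCount; i++) {
             sb.append(tempData[hashFamily[hashNum][i]]);
         }
         // Then compress the key with an ordinary hash function such as MD5
         MessageDigest messageDigest;
         try {
             messageDigest = MessageDigest.getInstance("MD5");
         } catch (NoSuchAlgorithmException e) {
             // Fail fast rather than continue with a null digest
             throw new IllegalStateException("MD5 is not available", e);
         }
         byte[] hash = messageDigest.digest(sb.toString().getBytes());
         return MD5Util.bufferToHex(hash);
     }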

     /**
      * Converts a SIFT feature point to hash keys and stores it in every hash table.
      * The id parameter is not used inside this method.
      */
     public void generateHashMap(String id, int[] vercotr, int featureId) {
         for (int j = 0; j < hashCount; j++) {
             String key = generateHashKey(vercotr, j);
             HashTable hashTableUpdateOrAdd = new HashTable();
             hashTableUpdateOrAdd.setBucketId(key);
             HashTable hashTable = hashTableDao.findHashTableByBucketId(key);
             if (hashTable != null) {
                 // Bucket exists: append this feature ID to its comma-separated list
                 hashTableUpdateOrAdd.setFeatureId(hashTable.getFeatureId() + "," + featureId);
                 hashTableDao.updateHashTableFeatureId(hashTableUpdateOrAdd);
             } else {
                 // First feature in this bucket: create the row
                 hashTableUpdateOrAdd.setFeatureId(String.valueOf(featureId));
                 hashTableDao.insertHashTable(hashTableUpdateOrAdd);
             }
         }
     }

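     // Looks up the stored vectors closest to the input vector (in Hamming space):
     // every feature ID that shares a bucket with it in some table is a candidate
     public List<String> queryList(int[] data) {
         List<String> result = new ArrayList<>();
         for (int j = 0; j < hashCount; j++) {
             String key = generateHashKey(data, j);
             // The bucket key itself is also returned, ahead of that bucket's feature IDs
             result.add(key);
             HashTable hashTable = hashTableDao.findHashTableByBucketId(key);
             // The bucket may not exist, in which case the DAO returns null
             if (hashTable != null && !StringUtils.isEmpty(hashTable.getFeatureId())) {
                 result.addAll(Arrays.asList(hashTable.getFeatureId().split(",")));
             }
         }
         return result;
     }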

 }

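 package com.demo.config;

 public class Constant {

     // Dimensionality of each vector, e.g. 128 for SIFT features
     public static final int DIMENTION = 128;

     // Upper bound of each vector element, e.g. 255 for RGB values
     public static final int MAX = 255;

     // Number of hash tables. A stored vector only has to share a bucket with the query
     // in one table, so more tables mean fewer missed neighbors (false negatives),
     // at the cost of more candidates to check
     public static final int HASHCOUNT = 12;

     // Number of bits LSH samples at random for each table. The smaller this value, the
     // more tolerant the approximate lookup, but the more false positives it returns;
     // as it approaches size, the lookup degenerates from approximate to exact matching
     public static final int BITCOUNT = 32;

 }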
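How these constants trade off can be estimated with a little arithmetic. The unary encoding has a useful property: for elements in the range 0 to MAX, the Hamming distance between two encoded vectors equals the L1 distance between the original vectors. If the sampled positions are drawn uniformly and independently (as `rd.nextInt(size)` does), one sampled bit agrees with probability $1 - d/\text{size}$, all BITCOUNT bits of one table agree with probability $(1 - d/\text{size})^{\text{BITCOUNT}}$, and at least one of the HASHCOUNT tables agrees with probability $1 - \left(1 - (1 - d/\text{size})^{\text{BITCOUNT}}\right)^{\text{HASHCOUNT}}$. The helper below is a hypothetical sketch of that formula, not part of the original classes, and it ignores MD5 collisions.

```java
// Hypothetical helper for illustration; not part of the original classes.
final class LshCollisionMath {
    // Probability that two vectors at L1 distance l1Distance share a bucket
    // in at least one of hashCount tables, each sampling bitCount bits.
    static double atLeastOneCollision(int l1Distance, int size, int bitCount, int hashCount) {
        double perBit = 1.0 - (double) l1Distance / size; // one sampled bit agrees
        double perTable = Math.pow(perBit, bitCount);     // all sampled bits of a table agree
        return 1.0 - Math.pow(1.0 - perTable, hashCount); // at least one table agrees
    }

    public static void main(String[] args) {
        int size = 128 * 255; // DIMENTION * MAX from Constant
        for (int d : new int[] {0, 500, 2000, 8000}) {
            System.out.printf("d=%d -> %.4f%n", d, atLeastOneCollision(d, size, 32, 12));
        }
    }
}
```

Raising `bitCount` makes each table stricter, while raising `hashCount` gives a near neighbor more chances to be found, which is why the two are usually tuned together.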

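A brief walkthrough of the code. The constructor LSH() creates the LSH object; hashTableDao is the data-access object for the hash table and needs no further explanation. Locality sensitive hashing depends on a set of random numbers, and these would differ on every run, so they have to be generated once, the first time the program runs, and then kept fixed. I store them on the local disk; they could also be stored in a database. generateHashMap() is the training function: int[] vercotr is the feature vector, and the other two parameters are identifiers I need. queryList() is the filtering method.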
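Besides the disk and the database, another way to keep the random numbers fixed is to derive them from a fixed seed, since `java.util.Random` produces the same sequence for the same seed. The sketch below is a hypothetical alternative, not what the code in this post does; the class name `SeededHashFamily` and the seed value are arbitrary choices for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical alternative to file persistence; the seed value is an arbitrary choice.
final class SeededHashFamily {
    static int[][] generate(long seed, int hashCount, int bitCount, int size) {
        Random rd = new Random(seed); // same seed -> same sequence of sampled indices
        int[][] family = new int[hashCount][bitCount];
        for (int i = 0; i < hashCount; i++) {
            for (int j = 0; j < bitCount; j++) {
                family[i][j] = rd.nextInt(size);
            }
        }
        return family;
    }

    public static void main(String[] args) {
        int[][] first = generate(20240101L, 12, 32, 128 * 255);
        int[][] second = generate(20240101L, 12, 32, 128 * 255);
        System.out.println(Arrays.deepEquals(first, second)); // prints "true"
    }
}
```

With this approach the seed has to be treated like the saved file: changing it, or changing hashCount, bitCount or size, invalidates every key already stored in the hash table.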

Thanks to the article at http://grunt1223.iteye.com/blog/944894.
