Implementing an n-gram Language Model with MapReduce

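In the era of big data, no single machine can process such data volumes on its own, whether the limit is CPU throughput or memory capacity. The work has to be distributed so that the resources of many machines are pooled, for both offline batch processing and online real-time processing.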

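Our last meeting covered the development of language models, from rule-based approaches to the later NNLM. The goal of this chapter is hands-on practice: now that the theory is understood, we implement an n-gram language model ourselves using the MapReduce (MR) paradigm.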
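As background, the quantity an n-gram model estimates can be written as below. This is the standard maximum-likelihood estimate, where c(·) is the number of times a word sequence occurs in the corpus (the symbol c is introduced here for illustration). Note that the code in this post stores the raw counts and ranks following words by them; it does not normalize them into probabilities.

```latex
P(w_n \mid w_1, \dots, w_{n-1}) \approx \frac{c(w_1, \dots, w_{n-1}, w_n)}{c(w_1, \dots, w_{n-1})}
```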

First, use Maven to manage the dependencies.

 <?xml version="1.0" encoding="UTF-8"?>
 <project xmlns="http://maven.apache.org/POM/4.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <modelVersion>4.0.0</modelVersion>

     <groupId>com.dingheng</groupId>
     <artifactId>nragmMR</artifactId>
     <version>1.0-SNAPSHOT</version>
     <packaging>jar</packaging>

     <dependencies>
         <!-- Hadoop 2.x client libraries (HDFS and MapReduce). The Hadoop 1.x
              hadoop-core artifact is left out because its classes clash with these. -->
         <dependency>
             <groupId>org.apache.hadoop</groupId>
             <artifactId>hadoop-client</artifactId>
             <version>2.7.2</version>
         </dependency>
         <dependency>
             <groupId>org.apache.hadoop</groupId>
             <artifactId>hadoop-common</artifactId>
             <version>2.7.2</version>
         </dependency>
         <!-- JDBC driver used by DBOutputFormat in the second job -->
         <dependency>
             <groupId>mysql</groupId>
             <artifactId>mysql-connector-java</artifactId>
             <version>8.0.12</version>
         </dependency>
     </dependencies>
 </project>

Now straight to the code:

1. First the driver, which is the entry point of the program.

 package com.dingheng;

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
 import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

 public class Driver {

     public static void main(String[] args) throws ClassNotFoundException, IOException, InterruptedException {
         // Arguments: inputDir outputDir numOfGram threshold topK
         if (args.length < 5) {
             System.err.println("Usage: Driver <inputDir> <outputDir> <numOfGram> <threshold> <topK>");
             System.exit(1);
         }
         String inputDir = args[0];
         String outputDir = args[1];
         String numOfGram = args[2];
         String threshold = args[3];
         String topK = args[4];

         // First MapReduce job: count every n-gram in the corpus.
         Configuration configurationNGram = new Configuration();
         // Split the input into records at "." so that each record is one sentence.
         configurationNGram.set("textinputformat.record.delimiter", ".");
         configurationNGram.set("numOfGram", numOfGram);

         Job jobNGram = Job.getInstance(configurationNGram);
         jobNGram.setJobName("NGram");
         jobNGram.setJarByClass(Driver.class);

         jobNGram.setMapperClass(NGram.NGramMapper.class);
         jobNGram.setReducerClass(NGram.NGramReducer.class);

         // Mapper and reducer both emit (Text, IntWritable).
         jobNGram.setOutputKeyClass(Text.class);
         jobNGram.setOutputValueClass(IntWritable.class);

         jobNGram.setInputFormatClass(TextInputFormat.class);
         jobNGram.setOutputFormatClass(TextOutputFormat.class);

         TextInputFormat.addInputPath(jobNGram, new Path(inputDir));
         TextOutputFormat.setOutputPath(jobNGram, new Path(outputDir));

         // The second job reads the first job's output, so stop if the first one failed.
         if (!jobNGram.waitForCompletion(true)) {
             System.exit(1);
         }

         // Second MapReduce job: build the language model and write it to MySQL.
         Configuration configurationLanguage = new Configuration();
         configurationLanguage.set("threshold", threshold);
         configurationLanguage.set("topK", topK);

         // Driver class name for MySQL Connector/J 8.x.
         DBConfiguration.configureDB(configurationLanguage,
                 "com.mysql.cj.jdbc.Driver",
                 "jdbc:mysql://localhost:3306/test",
                 "root",
                 "123456");

         Job jobLanguage = Job.getInstance(configurationLanguage);
         jobLanguage.setJobName("LanguageModel");
         jobLanguage.setJarByClass(Driver.class);

         jobLanguage.setMapperClass(LanguageModel.Map.class);
         jobLanguage.setReducerClass(LanguageModel.Reduce.class);

         jobLanguage.setMapOutputKeyClass(Text.class);
         jobLanguage.setMapOutputValueClass(Text.class);
         jobLanguage.setOutputKeyClass(DBOutputWritable.class);
         jobLanguage.setOutputValueClass(NullWritable.class);

         jobLanguage.setInputFormatClass(TextInputFormat.class);
         jobLanguage.setOutputFormatClass(DBOutputFormat.class);

         // Target table "output" with its three columns.
         DBOutputFormat.setOutput(
                 jobLanguage,
                 "output",
                 new String[] { "starting_phrase", "following_word", "count" });

         TextInputFormat.setInputPaths(jobLanguage, new Path(outputDir));

         System.exit(jobLanguage.waitForCompletion(true) ? 0 : 1);
     }
 }

Driver
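The driver chains the two jobs through `outputDir`: the first job writes text lines of the form `n-gram<TAB>count`, and the second job reads them back. A minimal sketch of that intermediate format, without any Hadoop dependency, is shown below. The class `CountLineSketch` and its `parse` method are hypothetical names introduced only for illustration; they are not part of the project.

```java
// Hypothetical helper for illustration only; not part of the original project.
class CountLineSketch {

    // Splits one line of the first job's output ("w1 w2 ... wn<TAB>count") into
    // { starting phrase, following word, count }. Returns null for malformed lines
    // and for single-word keys, which have no starting phrase.
    static String[] parse(String line) {
        String[] parts = line.trim().split("\t");
        if (parts.length != 2) {
            return null;
        }
        String ngram = parts[0].trim();
        int lastSpace = ngram.lastIndexOf(' ');
        if (lastSpace < 0) {
            return null;
        }
        return new String[] {
            ngram.substring(0, lastSpace).trim(), // starting phrase
            ngram.substring(lastSpace + 1),       // following word
            parts[1].trim()                       // count, still as text
        };
    }
}
```

For example, `CountLineSketch.parse("i love big data\t10")` yields the three fields `i love big`, `data` and `10`, which correspond to the three columns of the `output` table.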

2. Next is a custom class that defines the output record written to the database.

 package com.dingheng;

 import org.apache.hadoop.mapreduce.lib.db.DBWritable;

 import java.sql.PreparedStatement;
 import java.sql.ResultSet;
 import java.sql.SQLException;

 // One row of the "output" table: (starting_phrase, following_word, count).
 public class DBOutputWritable implements DBWritable {

     private String starting_phrase;
     private String following_word;
     private int count;

     public DBOutputWritable(String starting_phrase, String following_word, int count) {
         this.starting_phrase = starting_phrase;
         this.following_word = following_word;
         this.count = count;
     }

     // Binds the fields to the INSERT statement, in column order.
     public void write(PreparedStatement statement) throws SQLException {
         statement.setString(1, starting_phrase);
         statement.setString(2, following_word);
         statement.setInt(3, count);
     }

     // Reads the fields back from a query result, in column order.
     public void readFields(ResultSet resultSet) throws SQLException {
         this.starting_phrase = resultSet.getString(1);
         this.following_word = resultSet.getString(2);
         this.count = resultSet.getInt(3);
     }
 }

DBOutputWritable

3. Then come the mappers and reducers. I used two MR passes, each written in its own file.

 package com.dingheng;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;

 import java.io.IOException;

 public class NGram {

     public static class NGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

         int numOfGram;

         @Override
         public void setup(Context context) {
             Configuration conf = context.getConfiguration();
             numOfGram = conf.getInt("numOfGram", 5);
         }

         @Override
         public void map(LongWritable key,
                         Text value,
                         Context context) throws IOException, InterruptedException {
             /*
             input: one sentence, e.g. "I love data" with n = 3
             output:
             i love -> 1
             i love data -> 1
             love data -> 1
             */
             // Replace non-letters first, then trim, so that a sentence starting
             // with a digit or punctuation does not yield an empty first word.
             String line = value.toString().toLowerCase().replaceAll("[^a-z]", " ").trim();
             String[] words = line.split("\\s+");
             if (words.length < 2) {
                 return;
             }

             StringBuilder sb;
             for (int i = 0; i < words.length; i++) {
                 sb = new StringBuilder();
                 sb.append(words[i]);
                 // Extend the phrase starting at word i one word at a time, up to numOfGram words.
                 for (int j = 1; i + j < words.length && j < numOfGram; j++) {
                     sb.append(" ");
                     sb.append(words[i + j]);
                     context.write(new Text(sb.toString()), new IntWritable(1));
                 }
             }
         }
     }

     public static class NGramReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

         @Override
         public void reduce(Text key,
                            Iterable<IntWritable> values,
                            Context context) throws IOException, InterruptedException {
             // Sum the occurrences of each n-gram.
             int sum = 0;
             for (IntWritable value : values) {
                 sum = sum + value.get();
             }
             context.write(key, new IntWritable(sum));
         }
     }
 }

NGram
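To see what the first mapper emits without running a cluster, the same sliding-window logic can be tried on a plain string. This is a minimal sketch under stated assumptions: the class `NGramSketch` and its `ngrams` method are hypothetical names for illustration, and `Locale.ROOT` is added here so that lowercasing behaves the same on every machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the original project.
class NGramSketch {

    // Returns every 2-gram up to n-gram of the sentence, in the order
    // in which the mapper would emit them.
    static List<String> ngrams(String sentence, int n) {
        // Lowercase, turn every non-letter into a space, then trim.
        String line = sentence.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", " ").trim();
        String[] words = line.split("\\s+");
        List<String> result = new ArrayList<>();
        if (words.length < 2) {
            return result; // a single word cannot form a 2-gram
        }
        for (int i = 0; i < words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            // Grow the phrase starting at word i, one word at a time, up to n words.
            for (int j = 1; i + j < words.length && j < n; j++) {
                sb.append(' ').append(words[i + j]);
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [i love, i love data, love data]
        System.out.println(ngrams("I love data", 3));
    }
}
```

In the real job each of these strings becomes a key with the value 1, and the reducer adds the ones up per key.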

 package com.dingheng;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.List;
 import java.util.TreeMap;

 public class LanguageModel {

     public static class Map extends Mapper<LongWritable, Text, Text, Text> {

         // input: i love big data\t10
         // output: key: i love big  value: data=10

         int threshold;

         @Override
         protected void setup(Context context) throws IOException, InterruptedException {
             Configuration configuration = context.getConfiguration();
             threshold = configuration.getInt("threshold", 20);
         }

         @Override
         public void map(LongWritable key,
                         Text value,
                         Context context) throws IOException, InterruptedException {
             if ((value == null) || (value.toString().trim().length() == 0)) {
                 return;
             }

             String line = value.toString().trim();
             String[] wordsPlusCount = line.split("\t");
             // Check the format before parsing, otherwise a line without a tab
             // would throw a NumberFormatException.
             if (wordsPlusCount.length < 2) {
                 return;
             }

             String[] words = wordsPlusCount[0].trim().split("\\s+");
             int count = Integer.parseInt(wordsPlusCount[wordsPlusCount.length - 1].trim());
             // Drop rare n-grams.
             if (count < threshold) {
                 return;
             }

             // Everything except the last word is the starting phrase.
             StringBuilder sb = new StringBuilder();
             for (int i = 0; i < words.length - 1; i++) {
                 sb.append(words[i]);
                 sb.append(" ");
             }

             String outputKey = sb.toString().trim();
             String outputValue = words[words.length - 1];
             if (!(outputKey.length() < 1)) {
                 context.write(new Text(outputKey), new Text(outputValue + "=" + count));
             }
         }
     }

     public static class Reduce extends Reducer<Text, Text, DBOutputWritable, NullWritable> {

         int topK;

         @Override
         protected void setup(Context context) throws IOException, InterruptedException {
             Configuration configuration = context.getConfiguration();
             topK = configuration.getInt("topK", 5);
         }

         @Override
         public void reduce(Text key,
                            Iterable<Text> values,
                            Context context) throws IOException, InterruptedException {
             // key: i love big
             // values: <data=10, girl=100, boy=1000 ...>

             // Group the following words by count, highest count first:
             // <1000, <boy>>, <100, <girl>>, <10, <data, baby...>>
             TreeMap<Integer, List<String>> tm = new TreeMap<Integer, List<String>>(Collections.<Integer>reverseOrder());

             for (Text val : values) {
                 // val: data=10
                 String[] wordAndCount = val.toString().trim().split("=");
                 String word = wordAndCount[0].trim();
                 int count = Integer.parseInt(wordAndCount[1].trim());

                 if (tm.containsKey(count)) {
                     tm.get(count).add(word);
                 } else {
                     List<String> list = new ArrayList<String>();
                     list.add(word);
                     tm.put(count, list);
                 }
             }

             // Emit at most topK rows, also when several words share the same count.
             int emitted = 0;
             for (int keyCount : tm.keySet()) {
                 for (String curWord : tm.get(keyCount)) {
                     if (emitted >= topK) {
                         return;
                     }
                     context.write(new DBOutputWritable(key.toString(), curWord, keyCount), NullWritable.get());
                     emitted++;
                 }
             }
         }
     }
 }

LanguageModel
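The reducer's selection step can be tried in isolation. The sketch below groups candidate following words by count in a descending `TreeMap`, as the reducer does, and stops once k words have been collected. The class `TopKSketch` and its `topK` method are hypothetical names for illustration; capping ties at exactly k is an assumption about the intended behaviour.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of the original project.
class TopKSketch {

    // Returns at most k "word=count" strings, highest counts first.
    static List<String> topK(Map<String, Integer> followers, int k) {
        // Descending order on the count, several words may share one count.
        TreeMap<Integer, List<String>> byCount = new TreeMap<>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : followers.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), c -> new ArrayList<>()).add(e.getKey());
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            for (String word : e.getValue()) {
                if (result.size() == k) {
                    return result; // stop as soon as k words are collected
                }
                result.add(word + "=" + e.getKey());
            }
        }
        return result;
    }
}
```

With the followers `data=10`, `girl=100`, `boy=1000` and k = 2, the result is `boy=1000` followed by `girl=100`. A `TreeMap` keeps the whole group in memory, which is fine as long as the number of distinct following words per starting phrase stays moderate; a bounded priority queue would be an alternative for very large groups.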
