Creating a URL stream using a Twitter filter

Start by creating the project directory and standard Maven folder structure (http://maven.apache.org/guides/introduction/introduction-to-the-standard-directory-layout.html).

1. Create the POM as per the Creating a "Hello World" topology recipe in Chapter 1, Setting Up Your Development Environment, updating the <artifactId> and <name> tag values to tfidf-topology, and including the dependencies required by this recipe (Twitter4J, Jedis, Apache Tika, and Lucene, among others).

2. Import the project into Eclipse after generating the Eclipse project files:

mvn eclipse:eclipse

3. Create a new spout called TwitterSpout that extends from BaseRichSpout, and add the following member-level variables:

public class TwitterSpout extends BaseRichSpout {
    LinkedBlockingQueue<Status> queue = null;  // buffer between the Twitter stream thread and nextTuple
    TwitterStream twitterStream;               // Twitter4J streaming client
    String[] trackTerms;                       // terms to track in the filter query
    long maxQueueDepth;                        // drop tweets once the buffer reaches this size
    SpoutOutputCollector collector;            // collector assigned in open
}
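
The recipe does not show how trackTerms and maxQueueDepth are populated. A minimal sketch, assuming they are simply passed in through a constructor when the topology is built:

public TwitterSpout(String[] trackTerms, long maxQueueDepth) {
    // Supplied by the topology builder; spouts are serialized to the workers,
    // so only simple, serializable state should be held here
    this.trackTerms = trackTerms;
    this.maxQueueDepth = maxQueueDepth;
}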

4. In the open method of the spout, initialize the blocking queue and create a Twitter stream listener:

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    this.collector = collector;
    queue = new LinkedBlockingQueue<Status>(1000);

    StatusListener listener = new StatusListener() {
        @Override
        public void onStatus(Status status) {
            if (queue.size() < maxQueueDepth) {
                LOG.trace("TWEET Received: " + status);
                queue.offer(status);
            } else {
                LOG.error("Queue is now full, the following message is dropped: " + status);
            }
        }
    };

    twitterStream = new TwitterStreamFactory().getInstance();
    twitterStream.addListener(listener);

    FilterQuery filter = new FilterQuery();
    filter.count(0);
    filter.track(trackTerms);
    twitterStream.filter(filter);
}
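
Note that StatusListener is an interface with several callbacks besides onStatus, so the anonymous class above must also implement them, even if only as no-ops. A minimal sketch (the exact set of methods depends on your Twitter4J version):

@Override
public void onDeletionNotice(StatusDeletionNotice statusDeletionNotice) { }

@Override
public void onTrackLimitationNotice(int numberOfLimitedStatuses) { }

@Override
public void onScrubGeo(long userId, long upToStatusId) { }

@Override
public void onException(Exception ex) {
    LOG.error("Twitter stream error: " + ex.toString());
}

// Only required on Twitter4J 3.x and later
@Override
public void onStallWarning(StallWarning warning) { }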

5. Then create the Twitter stream and apply the filter query, as shown at the end of the open method in the previous step.

6. You then need to emit the tweet into the topology from the spout's nextTuple method:

public void nextTuple() {

    Status ret = queue.poll();

    if(ret == null) {
        try {
            Thread.sleep(50);
        }
        catch (InterruptedException e) {}
    }
    else {
        collector.emit(new Values(ret));
    }
}
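
The spout must also declare its output fields. A minimal sketch; the field name "tweet" is an assumption, since the downstream bolt reads the tuple by position rather than by name:

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // A single field carrying the Twitter4J Status object
    declarer.declare(new Fields("tweet"));
}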

7. Next, you must create a bolt that persists the URLs contained in each tweet to Redis, so they can be consumed by another topology within the same cluster. Create a BaseRichBolt subclass called PublishURLBolt that declares no output fields, and provide the following execute method:

public class PublishURLBolt extends BaseRichBolt {

    // Redis client, connected in prepare (see the sketch that follows)
    Jedis jedis;

    public void execute(Tuple input) {
        Status ret = (Status) input.getValue(0);
        URLEntity[] urls = ret.getURLEntities();

        // Push each URL contained in the tweet onto a Redis list named "url"
        for (int i = 0; i < urls.length; i++) {
            jedis.rpush("url", urls[i].getURL().trim());
        }
    }
}
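
The bolt also needs to connect to Redis and satisfy the remaining BaseRichBolt methods. A minimal sketch, assuming the Redis host and port are passed through the topology configuration using the same Conf keys as the TweetURLSpout in the next step:

@Override
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    // Connect to Redis using the host and port supplied in the topology configuration
    String host = stormConf.get(Conf.REDIS_HOST_KEY).toString();
    int port = Integer.valueOf(stormConf.get(Conf.REDIS_PORT_KEY).toString());
    jedis = new Jedis(host, port);
}

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // Nothing is emitted downstream; the URLs are published to Redis instead
}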

8. Finally, you will need to read the URLs back from Redis into a stream for the Trident topology. To do this, create another spout called TweetURLSpout:

public class TweetURLSpout extends BaseRichSpout {

    String host;
    int port;
    Jedis jedis;
    SpoutOutputCollector collector;

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("url"));
    }

    @Override
    public void open(Map conf, TopologyContext topologyContext, SpoutOutputCollector spoutOutputCollector) {
        host = conf.get(Conf.REDIS_HOST_KEY).toString();
        port = Integer.valueOf(conf.get(Conf.REDIS_PORT_KEY).toString());
        this.collector = spoutOutputCollector;

        connectToRedis();
    }

    private void connectToRedis() {
        jedis = new Jedis(host, port);
    }

    @Override
    public void nextTuple() {
        // Pop the next URL published by PublishURLBolt; back off briefly if the list is empty
        String url = jedis.rpop("url");
        if (url == null) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
            }
        } else {
            collector.emit(new Values(url));
        }
    }
}
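
The recipe does not show how TwitterSpout and PublishURLBolt are wired together on the publishing side. A minimal sketch, assuming a plain Storm topology submitted to the same cluster; the component names, track terms, and parallelism values are illustrative assumptions:

// Called from a main method declared with throws Exception
TopologyBuilder builder = new TopologyBuilder();

// Buffer at most 1000 tweets in the spout and track a couple of example terms
builder.setSpout("twitterSpout", new TwitterSpout(new String[] {"storm", "trident"}, 1000), 1);

// Publish the URLs found in each tweet to Redis
builder.setBolt("publishURLBolt", new PublishURLBolt(), 1)
       .shuffleGrouping("twitterSpout");

Config conf = new Config();
conf.put(Conf.REDIS_HOST_KEY, "localhost");
conf.put(Conf.REDIS_PORT_KEY, Conf.DEFAULT_JEDIS_PORT);

StormSubmitter.submitTopology("tweet-url-publisher", conf, builder.createTopology());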

Deriving a clean stream of terms from the documents

This recipe consumes the URL stream, downloading the document content and deriving a clean stream of terms that are suitable for later analysis. 

A clean term is defined as a word that:
> Is not a stop word
> Is a valid dictionary word
> Is not a number or URL
> Is a lemma

A lemma is the canonical form of a word; for example, run, runs, ran, and running are forms of the same lexeme with "run" as the lemma. Lexeme, in this context, refers to the set of all the forms that have the same meaning, and lemma refers to the particular form that is chosen by convention to represent the lexeme.

The lemma is important for this recipe because it enables us to group terms that have the same meaning, which matters later when we count their frequency of occurrence.
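
As a quick illustration, the morpha stemmer used later in this recipe reduces inflected forms to their lemma. A minimal sketch (exact outputs depend on the stemmer version):

// Both calls should yield the lemma "run"
String a = MorphaStemmer.stemToken("running");
String b = MorphaStemmer.stemToken("runs");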

1. Create a class named DocumentFetchFunction that extends storm.trident.operation.BaseFunction, and provide the following implementation for the execute method:

public class DocumentFetchFunction extends BaseFunction {

    // Acceptable MIME types, passed in when the topology is built (see buildTopology later in this recipe)
    private List<String> mimeTypes;

    public DocumentFetchFunction(List<String> mimeTypes) {
        this.mimeTypes = mimeTypes;
    }

    public void execute(TridentTuple tuple, TridentCollector collector) {
        String url = tuple.getStringByField("url");
        try {
            Parser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            URL urlObject = new URL(url);
            ContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);

            parser.parse((InputStream) urlObject.getContent(), handler, metadata, parseContext);
            String[] mimeDetails = metadata.get("Content-Type").split(";");
            if ((mimeDetails.length > 0) && (mimeTypes.contains(mimeDetails[0]))) {
                collector.emit(new Values(handler.toString(), url.trim(), "twitter"));
            }
        } catch (Exception e) {
            // Unreachable or unparsable documents are simply skipped
        }
    }
}

2. Next, we need to tokenize the document. Create another class that extends BaseFunction, call it DocumentTokenizer, and provide the following execute implementation:

public class DocumentTokenizer extends BaseFunction {

    public void execute(TridentTuple tuple, TridentCollector collector) {
        String documentContents = tuple.getStringByField(TfidfTopologyFields.DOCUMENT);
        TokenStream ts = null;

        try {
            // Tokenize the document and drop English stop words
            ts = new StopFilter(Version.LUCENE_30,
                    new StandardTokenizer(Version.LUCENE_30, new StringReader(documentContents)),
                    StopAnalyzer.ENGLISH_STOP_WORDS_SET);

            CharTermAttribute termAtt = ts.getAttribute(CharTermAttribute.class);
            while (ts.incrementToken()) {
                // Reduce each token to its lemma before emitting it
                String lemma = MorphaStemmer.stemToken(termAtt.toString());
                lemma = lemma.trim().replaceAll("\n", "").replaceAll("\r", "");
                collector.emit(new Values(lemma));
            }
        } catch (IOException e) {
            LOG.error(e.toString());
        } finally {
            if (ts != null) {
                try {
                    ts.close();
                } catch (IOException e) {
                }
            }
        }
    }
}

3. We then need to filter out all the invalid terms that may be emitted by this function. To do this, implement another class called TermFilter that extends BaseFunction. Its execute method simply calls a checking function, isKeep(), and emits the term only if it passes. The isKeep() method should perform the following validations:

public class TermFilter extends BaseFunction {

    // Lucene spell checker used as a dictionary; initialized in prepare (see the next step)
    SpellChecker spellchecker;

    // Custom terms to drop in addition to the dictionary check (see the sketch after this listing)
    Set<String> filterTerms = new HashSet<String>();

    public void execute(TridentTuple tuple, TridentCollector collector) {
        // The input field name matches the one declared in buildTopology
        String stem = tuple.getStringByField("dirtyTerm");
        if (isKeep(stem)) {
            collector.emit(new Values(stem));
        }
    }

    private boolean isKeep(String stem) {
        if (stem == null) {
            return false;
        }

        if (stem.equals("")) {
            return false;
        }

        if (filterTerms.contains(stem)) {
            return false;
        }

        // We don't want integers
        try {
            Integer.parseInt(stem);
            return false;
        } catch (Exception e) {
        }

        // Or floating point numbers
        try {
            Double.parseDouble(stem);
            return false;
        } catch (Exception e) {
        }

        // Finally, keep only valid dictionary words
        try {
            return spellchecker.exist(stem);
        } catch (Exception e) {
            LOG.error(e.toString());
            return false;
        }
    }
}
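
The filterTerms set is not populated anywhere in this recipe. A minimal sketch of one way to seed it, either at declaration or inside prepare; the specific terms are illustrative assumptions:

filterTerms = new HashSet<String>(Arrays.asList("http", "https", "rt", "amp"));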

4. The dictionary needs to be initialized during the prepare method for this function:

public void prepare(Map conf, TridentOperationContext context){
    super.prepare(conf, context);

    File dir = new File(System.getProperty("user.home") + "/dictionaries");
    Directory directory;

    try {
        directory = FSDirectory.open(dir);
        spellchecker = new SpellChecker(directory);
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
        URL dictionaryFile = TermFilter.class.getResource("/dictionaries/fulldictionary00.txt");

        spellchecker.indexDictionary(new PlainTextDictionary(new File(dictionaryFile.toURI())), config, true);
    }
    catch (Exception e) {
        LOG.error(e.toString());
        throw new RuntimeException(e);
    }
}

5. Download the dictionary file from http://dl.dropbox.com/u/7215751/JavaCodeGeeks/LuceneSuggestionsTutorial/fulldictionary00.zip and place it in the src/main/resources/dictionaries folder of your project structure.

6. Finally, you need to create the actual topology, or at least part of it for the moment. Create a class named TermTopology that provides a main(String[] args) method and creates a local-mode cluster:

public class TermTopology {

    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setMaxSpoutPending(20);
        conf.put(Conf.REDIS_HOST_KEY, "localhost");
        conf.put(Conf.REDIS_PORT_KEY, Conf.DEFAULT_JEDIS_PORT);

        if (args.length == 0) {
            // Run in local mode for 60 seconds, then shut the cluster down
            LocalDRPC drpc = new LocalDRPC();
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("tfidf", conf, buildTopology(drpc));
            Thread.sleep(60000);
            cluster.shutdown();
        }
    }
}

7. Then build the appropriate portion of the topology:

public static StormTopology buildTopology(LocalDRPC drpc) {

    TridentTopology topology = new TridentTopology();
    FixedBatchSpout testSpout = new FixedBatchSpout(new Fields("url"), 1,
            new Values("http://t.co/hP5PM6fm"), new Values("http://t.co/xSFteG23"));
    testSpout.setCycle(true);

    // MIME types the fetch function will accept; the exact list is an assumption, adjust as required
    List<String> mimeTypes = Arrays.asList("text/html", "application/pdf", "text/plain");

    Stream documentStream = topology
        .newStream("tweetSpout", testSpout)
        .parallelismHint(20)
        .each(new Fields("url"), new DocumentFetchFunction(mimeTypes), new Fields("document", "documentId", "source"));

    Stream termStream = documentStream
        .parallelismHint(20)
        .each(new Fields("document"), new DocumentTokenizer(), new Fields("dirtyTerm"))
        .each(new Fields("dirtyTerm"), new TermFilter(), new Fields("term"))
        .project(new Fields("term", "documentId", "source"));

    return topology.build();
}
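
The FixedBatchSpout above appears to be a stand-in for testing. Once the publishing topology from the previous recipe is running, the same stream can be fed from Redis instead; a minimal sketch, assuming TweetURLSpout is swapped in for the test spout:

// Replace the FixedBatchSpout with the Redis-backed spout created earlier
Stream documentStream = topology
    .newStream("tweetSpout", new TweetURLSpout())
    .parallelismHint(20)
    .each(new Fields("url"), new DocumentFetchFunction(mimeTypes), new Fields("document", "documentId", "source"));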
