Individual Project Records

At midnight on September 20, I finished my individual project, a word frequency program. The detailed requirements are at http://www.cnblogs.com/jiel/p/3978727.html

Before I began coding, I estimated that I could finish in about 4 hours or less, because it did not seem difficult: perhaps 1 hour for the IO part and 3 hours for the functional part. In fact, the program took me about 8 hours, not including optimization or other work.

Since the program is not complex, I didn't use object-oriented techniques. I divided it into five modules: format checking, parameter configuration, word counting, word sorting and file writing. Each module corresponds to one function: checkFormat, config, countWords, sortWords and writeFile.

The first two modules and the last one are relatively simple. I only want to talk about the last module, file writing, because I am not familiar with file IO in C#. Still, this module is not hard: use a static method of the File class to get a FileStream, use that stream to initialize a StreamWriter, and then you can write to the file easily.
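A minimal sketch of that file-writing step might look like the following. The names FileWritingSketch and WriteLines are hypothetical and do not come from the original program; only File.Create and StreamWriter are the calls the paragraph above refers to.

```csharp
using System.Collections.Generic;
using System.IO;

static class FileWritingSketch
{
    // Writes each given line (for example "word:count") to the file at path.
    public static void WriteLines(string path, IEnumerable<string> lines)
    {
        // File.Create is a static method that returns a FileStream.
        // The using statements close the writer and the stream even if an exception occurs.
        using (FileStream fs = File.Create(path))
        using (StreamWriter sw = new StreamWriter(fs))
        {
            foreach (string line in lines)
            {
                sw.WriteLine(line);
            }
        }
    }
}
```

Calling FileWritingSketch.WriteLines("Name.txt", new[] { "asd:1", "file123:1" }) would create the file and write two lines. Wrapping both objects in using is a design choice here: it replaces the explicit Close calls and still releases the file handle on errors.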

The other two modules, word counting and word sorting, are the core of the program. Naturally, I use Hashtable to implement word counting and ArrayList to implement word sorting. In simple mode, I first use Regex.Split to get all the strings delimited by separators, and then check one by one whether each string is a word that satisfies the definition. If a string is a word, I add it to a word-to-count Hashtable and a word-to-word Hashtable. In the second Hashtable I record the spelling of the word that comes first in dictionary order. In mode2 and mode3, things change. I cannot just use a regex to pick up all runs of two or three consecutive words, because successive matches do not overlap. For example, if the content is "how are you", I should get two pairs of consecutive words, "how are" and "are you". My solution is to pass the index parameter to the regex.Match method, so that the next search starts at the second word of the previous match. I write regex.Match rather than Regex.Match here because regex is an object and the method is an instance method, not a static one.
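The two ideas in this paragraph can be sketched as follows. This is an illustration under assumptions, not the original code: the class CountingSketch and its methods are hypothetical, it uses the generic Dictionary instead of Hashtable to keep the sketch short, and it takes "first in dictionary order" to mean ordinal (ASCII) order. The word pattern is the one used in the program's regexes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CountingSketch
{
    // A word: at least three letters, followed by any letters or digits.
    const string Word = "[a-zA-Z]{3}[a-zA-Z0-9]*";
    static readonly Regex WordOnly = new Regex("^" + Word + "$");
    static readonly Regex Pair = new Regex(Word + " " + Word);

    // Counts s case-insensitively and remembers the spelling that sorts first in ordinal order.
    static void Add(string s, Dictionary<string, int> counts, Dictionary<string, string> spelling)
    {
        string key = s.ToUpperInvariant();
        int n;
        counts.TryGetValue(key, out n);      // n stays 0 when the key is absent
        counts[key] = n + 1;
        string old;
        if (!spelling.TryGetValue(key, out old) || string.CompareOrdinal(s, old) < 0)
            spelling[key] = s;
    }

    // Re-keys the counts by the recorded spelling.
    static Dictionary<string, int> Relabel(Dictionary<string, int> counts, Dictionary<string, string> spelling)
    {
        var result = new Dictionary<string, int>();
        foreach (var kv in counts) result[spelling[kv.Key]] = kv.Value;
        return result;
    }

    // Simple mode: split on separators, then keep only the parts that are words.
    public static Dictionary<string, int> CountWords(string content)
    {
        var counts = new Dictionary<string, int>();
        var spelling = new Dictionary<string, string>();
        foreach (string s in Regex.Split(content, "[^a-zA-Z0-9]"))
            if (WordOnly.IsMatch(s)) Add(s, counts, spelling);
        return Relabel(counts, spelling);
    }

    // Mode 2: overlapping pairs, found by restarting the search inside the previous match.
    public static Dictionary<string, int> CountPairs(string content)
    {
        var counts = new Dictionary<string, int>();
        var spelling = new Dictionary<string, string>();
        int index = 0;
        Match m;
        while ((m = Pair.Match(content, index)).Success)
        {
            Add(m.Value, counts, spelling);
            // Resume at the second word of this match so it can begin the next pair.
            index = m.Index + m.Value.IndexOf(' ') + 1;
        }
        return Relabel(counts, spelling);
    }
}
```

For the content "how are you", CountPairs finds "how are" starting at position 0, restarts at position 4, and then finds "are you". Like the patterns in the program, this sketch only treats a single space as the gap between consecutive words.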

Looking back on the process of writing this code, I think it was not difficult as a whole. What really cost a lot of time was looking up information, because I am not very familiar with C#.

My work on performance analysis is as follows.

I have to say the performance analysis task really frustrated me. I searched the Internet for a guide and did as the guide said, but I could not get the expected result. Below is my performance analysis graph. By the way, I used the third analysis method; otherwise I could not get a report.

For some reason, I am putting a nice graph first, though I have forgotten how I got it.

From the report I can see that the countWords function took the most time. After all, the main job of this program is counting words. Within countWords, the functions GetFiles and myCountWordsInFiles split the time about half and half. The focus of my optimization is myCountWordsInFiles.

Let's look at the information about the function myCountWordsInFiles.

Now I know that the optimization target is the function count.

In order to test my program, I constructed 10 test cases.

1. Test recognition of word

"file123 123file 1er u4y5 asd"

Should be:

asd:1
file123:1

2. Test processing of same words when ignoring case

"File FILE file asd Asd ASD AsD"

Should be:

ASD:4
FILE:3

3. Test recognition of continuous two words

"abc def ghi jkl mno"

Should be:

abc def:1
def ghi:1
ghi jkl:1
jkl mno:1

4. Test sorting

"FILE file ASD asd asD ASC asc ASc"

Should be:

ASC:3
ASD:3
FILE:2

5. Test sorting continuous two words

"hello World
hello world

how are you
how Are you
How are you"

Should be:

Are you:3
How are:3
hello World:2

6. Test sorting continuous three words

"how are you

how Are you

fine thank you and YOU
fine Thank you And you
fine thank YOU and"

Should be:

fine Thank you:3
Thank you And:3
how Are you:2
you And you:2

P.S. In dictionary order, "fine Thank you And you" should come ahead of "fine thank you and YOU", but in ASCII order the order should stay as in the original. As far as I know, both sorting orders are used by some classmates.

7. Test empty directory and empty file

Should be: an empty output file

8. Test separator

"sgq&qwge#wet@wqe t$111sdf"

Should recognize words: sgq qwge wet wqe

9. Test files with suffix ".h", ".cs", ".cpp", ".txt" and files with other suffixes.

Only content in files with suffix ".h", ".cs", ".cpp", ".txt" should be counted

10. Test with a large number of files covering all the cases above

Maybe I can consider this project finished. But I do not think I got enough payback compared to the time I spent on it. And I think the evaluation standard, in which grade n corresponds to 1/n of full points, really sucks. It puts heavy pressure on me. Maybe the biggest harvest is that I just wrote my first English blog post. But knowing that I wrote it for CE under that kind of evaluation standard, I cannot say I am happy.

-----------------------------------------------------------------I am a gorgeous separator------------------------------------------------------------------

At 2014-9-22 21:15:02, I have just finished a round of optimization. Below are my thoughts and measures.

In my program, I found that in order to sort the Hashtable, too much time was spent picking each key/value pair out of it, initializing a new object and putting that object into an ArrayList. I also thought that this program had no need for Hashtable at all. So at first I used only Word, a class that holds a word and the number of times it appears, to do everything. But the efficiency decreased! After consulting the chm documentation, I found that the Contains method of Hashtable is O(1), and that is why my adjustment made the program slower. So I came up with a new idea. I still use Hashtable, with a Word as the key and a Word as the value, where key == value; I mean they are the same reference. That way I can use the efficient Contains method. Now my program can finish the word statistics for the vs2012 program in extended mode2 in 1'40''. Perhaps that is still not good enough. And maybe there is an adorable bug somewhere in my program, ruining all my work these days. That's just stupid.
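The "same reference as key and value" idea can be sketched like this. It is only an illustration under assumptions: Entry is a hypothetical stand-in for the Word class, SameReferenceSketch and CountAndSort are made-up names, and the ordering (higher count first, then ordinal order) follows the test cases above.

```csharp
using System;
using System.Collections;

// Hypothetical stand-in for the Word class: equality and hashing ignore case.
class Entry : IComparable
{
    public string Text;
    public int Times = 1;
    public Entry(string text) { Text = text; }

    public override bool Equals(object obj)
    {
        Entry other = obj as Entry;
        return other != null && string.Equals(Text, other.Text, StringComparison.OrdinalIgnoreCase);
    }

    public override int GetHashCode()
    {
        return StringComparer.OrdinalIgnoreCase.GetHashCode(Text);
    }

    // Higher count first, then ordinal (ASCII) order of the spelling.
    public int CompareTo(object obj)
    {
        Entry other = (Entry)obj;
        if (Times != other.Times) return other.Times.CompareTo(Times);
        return string.CompareOrdinal(Text, other.Text);
    }
}

static class SameReferenceSketch
{
    public static ArrayList CountAndSort(string[] words)
    {
        Hashtable table = new Hashtable();
        foreach (string s in words)
        {
            Entry probe = new Entry(s);
            Entry stored = (Entry)table[probe];   // hashed lookup; null when absent
            if (stored == null)
            {
                table.Add(probe, probe);          // key and value are the same object
            }
            else
            {
                stored.Times++;
                // Changing only the letter case keeps the case-insensitive hash code valid.
                if (string.CompareOrdinal(s, stored.Text) < 0) stored.Text = s;
            }
        }
        // The values are already the objects to sort, so none are created here.
        ArrayList sorted = new ArrayList(table.Values);
        sorted.Sort();
        return sorted;
    }
}
```

With the input of test case 4 (FILE file ASD asd asD ASC asc ASc), the sorted list holds ASC with 3, ASD with 3 and FILE with 2. The sketch does allocate one probe object per lookup, which a tuned version would probably avoid.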

I got a new analysis report, but I still do not know how to map those DLLs to the corresponding functions.

-----------------------------------------------------------------I am another gorgeous separator------------------------------------------------------------

Multithreading is a deep hole! This morning (2014-9-24) I debugged until 2:00 a.m., even thinking about not sleeping in order to fix it.

Following a reference, I wrote a multithreaded word frequency program. Unfortunately, it always misses some words. I am certainly weak at this. The code is below for reference.

using System;
using System.Threading;
using System.Collections.Concurrent;
using System.Collections;
using System.IO;
using System.Text;
using System.Linq;
using System.Text.RegularExpressions;

namespace ThreadSimple
{
    class Program
    {
        //  Field definitions
        //  Three modes:
        //  mode1 counts single words only
        //  mode2 additionally counts two consecutive words
        //  mode3 additionally counts three consecutive words
        enum modes { mode1, mode2, mode3 };

        //  Static variable indicating the mode, shared by all functions
        static modes mode;
        static string directoryPath;

        //  Usage hint shown to the user
        static string usage =
            "Usage: Myapp.exe [-e2]/[-e3] <directory-name>\nAttention: directory-name with space should be in double quotation marks.\nAnd no need to add '\\' at the end of your directory path";

        //  Pattern 1: simple mode, used to find single words
        static Regex re1 = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]*");
        //  Pattern 1 preceded by a space, used by mode 2 when looking for consecutive words
        static Regex re1p = new Regex(" [a-zA-Z]{3}[a-zA-Z0-9]*");
        //  Pattern 2: word + one space + word
        static Regex re2 = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");
        //  Pattern 2 preceded by a space, used by mode 3 when looking for consecutive words
        static Regex re2p = new Regex(" [a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");
        //  Pattern 3: word + one space + word + one space + word
        static Regex re3 = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");

        //  Exit function for a malformed command line
        static void oops1()
        {
            System.Console.WriteLine(usage);
            Environment.Exit(-1);
        }

        //  Exit function for an invalid command-line argument
        static void oops2()
        {
            System.Console.WriteLine("Invalid directory path");
            Environment.Exit(-2);
        }

        //  Check whether the command-line format is correct
        static void checkFormat(string[] args)
        {
            if (args.Length != 1 && args.Length != 2)
            {
                oops1();
            }
            if (args.Length == 2 && !(args[0].Equals("-e2") || args[0].Equals("-e3")))
            {
                oops1();
            }
            directoryPath = (args.Length == 1) ? args[0] : args[1];
            //System.Console.WriteLine("your input directory path is:\n" + directoryPath);
            if (directoryPath.EndsWith("\""))
            {
                directoryPath = directoryPath.Remove(directoryPath.Length - 1);
            }
            if (!Directory.Exists(directoryPath))
            {
                oops2();
            }
        }

        //  Check whether the command-line arguments are correct and set the mode
        static void config(string[] args)
        {
            if (args.Length == 2)
            {
                resultEx = new ConcurrentDictionary<string, Word>(1, 50000);
                if (args[0].Equals("-e2"))
                {
                    mode = modes.mode2;
                }
                else
                {
                    mode = modes.mode3;
                }
            }
            else
            {
                mode = modes.mode1;
            }
        }

        static ConcurrentDictionary<string, Word> resultEx;
        static ConcurrentDictionary<string, Word> result = new ConcurrentDictionary<string, Word>(1, 50000);
        static int[] tablet = new int[128];
        static BlockingCollection<string> queue;
        static Thread WorkerTh0;
        static Thread WorkerTh1;
        static Thread WorkerTh2;

        static void Main(string[] args)
        {
            //  Check the format first
            checkFormat(args);
            //  Then check the arguments
            config(args);
            string rootdir = directoryPath;
            //string[] files = Directory.GetFiles(rootdir, filePattern, SearchOption.AllDirectories);
            var files = from file in System.IO.Directory.GetFiles(rootdir, "*.*", System.IO.SearchOption.AllDirectories)
                        where file.EndsWith(".cpp", StringComparison.OrdinalIgnoreCase) ||
                        file.EndsWith(".txt", StringComparison.OrdinalIgnoreCase) ||
                        file.EndsWith(".cs", StringComparison.OrdinalIgnoreCase) ||
                        file.EndsWith(".h", StringComparison.OrdinalIgnoreCase)
                        select file;
            queue = new BlockingCollection<string>(100);
            Thread FileIOth = new Thread(delegate() { Read(files); });
            FileIOth.Start();
            WorkerTh0 = new Thread(delegate()
            {
                Process();
            });
            WorkerTh0.Start();
            WorkerTh1 = new Thread(delegate()
            {
                Process();
            });
            WorkerTh1.Start();
            WorkerTh2 = new Thread(delegate()
            {
                Process();
                //  NOTE: the output is written as soon as this thread's own Process() returns.
                //  It does not Join WorkerTh0 and WorkerTh1, so words they are still counting
                //  can be missing from the output.
                FileStream fs = File.Create("Name.txt");
                //byte[] buffer;
                StreamWriter sw = new StreamWriter(fs);
                var outputResult = from value in result
                                   orderby value.Value
                                   select new StringBuilder(value.Value.word).Append(":").Append(value.Value.times);
                foreach (var str in outputResult)
                {
                    sw.WriteLine(str);
                }
                if (mode != modes.mode1)
                {
                    var outputResultEx = from value in resultEx
                                       orderby value.Value
                                       select new StringBuilder(value.Value.word).Append(":").Append(value.Value.times);
                    int count = 0;
                    foreach (var str in outputResultEx)
                    {
                        ++count;
                        sw.WriteLine(str);
                        if (count >= 10)
                        {
                            break;
                        }
                    }
                }
                sw.Close();
                fs.Close();
                //DateTime ot = DateTime.Now;
                //Console.WriteLine("Time: " + ((ot.Minute * 60 + ot.Second) * 1000 + ot.Millisecond - (dt.Minute * 60 + dt.Second) * 1000 - dt.Millisecond) + "ms");
                //Console.ReadKey();
            });
            WorkerTh2.Start();
        }

        //  Producer: read every file and put its content on the queue, then an end marker
        public static void Read(IEnumerable files)
        {
            foreach (string file in files)
            {
                queue.TryAdd(ReadFile(file), -1);
            }
            queue.TryAdd("\\END", -1);
        }

        public static string ReadFile(string file)
        {
            string readLine;
            StreamReader sr = new System.IO.StreamReader(file);
            //FileStream fs = new FileStream(file, FileMode.Open);
            //StreamReader sr = new StreamReader(fs);
            readLine = sr.ReadToEnd();
            sr.Close();
            //fs.Close();
            return readLine;
        }

        //  Consumer: take file contents from the queue until the end marker appears
        public static void Process()
        {
            string readLine;
            while (true)
            {
                queue.TryTake(out readLine, -1);
                if (readLine == "\\END")
                {
                    //  Put the end marker back so the other workers also stop
                    queue.TryAdd("\\END", -1);
                    break;
                }
                Compute(readLine);
            }
        }

        public static void countEx(string s)
        {
            Word w = null;
            //  NOTE: same non-atomic check-then-update as in count() below
            if (resultEx.TryGetValue(s.ToUpper(), out w))
            {
                if (strcmp(w.word, s) > 0)
                    w.word = s;
                w.increase();
            }
            else
            {
                resultEx.TryAdd(s.ToUpper(), w = new Word(s));
            }
        }

        public static void count(string s)
        {
            Word w = null;
            //  NOTE: TryGetValue followed by increase() or TryAdd is not atomic.
            //  Two workers can lose an increment, and a failed TryAdd drops its count.
            if (result.TryGetValue(s.ToUpper(), out w))
            {
                if (strcmp(w.word, s) > 0)
                    w.word = s;
                w.increase();
            }
            else
            {
                result.TryAdd(s.ToUpper(), w = new Word(s));
            }
        }

        public static void Compute(string readLine)
        {
            string content = readLine;
            string[] splited = Regex.Split(readLine, "[^a-zA-Z0-9]");
            //  Then check whether each split part meets the definition of a word
            foreach (string s in splited)
            {
                //  If it does, count it
                if (Regex.IsMatch(s, "^[a-zA-Z]{3}[a-zA-Z0-9]*"))
                {
                    count(s);
                }
            }
            if (mode == modes.mode2)
            {
                Match match, mtp;
                int index = 0;
                while ((match = re2.Match(content, index)).Success)
                {
                    countEx(match.Value);
                    //  NOTE: mtp is the first " word" found from index, which is not always
                    //  the second word of match, so the new index can land in the wrong place.
                    mtp = re1p.Match(content, index);
                    index = match.Index + match.Length - mtp.Length + 1;
                }
            }
            if (mode == modes.mode3)
            {
                Match match, mtp;
                int index = 0;
                while ((match = re3.Match(content, index)).Success)
                {
                    countEx(match.Value);
                    //  NOTE: same caveat as in mode 2
                    mtp = re2p.Match(content, index);
                    index = match.Index + match.Length - mtp.Length + 1;
                }
            }
        }

        //  Ordinal (character code) comparison of two strings
        public static int strcmp(string word, string tp)
        {
            int len = Math.Min(tp.Length, word.Length);
            for (int i = 0; i < len; ++i)
            {
                if (word[i] < tp[i])
                    return -1;
                else if (word[i] > tp[i])
                    return 1;
            }
            return word.Length - tp.Length;
        }

        //  Unused helper: lower-cases ASCII letters in place
        public static StringBuilder ToLower(StringBuilder str)
        {
            for (int i = 0; i < str.Length; i++)
            {
                //  Only shift 'A'..'Z'; the original test (<= 'Z') also shifted digits
                if (str[i] >= 'A' && str[i] <= 'Z')
                {
                    str[i] = (char)((int)str[i] + 32);
                }
            }
            return str;
        }
    }

    class Word : IComparer, IComparable
    {
        public string word { get; set; }
        public int times { get; set; }

        public Word increase() { ++times; return this; }

        public string newWord(string w)
        {
            Console.WriteLine("old word = {0} and new word = {1}", word, w);
            if (w.CompareTo(word) > 0)
            {
                word = w;
            }
            Console.WriteLine("Word=" + word);
            return word;
        }

        public override bool Equals(object obj)
        {
            if (obj is Word)
            {
                return ((Word)obj).word.ToUpper().Equals(this.word.ToUpper());
            }
            if (obj is String)
            {
                return ((String)obj).ToUpper().Equals(this.word.ToUpper());
            }
            return false;
        }

        public override int GetHashCode()
        {
            return word.ToUpper().GetHashCode();
        }

        //  Higher count first; ties are broken by ordinal order of the spelling
        public int CompareTo(Object w)
        {
            if (times == ((Word)w).times)
            {
                string tp = ((Word)w).word;
                int len = Math.Min(tp.Length, word.Length);
                for (int i = 0; i < len; ++i)
                {
                    if (word[i] < tp[i])
                        return -1;
                    else if (word[i] > tp[i])
                        return 1;
                }
                return word.Length - tp.Length;
            }
            else
            {
                return ((Word)w).times - times;
            }
        }

        public int Compare(Object wa, Object wb)
        {
            if (((Word)wa).times == ((Word)wb).times)
            {
                return ((Word)wa).word.CompareTo(((Word)wb).word);
            }
            else
            {
                return ((Word)wb).times - ((Word)wa).times;
            }
        }

        public Word(string fa, int times = 1)
        {
            word = fa;
            this.times = times;
        }

        public override string ToString()
        {
            return word + ":" + times;
        }
    }
}

Due to time pressure, I did not do much optimization on it; some useless code has not been deleted and some bugs have not been fixed. It is just for learning about multithreading.

For a single-threaded program, Dictionary is better than Hashtable. Besides, Dictionary has a thread-safe extension, ConcurrentDictionary.
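Judging only from the listing above, two plausible reasons for the missing words are that the output is written without waiting for every worker to finish, and that the check-then-update in count is not atomic. The following is a minimal sketch of one way to avoid both, not a fix of the original program: ConcurrentCountSketch and Count are hypothetical names, words are split on single spaces for brevity, and plain int counts replace the Word class.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

static class ConcurrentCountSketch
{
    // Counts space-separated words from all chunks using workerCount threads.
    public static ConcurrentDictionary<string, int> Count(string[] chunks, int workerCount)
    {
        var counts = new ConcurrentDictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        var queue = new BlockingCollection<string>(100);

        var workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(() =>
            {
                // The loop ends once CompleteAdding has been called and the queue is empty,
                // so no end-marker string is needed.
                foreach (string chunk in queue.GetConsumingEnumerable())
                {
                    foreach (string w in chunk.Split(' '))
                    {
                        if (w.Length == 0) continue;
                        // Insert-or-increment as one atomic dictionary operation.
                        counts.AddOrUpdate(w, 1, (key, old) => old + 1);
                    }
                }
            });
            workers[i].Start();
        }

        // Producer: the calling thread feeds the bounded queue.
        foreach (string chunk in chunks) queue.Add(chunk);
        queue.CompleteAdding();

        // Wait for every worker before anyone reads the result.
        foreach (Thread t in workers) t.Join();
        return counts;
    }
}
```

The caller reads the dictionary only after Count returns, which is after all Join calls. With a case-insensitive comparer, "abc" and "ABC" share one entry; keeping the smallest spelling as well would need a value type that holds both the spelling and the count.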
