Individual Project Records
At midnight on September 20, I finished my individual project, a word frequency program. The detailed requirements are at http://www.cnblogs.com/jiel/p/3978727.html
Before I started coding, I estimated I could finish it in about 4 hours or less, because it did not look difficult: maybe 1 hour for the IO part and 3 hours for the functional part. In fact, the program cost me about 8 hours, not including optimization or anything else.
Since the program is not complex, I didn't use object-oriented techniques. I divided it into five modules: format checking, parameter configuration, word counting, word sorting and file writing. Each module corresponds to one function: checkFormat, config, countWords, sortWords and writeFile.
The first two modules and the last one are relatively simple. I only want to talk about the last module, file writing, because I am not familiar with file IO in C#. Still, this module is not hard: use a static method of the File class to get a FileStream, use the stream to initialize a StreamWriter, and then you can write to the file easily.
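The file writing step can be sketched roughly like this. ResultWriter and WriteLines are names made up for illustration, not the ones in my program, and the sketch assumes the result lines are already formatted as "word:count".

```csharp
using System.Collections.Generic;
using System.IO;

static class ResultWriter
{
    // Hypothetical helper: writes each already-formatted line to the file at path.
    public static void WriteLines(string path, IEnumerable<string> lines)
    {
        // File.Create returns a FileStream; StreamWriter wraps it for text output.
        // The using blocks flush and close both objects, even if an exception is thrown.
        using (FileStream fs = File.Create(path))
        using (StreamWriter sw = new StreamWriter(fs))
        {
            foreach (string line in lines)
            {
                sw.WriteLine(line);
            }
        }
    }
}
```

Calling ResultWriter.WriteLines("Name.txt", lines) would overwrite any existing file of that name, because File.Create truncates it.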
The other two modules, word counting and word sorting, are the core of this program. Naturally, I used Hashtable to implement word counting and ArrayList to implement word sorting. In simple mode, I first use Regex.Split to get all the strings between separators, and then check one by one whether each string is a word that satisfies the definition. If it is, I add it to a word-to-count Hashtable and a word-to-word Hashtable. The second Hashtable records the spelling of the word that comes first in dictionary order. In mode2 and mode3, things change. I cannot just use a regex to pick up every run of two or three consecutive words, because regex matches do not overlap. For example, if the content is "how are you", I should get two two-word phrases, "how are" and "are you". My solution is to pass the start index parameter to the regex.Match method. I write regex.Match rather than Regex.Match here because regex is an object and this overload is an instance method, not a static one.
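The overlapping search for two-word phrases can be sketched roughly as follows. This is a simplified variant, not the exact code of my program: PhraseCounter and ConsecutivePairs are names made up for illustration, and the restart index is computed in a more direct way. After each match, the search restarts right after the first word of the pair, so the second word can begin the next pair.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PhraseCounter
{
    // A word is three letters followed by any letters or digits.
    static readonly Regex WordRe = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]*");
    // Two words separated by exactly one space.
    static readonly Regex PairRe =
        new Regex("[a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");

    // Returns every pair of consecutive words, including overlapping pairs.
    public static List<string> ConsecutivePairs(string content)
    {
        var pairs = new List<string>();
        int index = 0;
        Match m;
        // The instance method Match(input, startat) begins searching at index.
        while ((m = PairRe.Match(content, index)).Success)
        {
            pairs.Add(m.Value);
            // Skip only the first word of the pair, so the second word
            // is still available as the start of the next match.
            Match first = WordRe.Match(m.Value);
            index = m.Index + first.Length;
        }
        return pairs;
    }
}
```

For the content "how are you" this yields "how are" and then "are you".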
Looking back on the process of writing this code, I think it was not difficult as a whole. What really cost a lot of time was looking up information, because I am not very familiar with C#.
My work on performance analysis is as follows.
I have to say the performance analysis task really made me agitated. I searched the Internet for a guide and did as the guide said, but I could not get the expected result. Below is my performance analysis graph. By the way, I used the third analysis method; otherwise I could not get a report.
For some reason, I put a nice graph first, although I forgot how I got it.
From the report I can see that the countWords function took most of the time. After all, the main job of this program is counting words. Inside countWords, GetFiles and myCountWordsInFiles split the time about half and half. My focus for optimization is myCountWordsInFiles.
Let's look at the information about function myCountWordsInFiles.
Now I know the optimization target is the function count.
To test my program, I constructed 10 test cases.
1. Test recognition of words
"file123 123file 1er u4y5 asd"
Should be:
asd:1
file123:1
2. Test handling of identical words when case is ignored
"File FILE file asd Asd ASD AsD"
Should be:
ASD:4
FILE:3
3. Test recognition of two consecutive words
"abc def ghi jkl mno"
Should be:
abc def:1
def ghi:1
ghi jkl:1
jkl mno:1
4. Test sorting
"FILE file ASD asd asD ASC asc ASc"
Should be:
ASC:3
ASD:3
FILE:2
5. Test sorting of two consecutive words
"hello World
hello world
how are you
how Are you
How are you"
Should be:
Are you:3
How are:3
hello World:2
6. Test sorting of three consecutive words
"how are you
how Are you
fine thank you and YOU
fine Thank you And you
fine thank YOU and"
Should be:
fine Thank you:3
Thank you And:3
how Are you:2
you And you:2
P.S. In dictionary order, "fine Thank you And you" should come before "fine thank you and YOU", but in ASCII order, the order stays as in the original. As far as I know, both sorting methods are used by some classmates.
7. Test empty directory and empty file
Should be: an empty output file
8. Test separator
"sgq&qwge#wet@wqe t$111sdf"
Should recognize words: sgq qwge wet wqe
9. Test files with suffix ".h", ".cs", ".cpp", ".txt" and files with other suffixes.
Only content in files with suffix ".h", ".cs", ".cpp", ".txt" should be counted
10. Test with vast files including all above cases
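The ordering expected in cases 4 to 6 (higher count first, ties broken by ASCII order) can be sketched like this. ResultSorter is a hypothetical helper, not part of my program, and it assumes the counts have already been collected in a dictionary.

```csharp
using System.Collections.Generic;

static class ResultSorter
{
    // Sorts by count descending, then by ordinal (ASCII) order of the word.
    public static List<KeyValuePair<string, int>> Sort(IDictionary<string, int> counts)
    {
        var list = new List<KeyValuePair<string, int>>(counts);
        list.Sort((a, b) =>
        {
            int byCount = b.Value.CompareTo(a.Value);
            // string.CompareOrdinal compares raw character codes,
            // so uppercase letters sort before lowercase ones.
            return byCount != 0 ? byCount : string.CompareOrdinal(a.Key, b.Key);
        });
        return list;
    }
}
```

With the counts from case 4 (ASC 3, ASD 3, FILE 2) this gives ASC, ASD, FILE. A culture-aware comparison such as string.Compare could order mixed-case phrases differently, which is the difference mentioned in the P.S. of case 6.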
Maybe I can consider this project finished. But I do not think I got enough payback for the time I spent on it. And I think the evaluation standard, where grade n corresponds to 1/n of full points, really sucks. It puts heavy pressure on me. Maybe the biggest harvest is that I just wrote my first English blog post. But knowing that I wrote it for CE with that kind of evaluation standard, I cannot say I am happy.
-----------------------------------------------------------------I am a gorgeous separator------------------------------------------------------------------
At 2014-9-22 21:15:02, I just finished a round of optimization. Below are my thoughts and measures.
In my program, I found that sorting the Hashtable cost too much time: each key/value pair had to be picked out, turned into a new object and put into an ArrayList. I also thought there was no need to use Hashtable in this program at all. So at first I used only Word, a class that holds a word and its number of occurrences, to do everything. But the efficiency dropped! After consulting the chm documentation, I found that the Contains method of Hashtable is O(1), and that is why my changes made the program slower. So I came up with a new idea. I still use Hashtable, with a Word as the key and a Word as the value, where the two are the same object. I mean they share the same reference. That way I can use the efficient Contains method. Now my program can finish the word statistics on the vs2012 program with extended mode2 in 1'40''. Perhaps it is still not good enough. And maybe there is an adorable bug somewhere in my program, ruining all my work these days. That's just stupid.
I got a new analysis report, but I still do not know how to map those DLLs to the corresponding functions.
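The idea of using the same object as key and value can be sketched roughly as follows. WordEntry and WordTable are names made up for illustration (my program uses a class called Word), and the case-insensitive Equals and GetHashCode are my assumption about how the lookup should behave.

```csharp
using System;
using System.Collections;

class WordEntry
{
    public string Text;     // smallest spelling seen so far, in ordinal order
    public int Times = 1;   // number of occurrences

    public WordEntry(string text) { Text = text; }

    // Case-insensitive identity, so "File" and "FILE" land in the same slot.
    public override bool Equals(object obj)
    {
        var other = obj as WordEntry;
        return other != null &&
               string.Equals(Text, other.Text, StringComparison.OrdinalIgnoreCase);
    }

    public override int GetHashCode()
    {
        return StringComparer.OrdinalIgnoreCase.GetHashCode(Text);
    }
}

static class WordTable
{
    public static void Add(Hashtable table, string word)
    {
        var probe = new WordEntry(word);
        var existing = (WordEntry)table[probe];   // hash lookup, null if absent
        if (existing == null)
        {
            table[probe] = probe;                 // key and value are the same object
            return;
        }
        existing.Times++;
        if (string.CompareOrdinal(word, existing.Text) < 0)
        {
            // Only the casing changes, so the case-insensitive hash stays valid.
            existing.Text = word;
        }
    }
}
```

Because the stored value is the mutable entry itself, one lookup is enough to update both the count and the recorded spelling. Sorting afterwards can work directly on table.Values without creating new objects.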
-----------------------------------------------------------------I am another gorgeous separator------------------------------------------------------------
Multithreading is a deep hole! This morning (2014-9-24) I debugged until 2:00 a.m., even thinking about not sleeping in order to fix it.
Referring to , I wrote a multithreaded word frequency program. Unfortunately, it always misses some words. I am surely weak. The code is below for reference.
using System;
using System.Threading;
using System.Collections.Concurrent;
using System.Collections;
using System.IO;
using System.Text;
using System.Linq;
using System.Text.RegularExpressions;
namespace ThreadSimple
{
class Program
{
// Field definitions
// Three modes:
// mode1 counts single words only
// mode2 additionally counts two consecutive words
// mode3 additionally counts three consecutive words
enum modes { mode1, mode2, mode3 };
// Static variable indicating the mode, used by all functions
static modes mode;
static string directoryPath;
// Usage hint shown to the user
static string usage =
"Usage: Myapp.exe [-e2]/[-e3] <directory-name>\nAttention: directory-name with space should be in double quotation marks.\nAnd no need to add '\\' at the end of your directory path";
// Pattern 1: simple mode, used to find words
static Regex re1 = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]*");
// Pattern 1 preceded by a space, used by mode 2 when looking for consecutive words
static Regex re1p = new Regex(" [a-zA-Z]{3}[a-zA-Z0-9]*");
// Pattern 2: word + one space + word
static Regex re2 = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");
// Pattern 2 preceded by a space, used by mode 3 when looking for consecutive words
static Regex re2p = new Regex(" [a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");
// Pattern 3: word + one space + word + one space + word
static Regex re3 = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");
// Exit function for an invalid command-line format
static void oops1()
{
System.Console.WriteLine(usage);
Environment.Exit(-1);
}
// Exit function for an invalid command-line argument
static void oops2()
{
System.Console.WriteLine("Invalid directory path");
Environment.Exit(-2);
}
// Check whether the command-line format is correct
static void checkFormat(string[] args)
{
if (args.Length != 1 && args.Length != 2)
{
oops1();
}
if (args.Length == 2 && !(args[0].Equals("-e2") || args[0].Equals("-e3")))
{
oops1();
}
directoryPath = (args.Length == 1) ? args[0] : args[1];
//System.Console.WriteLine("your input directory path is:\n" + directoryPath);
if (directoryPath.EndsWith("\""))
{
directoryPath = directoryPath.Remove(directoryPath.Length - 1);
}
if (!Directory.Exists(directoryPath))
{
oops2();
}
}
// Check whether the command-line arguments are correct
static void config(string[] args)
{
if (args.Length == 2)
{
resultEx = new ConcurrentDictionary<string, Word>(1, 50000);
if (args[0].Equals("-e2"))
{
mode = modes.mode2;
}
else
{
mode = modes.mode3;
}
}
else
{
mode = modes.mode1;
}
}
static ConcurrentDictionary<string, Word> resultEx;
static ConcurrentDictionary<string, Word> result = new ConcurrentDictionary<string, Word>(1, 50000);
static int[] tablet = new int[128];
static BlockingCollection<string> queue;
static Thread WorkerTh0;
static Thread WorkerTh1;
static Thread WorkerTh2;
static void Main(string[] args)
{
// Check the format first
checkFormat(args);
// Then check the arguments
config(args);
string rootdir = directoryPath;
//string[] files = Directory.GetFiles(rootdir, filePattern, SearchOption.AllDirectories);
var files = from file in System.IO.Directory.GetFiles(rootdir, "*.*", System.IO.SearchOption.AllDirectories)
where file.EndsWith(".cpp", StringComparison.OrdinalIgnoreCase) ||
file.EndsWith(".txt", StringComparison.OrdinalIgnoreCase) ||
file.EndsWith(".cs", StringComparison.OrdinalIgnoreCase) ||
file.EndsWith(".h", StringComparison.OrdinalIgnoreCase)
select file;
queue = new BlockingCollection<string>(100);
Thread FileIOth = new Thread(delegate() { Read(files); });
FileIOth.Start();
WorkerTh0 = new Thread(delegate()
{
Process();
});
WorkerTh0.Start();
WorkerTh1 = new Thread(delegate()
{
Process();
});
WorkerTh1.Start();
WorkerTh2 = new Thread(delegate()
{
Process();
FileStream fs = File.Create("Name.txt");
//byte[] buffer;
StreamWriter sw = new StreamWriter(fs);
var outputResult = from value in result
orderby value.Value
select new StringBuilder(value.Value.word).Append(":").Append(value.Value.times);
foreach (var str in outputResult)
{
sw.WriteLine(str);
}
if (mode != modes.mode1)
{
var outputResultEx = from value in resultEx
orderby value.Value
select new StringBuilder(value.Value.word).Append(":").Append(value.Value.times);
int count = 0;
foreach (var str in outputResultEx)
{
++count;
sw.WriteLine(str);
if (count >= 10)
{
break;
}
}
}
sw.Close();
fs.Close();
//DateTime ot = DateTime.Now;
//Console.WriteLine("Time: " + ((ot.Minute * 60 + ot.Second) * 1000 + ot.Millisecond - (dt.Minute * 60 + dt.Second) * 1000 - dt.Millisecond) + "ms");
//Console.ReadKey();
});
WorkerTh2.Start();
}
public static void Read(IEnumerable files)
{
foreach (string file in files)
{
queue.TryAdd(ReadFile(file), -1);
}
queue.TryAdd("\\END", -1);
}
public static string ReadFile(string file)
{
string readLine;
StreamReader sr = new System.IO.StreamReader(file);
//FileStream fs = new FileStream(file, FileMode.Open);
//StreamReader sr = new StreamReader(fs);
readLine = sr.ReadToEnd();
sr.Close();
//fs.Close();
return readLine;
}
// Take files from the file list
public static void Process()
{
string readLine;
while (true)
{
queue.TryTake(out readLine, -1);
if (readLine == "\\END")
{
queue.TryAdd("\\END", -1);
break;
}
Compute(readLine);
}
}
public static void countEx(string s)
{
Word w = null;
if (resultEx.TryGetValue(s.ToUpper(), out w))
{
if (strcmp(w.word, s) > 0)
w.word = s;
w.increase();
}
else
{
resultEx.TryAdd(s.ToUpper(), w = new Word(s));
}
}
public static void count(string s)
{
Word w = null;
if (result.TryGetValue(s.ToUpper(), out w))
{
if (strcmp(w.word, s) > 0)
w.word = s;
w.increase();
}
else
{
result.TryAdd(s.ToUpper(), w = new Word(s));
}
}
public static void Compute(string readLine)
{
string content = readLine;
string[] splited = Regex.Split(readLine, "[^a-zA-Z0-9]");
// Then check whether each split part meets the definition of a word
foreach (string s in splited)
{
// If it meets the definition, count it
if (Regex.IsMatch(s, "^[a-zA-Z]{3}[a-zA-Z0-9]*"))
{
count(s);
}
}
if (mode == modes.mode2)
{
Match match, mtp;
int index = 0;
while ((match = re2.Match(content, index)).Success)
{
countEx(match.Value);
mtp = re1p.Match(content, index);
index = match.Index + match.Length - mtp.Length + 1;
}
}
if (mode == modes.mode3)
{
Match match, mtp;
int index = 0;
while ((match = re3.Match(content, index)).Success)
{
countEx(match.Value);
mtp = re2p.Match(content, index);
index = match.Index + match.Length - mtp.Length + 1;
}
}
}
public static int strcmp(string word, string tp)
{
int len = Math.Min(tp.Length, word.Length);
for (int i = 0; i < len; ++i)
{
if (word[i] < tp[i])
return -1;
else if (word[i] > tp[i])
return 1;
}
return word.Length - tp.Length;
}
public static StringBuilder ToLower(StringBuilder str)
{
for (int i = 0; i < str.Length; i++)
{
if (str[i] <= 'Z')
{
str[i] = (char)((int)str[i] + 32);
}
}
return str;
}
}
class Word : IComparer, IComparable
{
public string word { get; set; }
public int times { get; set; }
public Word increase() { ;++times; return this; }
public string newWord(string w)
{
Console.WriteLine("old word = {0} and new word = {1}", word, w);
if (w.CompareTo(word) > 0)
{
word = w;
}
Console.WriteLine("Word=" + word);
return word;
}
public override bool Equals(object obj)
{
if (obj is Word)
{
return ((Word)obj).word.ToUpper().Equals(this.word.ToUpper());
}
if (obj is String)
{
return ((String)obj).ToUpper().Equals(this.word.ToUpper());
}
return false;
}
public override int GetHashCode()
{
return word.ToUpper().GetHashCode();
}
public int CompareTo(Object w)
{
if (times == ((Word)w).times)
{
string tp = ((Word)w).word;
int len = Math.Min(tp.Length, word.Length);
for (int i = 0; i < len; ++i)
{
if (word[i] < tp[i])
return -1;
else if (word[i] > tp[i])
return 1;
}
return word.Length - tp.Length;
}
else
{
return ((Word)w).times - times;
}
}
public int Compare(Object wa, Object wb)
{
if (((Word)wa).times == ((Word)wb).times)
{
return ((Word)wa).word.CompareTo(((Word)wb).word);
}
else
{
return ((Word)wb).times - ((Word)wa).times;
}
}
public Word(string fa, int times = 1)
{
word = fa;
this.times = times;
}
public override string ToString()
{
return word + ":" + times;
}
}
}
Due to time pressure, I did not do much optimization on it; some useless code has not been deleted, and some bugs have not been fixed. It is just for learning about multithreading.
As for the single-threaded program, Dictionary is better than Hashtable. Besides, Dictionary has a thread-safe counterpart, ConcurrentDictionary.
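Two things in the listing above may explain the missing words, though I have not verified this by running it. First, the last worker writes the output file as soon as its own Process call returns, without waiting for the other two workers. Second, count does TryGetValue and then increase as separate steps, and a failed TryAdd drops that occurrence. A minimal sketch of a safer shape, assuming a simplified word split on single spaces and a made-up class name ParallelCounter, could look like this:

```csharp
using System.Collections.Concurrent;
using System.Threading;

static class ParallelCounter
{
    // Counts space-separated words from all chunks using the given number of workers.
    public static ConcurrentDictionary<string, int> Count(string[] chunks, int workers)
    {
        var counts = new ConcurrentDictionary<string, int>();
        var queue = new BlockingCollection<string>();
        var threads = new Thread[workers];

        for (int i = 0; i < workers; i++)
        {
            threads[i] = new Thread(() =>
            {
                // The loop ends once CompleteAdding has been called and the queue is empty,
                // so no special end marker string is needed.
                foreach (string chunk in queue.GetConsumingEnumerable())
                {
                    foreach (string w in chunk.Split(' '))
                    {
                        if (w.Length == 0) continue;
                        // AddOrUpdate retries internally, so concurrent increments are not lost.
                        counts.AddOrUpdate(w.ToUpperInvariant(), 1, (key, old) => old + 1);
                    }
                }
            });
            threads[i].Start();
        }

        foreach (string chunk in chunks) queue.Add(chunk);
        queue.CompleteAdding();

        // Wait for every worker before anyone reads the results.
        foreach (Thread t in threads) t.Join();
        return counts;
    }
}
```

The sketch only counts single words and ignores the word definition and the smallest-spelling rule; it is meant to show the Join and AddOrUpdate pattern, not to replace the program.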