Individual Project Records
At midnight on September 20, I finished my individual project, a word frequency program. The detailed requirements are at http://www.cnblogs.com/jiel/p/3978727.html
Before I started coding, I expected to finish in about 4 hours or less, because it did not seem difficult: perhaps 1 hour for the I/O part and 3 hours for the functional part. In fact, the program cost me about 8 hours, not including optimization or other work.
Since the program is not complex, I didn't use object-oriented techniques. I divided it into five modules: format checking, parameter configuration, word counting, word sorting and file writing. Each module corresponds to one function: checkFormat, config, countWords, sortWords and writeFile.
The first two modules and the last one are relatively simple. I only want to talk about the last module, file writing, because I am not familiar with file I/O in C#. But this module is not hard: use a static method of the File class to get a FileStream, use the stream to initialize a StreamWriter, and then you can write to the file easily.
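The file-writing step described above can be sketched roughly as follows. This is only an illustration: the class name ResultWriter and the method WriteLines are made up here and are not the names used in the original program.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not the author's writeFile function.
static class ResultWriter
{
    public static void WriteLines(string path, IEnumerable<string> lines)
    {
        // File.Create is the static method that returns a FileStream;
        // the StreamWriter wraps that stream. The using blocks close both.
        using (FileStream fs = File.Create(path))
        using (StreamWriter sw = new StreamWriter(fs))
        {
            foreach (string line in lines)
            {
                sw.WriteLine(line);
            }
        }
    }
}
```

A call such as ResultWriter.WriteLines("Name.txt", lines) would then write one "word:count" entry per line.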
The other modules, word counting and word sorting, are the core of this program. Naturally, I use a Hashtable to implement word counting and an ArrayList to implement word sorting. In simple mode, I first use Regex.Split to get all the strings separated by separators, and then check one by one whether each string is a word satisfying the definition. If a string is a word, I add it to a word-count Hashtable and a word-word Hashtable. In the second Hashtable I record the form of the word that is smallest in dictionary order. In mode2 and mode3, things change. I cannot just use a regex to pick up every run of two or three consecutive words, because regex matches do not overlap. For example, if the content is "how are you", I should get two word pairs, "how are" and "are you". My solution is to pass the start index parameter to the regex.Match method. I write regex.Match rather than Regex.Match here because regex is an object and the method is an instance method, not a static one.
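As a minimal sketch of the start-index idea, the loop below restarts each search at the second word of the previous match, so that word can open the next pair. The names PairFinder and FindPairs are invented for this example. The lookbehind at the front of the pattern is an addition that the original regexes do not have; it keeps a match from starting in the middle of a longer token such as "123file".

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating overlapping two-word matches.
static class PairFinder
{
    // word + one space + word; each word is 3+ letters followed by letters/digits.
    static readonly Regex Pair = new Regex(
        "(?<![a-zA-Z0-9])([a-zA-Z]{3}[a-zA-Z0-9]*) ([a-zA-Z]{3}[a-zA-Z0-9]*)");

    public static List<string> FindPairs(string content)
    {
        var pairs = new List<string>();
        int index = 0;
        Match match;
        // Instance method Match(input, startat): search from position index.
        while ((match = Pair.Match(content, index)).Success)
        {
            pairs.Add(match.Value);
            // Restart at the second word so it can begin the next pair.
            index = match.Groups[2].Index;
        }
        return pairs;
    }
}
```

For the content "how are you" this yields "how are" and then "are you". The loop ends because the restart position always moves forward.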
Looking back on the process of writing this code, I think it was not difficult as a whole. What really cost a lot of time was looking up information, because I am not very familiar with C#.
My work on performance analysis is as follows.
I have to say the performance analysis task really frustrated me. I searched the Internet for a guide and did as the guide said, but I could not get the expected result. Below is my performance analysis graph. By the way, I used the third analysis method; otherwise I could not get a report.
For some reason, I put a nice graph first, though I forgot how I got it.



From the report I can see that the countWords function took most of the time. After all, the main job of this program is counting words. Within countWords, GetFiles and myCountWordsInFiles split the time about half and half. My optimization focuses on myCountWordsInFiles.
Let's look at the information about function myCountWordsInFiles.

Now I know the optimization target is function count.
In order to test my program, I constructed 10 test cases.
1. Test recognition of word
"file123 123file 1er u4y5 asd"
Should be:
asd:1
file123:1
2. Test processing of same words when ignoring case
"File FILE file asd Asd ASD AsD"
Should be:
ASD:4
FILE:3
3. Test recognition of continuous two words
"abc def ghi jkl mno"
Should be:
abc def:1
def ghi:1
ghi jkl:1
jkl mno:1
4. Test sorting
"FILE file ASD asd asD ASC asc ASc"
Should be:
ASC:3
ASD:3
FILE:2
5. Test sorting continuous two words
"hello World
hello world
how are you
how Are you
How are you"
Should be:
Are you:3
How are:3
hello World:2
6. Test sorting continuous three words
"how are you
how Are you
fine thank you and YOU
fine Thank you And you
fine thank YOU and"
Should be:
fine Thank you:3
Thank you And:3
how Are you:2
you And you:2
P.S. In dictionary order, "fine Thank you And you" should come before "fine thank you and YOU", but in ASCII order, the order should be as in the original. As far as I know, both ways of sorting are used by some classmates.
7. Test empty directory and empty file
Should be: an empty output file
8. Test separator
"sgq&qwge#wet@wqe t$111sdf"
Should recognize words: sgq qwge wet wqe
9. Test files with suffix ".h", ".cs", ".cpp", ".txt" and files with other suffixes.
Only content in files with suffix ".h", ".cs", ".cpp", ".txt" should be counted
10. Test with vast files including all above cases
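Test cases 1, 2 and 4 can be checked mechanically with a small sketch like the one below. It is not the original countWords/sortWords code. It assumes that the spelling kept for each word is the smallest one in ordinal (ASCII) order, and that ties in count are broken by ordinal order of that spelling. The names WordFrequencySketch and Report are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical single-word counter used only to check the test cases.
static class WordFrequencySketch
{
    // A word: at least three letters first, then letters or digits.
    static readonly Regex WordPattern = new Regex("^[a-zA-Z]{3}[a-zA-Z0-9]*$");

    // Returns "word:count" lines, most frequent first.
    public static List<string> Report(string text)
    {
        // Key: upper-cased word. Value: (smallest spelling seen, count).
        var table = new Dictionary<string, KeyValuePair<string, int>>();
        foreach (string token in Regex.Split(text, "[^a-zA-Z0-9]"))
        {
            if (!WordPattern.IsMatch(token)) continue;
            string key = token.ToUpperInvariant();
            KeyValuePair<string, int> seen;
            if (table.TryGetValue(key, out seen))
            {
                string display = string.CompareOrdinal(token, seen.Key) < 0 ? token : seen.Key;
                table[key] = new KeyValuePair<string, int>(display, seen.Value + 1);
            }
            else
            {
                table[key] = new KeyValuePair<string, int>(token, 1);
            }
        }
        return table.Values
            .OrderByDescending(e => e.Value)
            .ThenBy(e => e.Key, StringComparer.Ordinal)
            .Select(e => e.Key + ":" + e.Value)
            .ToList();
    }
}
```

Calling Report on the input of test case 4 gives ASC:3, ASD:3, FILE:2, in that order, which matches the expected output listed above.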
Maybe I can consider this project finished. But I do not think I got enough payback compared to the time I spent on it. And I think the evaluation standard, in which rank n corresponds to 1/n of full points, really sucks. It puts heavy pressure on me. Maybe the biggest gain is that I just wrote my first English blog post. But knowing that I wrote it for CE under that kind of evaluation standard, I cannot say I am happy.
-----------------------------------------------------------------I am a gorgeous separator------------------------------------------------------------------
At 2014-9-22 21:15:02, I just finished a round of optimization. Below are my thoughts and measures.
In my program, I found that in order to sort the Hashtable, too much time was spent picking each key/value pair from it, initializing a new object and putting it in an ArrayList. I also thought there was no need to use a Hashtable in this program, so at first I used only Word, a class holding a word and its number of occurrences, to do everything. But the efficiency decreased! After consulting the CHM documentation, I found that the Contains method of Hashtable is O(1), and that is why my adjustments made the program slower. So I came up with a new idea. I still use a Hashtable, with a Word as key and a Word as value, where the key and the value are the same object; I mean they are the same reference. That way I can still use the efficient Contains method. Now my program can finish word statistics on the vs2012 program with extended mode2 in 1'40''. Perhaps that is still not good enough. And maybe there is an adorable bug somewhere in my program, ruining all my work these days. That's just stupid.
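The idea of letting one object serve both as the counter and as the sortable record might look like the sketch below. It is an assumption-laden illustration, not the optimized program itself: it uses a generic Dictionary keyed by the upper-cased string instead of a Hashtable keyed by Word, and the names WordEntry and Counter are invented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record: one object per distinct word, updated in place.
sealed class WordEntry : IComparable<WordEntry>
{
    public string Text;
    public int Count = 1;
    public WordEntry(string text) { Text = text; }

    // Higher count first; ties broken by ordinal order of the spelling.
    public int CompareTo(WordEntry other)
    {
        if (Count != other.Count) return other.Count.CompareTo(Count);
        return string.CompareOrdinal(Text, other.Text);
    }
}

static class Counter
{
    public static List<WordEntry> CountAndSort(IEnumerable<string> words)
    {
        var table = new Dictionary<string, WordEntry>();
        foreach (string w in words)
        {
            string key = w.ToUpperInvariant();
            WordEntry entry;
            if (table.TryGetValue(key, out entry))   // one O(1) lookup
            {
                entry.Count++;
                if (string.CompareOrdinal(w, entry.Text) < 0) entry.Text = w;
            }
            else
            {
                table.Add(key, new WordEntry(w));
            }
        }
        // The stored objects are sorted directly; no per-pair copies are built.
        var sorted = new List<WordEntry>(table.Values);
        sorted.Sort();
        return sorted;
    }
}
```

The lookup stays a hash lookup, and the sort step only copies references into a list instead of creating a new object for every key/value pair.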
I got a new analysis report, but I still do not know how to map those DLLs to the corresponding functions.


-----------------------------------------------------------------I am another gorgeous separator------------------------------------------------------------
Multithreading is a deep hole! This morning (2014-9-24) I debugged until 2:00 a.m., even thinking about not sleeping until I fixed it.
Working from a reference, I wrote a multithreaded word frequency program. Unfortunately, it always misses some words. I am clearly too weak. The code is below for reference.
using System;
using System.Threading;
using System.Collections.Concurrent;
using System.Collections;
using System.IO;
using System.Text;
using System.Linq;
using System.Text.RegularExpressions;

namespace ThreadSimple
{
    class Program
    {
        // Field definitions
        // Three modes:
        // mode1 counts single words only
        // mode2 additionally counts two consecutive words
        // mode3 additionally counts three consecutive words
        enum modes { mode1, mode2, mode3 };
        // Static variable indicating the mode, used by all functions
        static modes mode;
        static string directoryPath;
        // Usage hint shown to the user
        static string usage =
            "Usage: Myapp.exe [-e2]/[-e3] <directory-name>\nAttention: directory-name with space should be in double quotation marks.\nAnd no need to add '\\' at the end of your directory path";
        // Pattern 1: simple mode, used to find single words
        static Regex re1 = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]*");
        // Pattern 1 with a leading space, used by mode2 when looking for consecutive words
        static Regex re1p = new Regex(" [a-zA-Z]{3}[a-zA-Z0-9]*");
        // Pattern 2: word + one space + word
        static Regex re2 = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");
        // Pattern 2 with a leading space, used by mode3 when looking for consecutive words
        static Regex re2p = new Regex(" [a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");
        // Pattern 3: word + one space + word + one space + word
        static Regex re3 = new Regex("[a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]* [a-zA-Z]{3}[a-zA-Z0-9]*");

        // Exit function for a malformed command line
        static void oops1()
        {
            System.Console.WriteLine(usage);
            Environment.Exit(-1);
        }

        // Exit function for an invalid command-line argument
        static void oops2()
        {
            System.Console.WriteLine("Invalid directory path");
            Environment.Exit(-2);
        }

        // Check whether the command-line format is correct
        static void checkFormat(string[] args)
        {
            if (args.Length != 1 && args.Length != 2)
            {
                oops1();
            }
            if (args.Length == 2 && !(args[0].Equals("-e2") || args[0].Equals("-e3")))
            {
                oops1();
            }
            directoryPath = (args.Length == 1) ? args[0] : args[1];
            //System.Console.WriteLine("your input directory path is:\n" + directoryPath);
            if (directoryPath.EndsWith("\""))
            {
                directoryPath = directoryPath.Remove(directoryPath.Length - 1);
            }
            if (!Directory.Exists(directoryPath))
            {
                oops2();
            }
        }

        // Check whether the command-line arguments are correct and set the mode
        static void config(string[] args)
        {
            if (args.Length == 2)
            {
                resultEx = new ConcurrentDictionary<string, Word>(1, 50000);
                if (args[0].Equals("-e2"))
                {
                    mode = modes.mode2;
                }
                else
                {
                    mode = modes.mode3;
                }
            }
            else
            {
                mode = modes.mode1;
            }
        }

        static ConcurrentDictionary<string, Word> resultEx;
        static ConcurrentDictionary<string, Word> result = new ConcurrentDictionary<string, Word>(1, 50000);
        static int[] tablet = new int[128];
        static BlockingCollection<string> queue;
        static Thread WorkerTh0;
        static Thread WorkerTh1;
        static Thread WorkerTh2;

        static void Main(string[] args)
        {
            // First check the format
            checkFormat(args);
            // Then check the arguments
            config(args);
            string rootdir = directoryPath;
            //string[] files = Directory.GetFiles(rootdir, filePattern, SearchOption.AllDirectories);
            var files = from file in System.IO.Directory.GetFiles(rootdir, "*.*", System.IO.SearchOption.AllDirectories)
                        where file.EndsWith(".cpp", StringComparison.OrdinalIgnoreCase) ||
                              file.EndsWith(".txt", StringComparison.OrdinalIgnoreCase) ||
                              file.EndsWith(".cs", StringComparison.OrdinalIgnoreCase) ||
                              file.EndsWith(".h", StringComparison.OrdinalIgnoreCase)
                        select file;
            queue = new BlockingCollection<string>(100);
            Thread FileIOth = new Thread(delegate() { Read(files); });
            FileIOth.Start();
            WorkerTh0 = new Thread(delegate()
            {
                Process();
            });
            WorkerTh0.Start();
            WorkerTh1 = new Thread(delegate()
            {
                Process();
            });
            WorkerTh1.Start();
            WorkerTh2 = new Thread(delegate()
            {
                Process();
                FileStream fs = File.Create("Name.txt");
                //byte[] buffer;
                StreamWriter sw = new StreamWriter(fs);
                var outputResult = from value in result
                                   orderby value.Value
                                   select new StringBuilder(value.Value.word).Append(":").Append(value.Value.times);
                foreach (var str in outputResult)
                {
                    sw.WriteLine(str);
                }
                if (mode != modes.mode1)
                {
                    var outputResultEx = from value in resultEx
                                         orderby value.Value
                                         select new StringBuilder(value.Value.word).Append(":").Append(value.Value.times);
                    int count = 0;
                    foreach (var str in outputResultEx)
                    {
                        ++count;
                        sw.WriteLine(str);
                        if (count >= 10)
                        {
                            break;
                        }
                    }
                }
                sw.Close();
                fs.Close();
                //DateTime ot = DateTime.Now;
                //Console.WriteLine("Time: " + ((ot.Minute * 60 + ot.Second) * 1000 + ot.Millisecond - (dt.Minute * 60 + dt.Second) * 1000 - dt.Millisecond) + "ms");
                //Console.ReadKey();
            });
            WorkerTh2.Start();
        }

        public static void Read(IEnumerable files)
        {
            foreach (string file in files)
            {
                queue.TryAdd(ReadFile(file), -1);
            }
            queue.TryAdd("\\END", -1);
        }

        public static string ReadFile(string file)
        {
            string readLine;
            StreamReader sr = new System.IO.StreamReader(file);
            //FileStream fs = new FileStream(file, FileMode.Open);
            //StreamReader sr = new StreamReader(fs);
            readLine = sr.ReadToEnd();
            sr.Close();
            //fs.Close();
            return readLine;
        }

        // Take a file from the file list
        public static void Process()
        {
            string readLine;
            while (true)
            {
                queue.TryTake(out readLine, -1);
                if (readLine == "\\END")
                {
                    queue.TryAdd("\\END", -1);
                    break;
                }
                Compute(readLine);
            }
        }

        public static void countEx(string s)
        {
            Word w = null;
            if (resultEx.TryGetValue(s.ToUpper(), out w))
            {
                if (strcmp(w.word, s) > 0)
                    w.word = s;
                w.increase();
            }
            else
            {
                resultEx.TryAdd(s.ToUpper(), w = new Word(s));
            }
        }

        public static void count(string s)
        {
            Word w = null;
            if (result.TryGetValue(s.ToUpper(), out w))
            {
                if (strcmp(w.word, s) > 0)
                    w.word = s;
                w.increase();
            }
            else
            {
                result.TryAdd(s.ToUpper(), w = new Word(s));
            }
        }

        public static void Compute(string readLine)
        {
            string content = readLine;
            string[] splited = Regex.Split(readLine, "[^a-zA-Z0-9]");
            // Then check whether each split part matches the definition of a word
            foreach (string s in splited)
            {
                // If it matches the definition, count it
                if (Regex.IsMatch(s, "^[a-zA-Z]{3}[a-zA-Z0-9]*"))
                {
                    count(s);
                }
            }
            if (mode == modes.mode2)
            {
                Match match, mtp;
                int index = 0;
                while ((match = re2.Match(content, index)).Success)
                {
                    countEx(match.Value);
                    mtp = re1p.Match(content, index);
                    index = match.Index + match.Length - mtp.Length + 1;
                }
            }
            if (mode == modes.mode3)
            {
                Match match, mtp;
                int index = 0;
                while ((match = re3.Match(content, index)).Success)
                {
                    countEx(match.Value);
                    mtp = re2p.Match(content, index);
                    index = match.Index + match.Length - mtp.Length + 1;
                }
            }
        }

        public static int strcmp(string word, string tp)
        {
            int len = Math.Min(tp.Length, word.Length);
            for (int i = 0; i < len; ++i)
            {
                if (word[i] < tp[i])
                    return -1;
                else if (word[i] > tp[i])
                    return 1;
            }
            return word.Length - tp.Length;
        }

        public static StringBuilder ToLower(StringBuilder str)
        {
            for (int i = 0; i < str.Length; i++)
            {
                if (str[i] <= 'Z')
                {
                    str[i] = (char)((int)str[i] + 32);
                }
            }
            return str;
        }
    }

    class Word : IComparer, IComparable
    {
        public string word { get; set; }
        public int times { get; set; }
        public Word increase() { ++times; return this; }
        public string newWord(string w)
        {
            Console.WriteLine("old word = {0} and new word = {1}", word, w);
            if (w.CompareTo(word) > 0)
            {
                word = w;
            }
            Console.WriteLine("Word=" + word);
            return word;
        }
        public override bool Equals(object obj)
        {
            if (obj is Word)
            {
                return ((Word)obj).word.ToUpper().Equals(this.word.ToUpper());
            }
            if (obj is String)
            {
                return ((String)obj).ToUpper().Equals(this.word.ToUpper());
            }
            return false;
        }
        public override int GetHashCode()
        {
            return word.ToUpper().GetHashCode();
        }
        public int CompareTo(Object w)
        {
            if (times == ((Word)w).times)
            {
                string tp = ((Word)w).word;
                int len = Math.Min(tp.Length, word.Length);
                for (int i = 0; i < len; ++i)
                {
                    if (word[i] < tp[i])
                        return -1;
                    else if (word[i] > tp[i])
                        return 1;
                }
                return word.Length - tp.Length;
            }
            else
            {
                return ((Word)w).times - times;
            }
        }
        public int Compare(Object wa, Object wb)
        {
            if (((Word)wa).times == ((Word)wb).times)
            {
                return ((Word)wa).word.CompareTo(((Word)wb).word);
            }
            else
            {
                return ((Word)wb).times - ((Word)wa).times;
            }
        }
        public Word(string fa, int times = 1)
        {
            word = fa;
            this.times = times;
        }
        public override string ToString()
        {
            return word + ":" + times;
        }
    }
}
Due to time pressure, I did not do much optimization on it; some useless code is not deleted and some bugs are not fixed. It is just for learning about multithreading.
As for the single-threaded program, Dictionary is better than Hashtable. Besides, Dictionary has a thread-safe counterpart, ConcurrentDictionary.
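A possible explanation for the missing words, offered as a guess rather than a confirmed diagnosis: in the listing, count() does a TryGetValue followed by a separate TryAdd or a non-atomic ++times, so two workers can lose an update for the same word, and WorkerTh2 starts writing the output as soon as its own Process() returns, without waiting for WorkerTh0 and WorkerTh1. The sketch below shows one way to avoid both, using ConcurrentDictionary.AddOrUpdate and Thread.Join. The names ParallelCounter and Count are invented, the word pattern adds a lookbehind that the original does not have, and the sketch keeps only counts, not the smallest spelling of each word.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading;

// Hypothetical thread-safe counter; assumes workerCount >= 1.
static class ParallelCounter
{
    static readonly Regex WordPattern =
        new Regex("(?<![a-zA-Z0-9])[a-zA-Z]{3}[a-zA-Z0-9]*");

    public static ConcurrentDictionary<string, int> Count(IEnumerable<string> texts, int workerCount)
    {
        var counts = new ConcurrentDictionary<string, int>();
        var queue = new BlockingCollection<string>(100);
        var workers = new List<Thread>();

        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                // The loop ends once the queue is marked complete and drained.
                foreach (string text in queue.GetConsumingEnumerable())
                {
                    foreach (Match m in WordPattern.Matches(text))
                    {
                        // Insert 1 or add 1 to the stored value as one atomic step.
                        counts.AddOrUpdate(m.Value.ToUpperInvariant(), 1, (key, old) => old + 1);
                    }
                }
            });
            worker.Start();
            workers.Add(worker);
        }

        foreach (string text in texts) queue.Add(text);
        queue.CompleteAdding();          // replaces the "\\END" marker string

        // Wait for every worker before anyone reads the result.
        foreach (Thread worker in workers) worker.Join();
        return counts;
    }
}
```

CompleteAdding together with GetConsumingEnumerable also removes the need to push the end marker back into the queue for the other workers.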