BUAA Advanced Software Engineering

Project:  Individual Project - Word frequency program

Ryan Mao (毛宇)-1106116_11061171

Implement a console application to tally the frequency of words under a directory (2 modes).

1) Before you implement this project, record your estimate of the time you WILL spend on each component of your program.

Before I started, I split the assignment into five parts based on my understanding of the project.

First I had to design my approach to the problem (tallying word frequencies). I decided to divide the whole process as follows:

  1. Traverse the directory: A function capable of visiting all the files and subfolders under the specified directory. I chose a breadth-first traversal. Since traversing the directory cannot be done without the Windows API (as far as I know), I planned to spend half an hour studying the API.
  2. Statistics: A module that records how many times a certain word has been read. Factors such as the mode and the letter case are taken into account. Before I started I could not tell the details, so I planned to spend two hours on this part.
  3. Sort&&Output: This step ranks the records according to the stated requirement. With the help of the “algorithm” library, I guessed I could finish it in half an hour.
  4. Debug&&improving the performance: The time needed here is rather unpredictable, so I planned to spend less than 2 hours on it.
  5. Blog: 1.5 hours.

In total, I estimated 6.5 hours to finish the assignment.

2) After you have implemented this project, record the ACTUAL time you spent on each component of your program.

Actually, I spent about 10 hours on this assignment. For each component:

1. 2 hours (Traverse the directory)

2. 2.5 hours (Statistics)

3. 1.5 hours (Sort&&Output)

4. 2.5 hours (Debug&&improving the performance)

5. 1.5 hours (Blog)

3) Describe how much time you spent on improving the performance of your program, and show a performance analysis graph (generated by the VS2012 performance analysis tool). If possible, please show the most costly function in your program.

I did not spend much time on debugging (about 40 minutes), but I spent more time on improving the performance (1 hour and 50 minutes).

For the analysis, I used a folder (5 MB) containing 25 English novels.

Graph:

[Figure: VS2012 performance report, Summary view]

[Figure: VS2012 performance report, Function Details view]

Tracing the result to find the most time-consuming part:

It shows that the function used to traverse the files costs the most time (quite surprising, since I thought it was the easiest part!).

Let’s go deeper!

One level down, it is clear that the function statistics, which analyzes each file, costs the most time.

Deeper…

The result is obvious: we should improve the parts of the source code highlighted in red.

Analysis

The costly part I just mentioned is this code:

       if(low_first_table.count(temp)==0) // check whether this is the first occurrence
       {
         low_first_table[temp]=word;
       }
       // tally
       count_table[low_first_table[temp]]++;

Here it would be better to use the function “find” instead of “count”.

Also, for the second part, it would be better to avoid redundant visits to low_first_table[temp]: on a first occurrence that value is simply word, so it can be used directly.

So it should be like this:

       if(low_first_table.find(temp)==low_first_table.end()) // check whether this is the first occurrence
       {
         low_first_table[temp]=word;
         count_table[word]++;
       }
       // otherwise, tally under the spelling that was stored first
       else
       {
         count_table[low_first_table[temp]]++;
       }
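The same idea can be pushed one step further so that low_first_table is searched only once per word. The sketch below is an illustration rather than the code that was profiled: the WordCounter struct and its add method are hypothetical, and only the two table names come from the snippet above. It relies on std::map::emplace, which leaves the map untouched when the key already exists and returns an iterator to the stored entry in either case.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical wrapper around the two tables used in the post.
struct WordCounter {
    std::map<std::string, std::string> low_first_table;  // lower-case key -> first spelling seen
    std::map<std::string, long> count_table;             // first spelling -> occurrences

    void add(const std::string& word) {
        // Build the case-insensitive key.
        std::string temp = word;
        for (char& c : temp) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        // One search answers both questions: "is this word new?" and
        // "under which spelling do we count it?".
        auto result = low_first_table.emplace(temp, word);
        ++count_table[result.first->second];
    }
};
```

Feeding it “Hello”, “hello” and “heLLO” gives a single entry under “Hello”, the first spelling seen, with a count of 3.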

Let’s take a look at another possible place to make an improvement: the ranking.

First, std::vector is said to offer the most efficient random access among the sequential containers, which makes it the right choice here (from C++ Primer, p. 287).

Second, std::sort is documented to run in O(N log N) time, which is as good as a comparison-based sorting algorithm can get.

Third, sorting is indispensable due to the requirement.

Thus it would be rather tricky to improve this part.
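For reference, the ranking step described here can be sketched as below. This is an assumed shape, not the project's source: rank_words and WordCount are hypothetical names, and the alphabetical tie-break for equal counts is an assumption, since the post does not state the required order for ties.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, long>;

// Copies the counts into a vector and orders them by descending frequency.
// Tie-break (assumption): words with equal counts are ordered alphabetically.
std::vector<WordCount> rank_words(const std::map<std::string, long>& count_table) {
    std::vector<WordCount> ranked(count_table.begin(), count_table.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const WordCount& a, const WordCount& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return ranked;
}
```

Because the comparator fully decides the order of equal counts, the output is deterministic, which makes test case 6 (different words with the same frequency) easy to check.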

The graph after the improvement is:

It’s not very obvious, but we can still observe the improvement:

4) Share your 10 test cases, and describe how you made sure your program produces the correct result. (Programs with incorrect results will get 0 points, regardless of speed.)

My test cases are designed to cover several exceptional situations, including:

  1. Empty folder (obvious).
  2. Empty file (obvious).
  3. The same word in different upper/lower case, like “Hello”, “hello”, “heLLO”, etc.
  4. The same word with different trailing numbers (to test extended mode), like Windows98, Windows99, Windows8, etc.
  5. Words separated by non-alphanumeric characters as delimiters, like: Hello~!!World!****China~~I)()()(Love^&^&You
  6. Different words with the same frequency (to test the “sort” function).
  7. Strings with 3 or fewer letters, like Hi, My, etc.
  8. Strings starting with a number, like 123Hello.
  9. Strings separated by numbers, like humo99rous.
  10. A large volume of test data (to see whether the program crashes).

Besides, I also downloaded the Rost English Words Counter from the Internet to compare its results with mine.
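Cases 5, 7, 8 and 9 in the list above all probe how text is split into words. The post does not spell out the word rules, so the sketch below rests on assumptions that are labeled in the code: every non-alphanumeric character is a delimiter, and a token counts as a word only if it begins with at least three letters (digits may follow). The names is_word and extract_words are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Assumed rule (not stated in the post): a token is a word only if it
// begins with at least `min_letters` letters; anything alphanumeric may follow.
bool is_word(const std::string& token, std::size_t min_letters = 3) {
    if (token.size() < min_letters) return false;
    for (std::size_t i = 0; i < min_letters; ++i) {
        if (!std::isalpha(static_cast<unsigned char>(token[i]))) return false;
    }
    return true;
}

// Splits `text` on every non-alphanumeric character and keeps the tokens
// that pass is_word().
std::vector<std::string> extract_words(const std::string& text) {
    std::vector<std::string> words;
    std::string token;
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            token += c;
        } else {
            if (is_word(token)) words.push_back(token);
            token.clear();
        }
    }
    if (is_word(token)) words.push_back(token);  // flush the last token
    return words;
}
```

Under these assumed rules, the string from case 5 yields Hello, World, China, Love and You, while Hi, My and 123Hello are rejected and humo99rous is accepted as one word. If the real requirement differs, only is_word needs to change.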

5) Describe what you have learned in this exercise.

In this exercise, I planned to solve the problem in 6.5 hours. However, it took me almost ten hours. I was surprised to learn how unpredictable software engineering is. In my subsequent studies, I will try to improve my ability to make plans that are well considered and feasible. This will be hard, since it requires experience and broad knowledge of the field. Anyway, I will try my best to learn this course well.

What’s more, I also learned how to use the Performance Analysis feature in Visual Studio 2012. It’s an awesome tool for developers who want to improve their projects. I will use it more frequently in the future.

Finally, I also learned that it is important to clearly understand the user requirements. In this case, I misunderstood the word “delimiter” and wasted a lot of time on debugging. So, next time, I will make sure I know what the requirement really means.

Thanks!
