Fast scanning of a text file: counting lines and returning the index position of each line (Delphi, C#)

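A project required me to scan a text file of 12 million lines. With pointers from other developers and some testing, I found that the gap between C# and Delphi is not large. Without further ado, here is the code and the test results: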

Here is the Delphi code:


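// Scan the file for line breaks and return the byte offset at which each line starts.
// TInt64Array, RowLeast (minimum line length in bytes) and perArray (array growth step)
// are declared elsewhere in the project.
function ScanEnterFile(const FileName: string): TInt64Array;
var
  MyFile: TMemoryStream;  // whole file held in memory
  rArray: TInt64Array;    // line-start offsets (the result)
  size, curIndex: Int64;  // file size, current position
  enterCount: Int64;      // number of line breaks found
  pc: PAnsiChar;          // byte-wise view of the buffer (PChar is 2 bytes wide in Unicode Delphi)
  arrayCount: Int64;      // current capacity of rArray
  addStep: Integer;       // distance from curIndex to the start of the next line
begin
  Result := nil;
  if (FileName = '') or not FileExists(FileName) then
    Exit;
  MyFile := TMemoryStream.Create;
  try
    MyFile.LoadFromFile(FileName);  // read the whole file into memory
    size := MyFile.Size;
    pc := MyFile.Memory;            // point at the start of the buffer
    curIndex := RowLeast;
    arrayCount := perArray;
    SetLength(rArray, arrayCount);
    enterCount := 0;
    rArray[0] := 0;                 // the first line starts at offset 0
    // Lines end in CR LF, so probing every second byte always hits one of the two.
    while curIndex < size do
    begin
      addStep := 0;
      if Ord(pc[curIndex]) = 13 then
        addStep := 2   // on CR: the next line starts after the LF
      else if Ord(pc[curIndex]) = 10 then
        addStep := 1;  // on LF: the next line starts at the next byte
      // a line break was found
      if addStep <> 0 then
      begin
        Application.ProcessMessages;  // keep the UI responsive (called once per line)
        // record one more line
        Inc(enterCount);
        // grow the array when it is full
        if enterCount mod perArray = 0 then
        begin
          arrayCount := arrayCount + perArray;
          SetLength(rArray, arrayCount);
        end;
        rArray[enterCount] := curIndex + addStep;
        // skip the line break plus the minimum line length
        curIndex := curIndex + addStep + RowLeast;
      end
      else
        curIndex := curIndex + 2;
    end;
    SetLength(rArray, enterCount + 1);  // trim unused capacity
    Result := rArray;
  finally
    MyFile.Free;
  end;
end;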

Calling code:


procedure TMainForm.btn2Click(Sender: TObject);
var
  datasIndex: TInt64Array;  // line index of the data file
  t1: Cardinal;             // start time in milliseconds
begin
  t1 := GetTickCount;
  datasIndex := ScanEnterFile('R:\201201_dataFile.txt');
  Caption := Caption + '::' + IntToStr(GetTickCount - t1);  // elapsed milliseconds
end;

Result: 16782 ms

Here is the C# code:


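        /// <summary>
        /// Scans a text file, counts its lines and returns the start offset of every line
        /// (on 12 million rows, about 10 seconds faster than the array-based version).
        /// </summary>
        /// <param name="fileName">File name</param>
        /// <param name="rowCount">Number of lines</param>
        /// <param name="rowLeast">Minimum length of a line, in bytes</param>
        /// <param name="progress">Progress callback (ThreadProgress is declared elsewhere), invoked with the running line-break count</param>
        /// <returns>List of line-start offsets</returns>
        public static IList<long> ScanEnterFile(string fileName, out int rowCount, int rowLeast, ThreadProgress progress)
        {
            rowCount = 0;
            if (string.IsNullOrEmpty(fileName))
                return null;
            if (!System.IO.File.Exists(fileName))
                return null;
            IList<long> rList = new List<long>();
            rList.Add(0); // the first line starts at offset 0, so the list has rowCount entries
            int enterCount = 0; // number of line breaks found
            int checkValue;
            int addStep;
            // Open the file as a stream with an 8-byte buffer; bytes are read one at a time.
            using (FileStream myFile = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8))
            {
                myFile.Position = rowLeast;
                checkValue = myFile.ReadByte();
                while (checkValue != -1)
                {
                    //Application.DoEvents();
                    addStep = -1;
                    // ReadByte has already advanced Position by one byte.
                    // On CR (13) the next line starts one byte further on, after the LF;
                    // on LF (10) it starts at the current Position.
                    if (checkValue == 13)
                        addStep = 1;
                    else if (checkValue == 10)
                        addStep = 0;
                    if (addStep >= 0)
                    {
                        enterCount++;
                        rList.Add(myFile.Position + addStep);
                        // skip the rest of the line break plus the minimum line length
                        myFile.Seek(rowLeast + addStep, SeekOrigin.Current);
                        progress(enterCount);
                    }
                    else
                        // ReadByte already moved one byte, so seek one more to probe every
                        // second byte (seeking 2 would step by 3 and could jump over a CR LF pair)
                        myFile.Seek(1, SeekOrigin.Current);
                    checkValue = myFile.ReadByte();
                }
            }
            rowCount = enterCount + 1;
            return rList;
        }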

Calling code:

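            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            int rowCount;
            // 35 is the minimum line length; outputProgress is the caller's ThreadProgress callback
            FileHelper.ScanEnterFile(@"R:\201201_dataFile.txt", out rowCount, 35, outputProgress);
            stopwatch.Stop();
            useTime = stopwatch.ElapsedMilliseconds;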

Result: 124925 ms

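(As many readers pointed out, this method does not load the file into memory. It reads the file one byte at a time, which is far slower than the Delphi approach of reading the whole file into memory first. Reading byte by byte only makes sense on old machines that are short of memory. Memory is cheap today, so this method is obsolete. Following readers' suggestions, the version below uses ReadLine instead and takes about 6 seconds.)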


        public static IList<long> ScanEnterFile(string fileName, ThreadProgress progress)
        {
            if (string.IsNullOrEmpty(fileName))
                return null;
            if (!System.IO.File.Exists(fileName))
                return null;
            IList<long> rList = new List<long>();
            rList.Add(0); // the first line starts at offset 0
            using (StreamReader sr = File.OpenText(fileName))
            {
                string rStr = sr.ReadLine();
                while (null != rStr)
                {
                    // Next start = previous start + characters in the line + 2 for CR LF.
                    // Only correct when every character is one byte and lines end in CR LF.
                    rList.Add(rList[rList.Count - 1] + rStr.Length + 2);
                    rStr = sr.ReadLine();
                    progress(rList.Count);
                }
            }
            return rList;
        }

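Testing shows that when the file contains Chinese (multi-byte) characters, the positions this method returns are wrong, because rStr.Length counts characters rather than bytes. I will update this post once I find a fix.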
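One possible way around the wrong positions, offered here as a sketch and not as the author's eventual fix, is to stop decoding text while indexing: read the file in large blocks and record the byte offset after every LF byte. The class and method names below (LineIndexSketch, ScanLineStarts, ReadLineAt) are hypothetical and not part of the original FileHelper class. The sketch assumes an encoding in which the byte 0x0A only ever means a line feed, which holds for ASCII, UTF-8 and GBK but not for UTF-16. A byte order mark, if present, stays part of line 0.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LineIndexSketch
{
    // Returns the byte offset at which each line starts.
    // Hypothetical helper; assumes 0x0A never occurs inside a multi-byte character.
    public static List<long> ScanLineStarts(Stream input, int bufferSize = 1 << 16)
    {
        var starts = new List<long> { 0 };   // the first line starts at offset 0
        var buffer = new byte[bufferSize];
        long basePos = 0;                    // file offset of buffer[0]
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                // LF ends both "\n" and "\r\n" lines; the next line starts right after it.
                if (buffer[i] == 10)
                    starts.Add(basePos + i + 1);
            }
            basePos += read;
        }
        // A line break at the very end of the file does not start a new line.
        if (starts.Count > 1 && starts[starts.Count - 1] == basePos)
            starts.RemoveAt(starts.Count - 1);
        return starts;
    }

    // Reads line number `line` (0-based) by seeking to its recorded byte offset.
    public static string ReadLineAt(Stream input, IList<long> starts, int line, Encoding encoding)
    {
        long begin = starts[line];
        long end = line + 1 < starts.Count ? starts[line + 1] : input.Length;
        var bytes = new byte[end - begin];
        input.Seek(begin, SeekOrigin.Begin);
        int total = 0;
        while (total < bytes.Length)
        {
            int n = input.Read(bytes, total, bytes.Length - total);
            if (n == 0) break;               // unexpected end of stream
            total += n;
        }
        // Decode only this line, then drop its trailing line break.
        return encoding.GetString(bytes, 0, total).TrimEnd('\r', '\n');
    }
}
```

With a FileStream opened on the data file, ScanLineStarts(stream) builds the index once and ReadLineAt(stream, starts, n, Encoding.UTF8) then fetches any line without rescanning. Because the offsets are counted in bytes, they stay valid whatever the width of the characters on each line.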

Testing also shows that in C#, using IList<T> is faster than using an array.
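The post does not show the array-based C# version, so the cause of this difference is not known. One plausible factor, assuming the array was grown by a fixed step the way perArray is used in the Delphi code, is the amount of copying: List<T> doubles its capacity when it fills up, while a fixed-step array copies its whole contents on every resize. The sketch below only counts copied elements for the two strategies; GrowthSketch and its methods are hypothetical names made up for this illustration.

```csharp
using System;

public static class GrowthSketch
{
    // Elements copied when a buffer of `step` slots grows by `step` each time it fills.
    public static long CopiesFixedStep(long items, long step)
    {
        long copies = 0;
        for (long capacity = step; capacity < items; capacity += step)
            copies += capacity;   // each resize copies the full old buffer
        return copies;
    }

    // Elements copied when the capacity doubles each time it fills, as List<T> does.
    public static long CopiesDoubling(long items, long initial)
    {
        long copies = 0;
        for (long capacity = initial; capacity < items; capacity *= 2)
            copies += capacity;
        return copies;
    }
}
```

The fixed-step count grows roughly with the square of the number of items, the doubling count only linearly, so on millions of rows the gap can be large. If the final size is roughly known in advance, passing it to the List<long> constructor avoids most of the copying in either case.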

Summary: every tool has its place. Which one you choose depends on your own needs, and I am not taking sides here. What matters is getting the job done.

Original work by 努力偷懒. If you repost it, please credit the source: http://blog.csdn.net/kfarvid or http://www.cnblogs.com/kfarvid/

http://www.cnblogs.com/kfarvid/archive/2012/01/12/2320692.html
