[Python学习笔记-008] 使用双向链表去掉重复的文本行

用Python处理文本文件是极方便的，当文本文件中有较多的重复的行的时候，将那些重复的行数去掉并打印诸如"...<repeats X times>..."有助于更好的浏览文本文件的内容。下面将通过Python打造一个双向链表来实现这一功能。如果你对在Python中实现双向链表感兴趣，不妨花五分钟读一读。Have fun :-)

01 - 定义链表结点

 struct node {

     int           lineno;

     char          *line;

     char          *md5;

     char          *dupcnt; /* duplicated counter */

     struct node   *prev;

     struct node   *next;

 };

在Python3中，可以使用字典定义这样的结点。例如：

 node = {}

 node['lineno'] = index + 1

 node['line'] = line.strip().rstrip()

 node['md5'] = md5txt

 node['dupcnt'] = 0

 node['prev'] = index - 1

 node['next'] = index + 1

由于Python的list本身就是可变数组，这就省力多了，我们不需要从C的角度去考虑链表的建立。

02 - 初始化双向链表

 def init_doubly_linked_list(l_in):

     l_out = []

     index = 0

     for text in l_in:

         data = text.strip().rstrip()

         md5 = hashlib.md5(data.encode(encoding='UTF-8')).hexdigest()

         d_node = {}

         d_node['lineno'] = index + 1

         d_node['line'] = data

         d_node['md5'] = md5

         d_node['dupcnt'] = 0

         d_node['prev'] = index - 1

         d_node['next'] = index + 1

         if index == 0:

             d_node['prev'] = None

         if index == len(l_in) - 1:

             d_node['next'] = None

         l_out.append(d_node)

         index += 1

     return l_out

很简单，直接采用尾插法搞定。

03 - 将双向链表中的包含有重复行的结点处理掉

 def omit_doubly_linked_list(l_dll):

     for curr_node in l_dll:

         prev_node_index = curr_node['prev']

         next_node_index = curr_node['next']

         if prev_node_index is None:  # the head node

             prev_node = None

             continue

         else:

             prev_node = l_dll[prev_node_index]

         if next_node_index is None:  # the tail node

             next_node = None

         else:

             next_node = l_dll[next_node_index]

         if curr_node['md5'] != prev_node['md5']:

             continue

         # Update dupcnt of previous node

         prev_node['dupcnt'] += 1

         # Remove current node

         if next_node is not None:

             next_node['prev'] = curr_node['prev']

         if prev_node is not None:

             prev_node['next'] = curr_node['next']

如果当前行的md5跟前一行一样，那说明就重复了。处理的方法如下：

将前一个结点的重复计数器（dupcnt）加1；
把当前结点从双向链表上摘掉（这里我们只修改前驱结点的next和后继结点的prev, 不做实际的删除，因为没必要）。

也许你会问为什么采用md5比较而不采用直接的文本行比较，个人觉得先把文本行的md5算出后，再使用md5比较会更好一些，尤其是文本行很长的时候，因为md5（占128位）的输出总是32个字符。

04 - 遍历处理后的双向链表

 def traverse_doubly_linked_list(l_dll):

     l_out = []

     node_index = None

     if len(l_dll) > 0:

         node_index = 0

     while (node_index is not None):  # <==> p != NULL

         curr_node = l_dll[node_index]

         msg = '%6d\t%s' % (curr_node['lineno'], curr_node['line'])

         l_out.append(msg)

         #

         # 1) If dupcnt is 0, it means subsequent lines don't repeat current

         #    line, just go to visit the next node

         # 2) If dupcnt >= 1, it means subsequent lines repeat the current line

         #    a) If dupcnt is 1, i.e. only one line repeats, just pick it up

         #    b) else save message like '...<repeats X times>...'

         #

         if curr_node['dupcnt'] == 0:

             node_index = curr_node['next']

             continue

         elif curr_node['dupcnt'] == 1:

             msg = '%6d\t%s' % (curr_node['lineno'] + 1, curr_node['line'])

         else:  # i.e. curr_node['dupcnt'] > 1

             msg = '%s\t...<repeats %d times>...' % (' ' * 6,

                                                     curr_node['dupcnt'])

         l_out.append(msg)

         node_index = curr_node['next']

     return l_out

如果当前结点的dupcnt为0，说明它后面的行与之不同，直接打印；
如果当前结点的dupcnt为1，说明它后面的行与之相同，那么打印当前行，再打印下一行，注意行号得加一；
如果当前结点的dupcnt为N(>1)，说明它后面有N行与之重复了，那么打印当前行并再打印...<repeates N times>...。

注意：头结点的prev和尾结点的next都被定义为None。我们因此可以做类C的遍历。典型的C遍历链表是这样的：

for (p = head; p != NULL; p = p->next)

    /* print p->data */

到此为止，在Python中实现一个简单的双向链表就搞定了。其特点是：

用None代表NULL；
头结点的prev指针的值和尾结点的next指针的值均为None；
中间结点的prev指针的值是其前趋结点的下标；
中间结点的next指针的值是其后继结点的下标。

完整的代码实现如下：

 #!/usr/bin/python3

 import sys

 import hashlib

 import getopt

 TC_LOG_OUTPUT_RAW = False

 def init_doubly_linked_list(l_in):

     #

     # Here is the node definition of the doubly linked list

     #

     #   struct node {

     #       int           lineno;

     #       char          *text;

     #       char          *md5;

     #       char          *dupcnt; /* duplicated counter */

     #       struct node   *prev;

     #       struct node   *next;

     #   }

     #

     l_out = []

     index = 0

     for text in l_in:

         data = text.strip().rstrip()

         md5 = hashlib.md5(data.encode(encoding='UTF-8')).hexdigest()

         d_node = {}

         d_node['lineno'] = index + 1

         d_node['line'] = data

         d_node['md5'] = md5

         d_node['dupcnt'] = 0

         d_node['prev'] = index - 1

         d_node['next'] = index + 1

         if index == 0:

             d_node['prev'] = None

         if index == len(l_in) - 1:

             d_node['next'] = None

         l_out.append(d_node)

         index += 1

     return l_out

 def omit_doubly_linked_list(l_dll):

     #

     # Core algorithm to omit repeated lines saved in the doubly linked list

     #

     #   prev_node = curr_node->prev;

     #   next_node = curr_node->next;

     #

     #   if (curr_node->md5 == prev_node.md5) {

     #       prev_node.dupcnt++;

     #

     #       /* remove current node */

     #       next_node->prev = curr_node->prev;

     #       prev_node->next = curr_node->next;

     #   }

     #

     for curr_node in l_dll:

         prev_node_index = curr_node['prev']

         next_node_index = curr_node['next']

         if prev_node_index is None:  # the head node

             prev_node = None

             continue

         else:

             prev_node = l_dll[prev_node_index]

         if next_node_index is None:  # the tail node

             next_node = None

         else:

             next_node = l_dll[next_node_index]

         if curr_node['md5'] != prev_node['md5']:

             continue

         # Update dupcnt of previous node

         prev_node['dupcnt'] += 1

         # Remove current node

         if next_node is not None:

             next_node['prev'] = curr_node['prev']

         if prev_node is not None:

             prev_node['next'] = curr_node['next']

 def traverse_doubly_linked_list(l_dll):

     #

     # Core algorithm to traverse the doubly linked list

     #

     #   p = l_dll;

     #   while (p != NULL) {

     #       /* print p->lineno and p->text */

     #

     #       if (p->dupcnt == 0) {

     #           p = p->next;

     #           continue;

     #       }

     #

     #       if (p->dupcnt == 1)

     #           /* print p->lineno + 1 and p->text */

     #       else /* i.e. > 1 */

     #           printf("...<repeats %d times>...", p->dupcnt);

     #

     #       p = p->next;

     #   }

     #

     l_out = []

     node_index = None

     if len(l_dll) > 0:

         node_index = 0

     while (node_index is not None):  # <==> p != NULL

         curr_node = l_dll[node_index]

         msg = '%6d\t%s' % (curr_node['lineno'], curr_node['line'])

         l_out.append(msg)

         #

         # 1) If dupcnt is 0, it means subsequent lines don't repeat current

         #    line, just go to visit the next node

         # 2) If dupcnt >= 1, it means subsequent lines repeat the current line

         #    a) If dupcnt is 1, i.e. only one line repeats, just pick it up

         #    b) else save message like '...<repeats X times>...'

         #

         if curr_node['dupcnt'] == 0:

             node_index = curr_node['next']

             continue

         elif curr_node['dupcnt'] == 1:

             msg = '%6d\t%s' % (curr_node['lineno'] + 1, curr_node['line'])

         else:  # i.e. curr_node['dupcnt'] > 1

             msg = '%s\t...<repeats %d times>...' % (' ' * 6,

                                                     curr_node['dupcnt'])

         l_out.append(msg)

         node_index = curr_node['next']

     return l_out

 def print_refined_text(l_lines):

     l_dll = init_doubly_linked_list(l_lines)

     omit_doubly_linked_list(l_dll)

     l_out = traverse_doubly_linked_list(l_dll)

     for line in l_out:

         print(line)

 def print_raw_text(l_lines):

     lineno = 0

     for line in l_lines:

         lineno += 1

         line = line.strip().rstrip()

         print('%6d\t%s' % (lineno, line))

 def usage(prog):

     sys.stderr.write('Usage: %s [-r] <logfile>\n' % prog)

 def main(argc, argv):

     shortargs = ":r"

     longargs = ["raw"]

     try:

         options, rargv = getopt.getopt(argv[1:], shortargs, longargs)

     except getopt.GetoptError as err:

         sys.stderr.write("%s\n" % str(err))

         usage(argv[0])

         return 1

     for opt, arg in options:

         if opt in ('-r', '--raw'):

             global TC_LOG_OUTPUT_RAW

             TC_LOG_OUTPUT_RAW = True

         else:

             usage(argv[0])

             return 1

     rargc = len(rargv)

     if rargc < 1:

         usage(argv[0])

         return 1

     logfile = rargv[0]

     with open(logfile, 'r') as file_handle:

         if TC_LOG_OUTPUT_RAW:

             print_raw_text(file_handle.readlines())

         else:

             print_refined_text(file_handle.readlines())

     return 0

 if __name__ == '__main__':

     sys.exit(main(len(sys.argv), sys.argv))

测试运行如下：

$ ./foo.py /tmp/a.log > /tmp/a && cat /tmp/a

         <<<test_start>>>

         tag=dio30 stime=

         cmdline="diotest6 -b 65536 -n 100 -i 100 -o 1024000"

         contacts=""

         analysis=exit

         <<<test_output>>>

         diotest06      TPASS  :  Read with Direct IO, Write without

         diotest06      TFAIL  :  diotest6.c:: readv failed, ret =

         diotest06      TFAIL  :  diotest6.c:: Write Direct-child  failed

        diotest06      TPASS  :  Read with Direct IO, Write without

    ...<repeats  times>...

        diotest06      TFAIL  :  diotest6.c:: Write with Direct IO, Read without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TFAIL  :  diotest6.c:: Write with Direct IO, Read without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TFAIL  :  diotest6.c:: Write with Direct IO, Read without

    ...<repeats  times>...

        diotest06      TPASS  :  Read, Write with Direct IO

        diotest06      TINFO  :  / testblocks failed

        incrementing stop

        <<<execution_status>>>

        initiation_status="ok"

        duration= termination_type=exited termination_id= corefile=no

        cutime= cstime=

        <<<test_end>>>

$ ./foo.py -r /tmp/a.log > /tmp/b && cat /tmp/b

         <<<test_start>>>

         tag=dio30 stime=

         cmdline="diotest6 -b 65536 -n 100 -i 100 -o 1024000"

         contacts=""

         analysis=exit

         <<<test_output>>>

         diotest06      TPASS  :  Read with Direct IO, Write without

         diotest06      TFAIL  :  diotest6.c:: readv failed, ret =

         diotest06      TFAIL  :  diotest6.c:: Write Direct-child  failed

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TFAIL  :  diotest6.c:: Write with Direct IO, Read without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TFAIL  :  diotest6.c:: Write with Direct IO, Read without

        diotest06      TPASS  :  Read with Direct IO, Write without

        diotest06      TFAIL  :  diotest6.c:: Write with Direct IO, Read without

        diotest06      TFAIL  :  diotest6.c:: Write with Direct IO, Read without

        diotest06      TFAIL  :  diotest6.c:: Write with Direct IO, Read without

        diotest06      TPASS  :  Read, Write with Direct IO

        diotest06      TINFO  :  / testblocks failed

        incrementing stop

        <<<execution_status>>>

        initiation_status="ok"

        duration= termination_type=exited termination_id= corefile=no

        cutime= cstime=

        <<<test_end>>>

用meld对照/tmp/a和/tmp/b截图如下：

[Python学习笔记-008] 使用双向链表去掉重复的文本行的更多相关文章

Python学习笔记008
while循环 while 条件 : 执行 num =1 while num<=10: print(num) num+=1 1-100偶数方法1 num =2 while num& ...
OpenCV之Python学习笔记
OpenCV之Python学习笔记直都在用Python+OpenCV做一些算法的原型.本来想留下发布一些文章的,可是整理一下就有点无奈了,都是写零散不成系统的小片段.现在看到一本国外的新书< ...
Python学习笔记（四）
Python学习笔记(四) 作业讲解编码和解码 1. 作业讲解重复代码瘦身 # 定义地图 nav = {'省略'} # 现在所处的层 current_layer = nav # 记录你去过的地方 ...
【python学习笔记】3.字符串使用
[python学习笔记]3.字符串使用字符串是一种序列,素有标准的序列操作对字符串用样适用,字符串是不可以改变格式化操作符,%,左侧是格式化字符串,右侧是被格式的值,可以是一个值.元组.字典数值 ...
python学习笔记(二)、字符串操作
该一系列python学习笔记都是根据<Python基础教程(第3版)>内容所记录整理的 1.字符串基本操作所有标准序列操作(索引.切片.乘法.成员资格检查.长度.最小值和最大值)都适用于 ...
python学习笔记(一)、列表和元祖
该一系列python学习笔记都是根据<Python基础教程(第3版)>内容所记录整理的 1.通用的序列操作有几种操作适用于所有序列,包括索引.切片.相加.相乘和成员资格检查.另外,Pyt ...
Deep learning with Python 学习笔记（10）
生成式深度学习机器学习模型能够对图像.音乐和故事的统计潜在空间(latent space)进行学习,然后从这个空间中采样(sample),创造出与模型在训练数据中所见到的艺术作品具有相似特征的新作品 ...
Deep learning with Python 学习笔记（9）
神经网络模型的优化使用 Keras 回调函数使用 model.fit()或 model.fit_generator() 在一个大型数据集上启动数十轮的训练,有点类似于扔一架纸飞机,一开始给它一点推 ...
Deep learning with Python 学习笔记（8）
Keras 函数式编程利用 Keras 函数式 API,你可以构建类图(graph-like)模型.在不同的输入之间共享某一层,并且还可以像使用 Python 函数一样使用 Keras 模型.Ker ...

随机推荐

Python【day 10】函数进阶-小结
本节主要内容1.动态参数 *args **kwargs 形参:*args将多个位置参数聚合打包成元组 **kwargs将多个关键字参数聚合打包成字典实参:*li1将列表进行解包打散成多个位置参数 * ...
tf.reduce_sum（） and tf.where（）的用法
import tensorflow as tfimport numpy as npsess=tf.Session()a=np.ones((5,6))c=tf.cast(tf.reduce_sum(a, ...
微信小程序滚动tab的实现
微信小程序滚动tab的实现 1.目的:为了解决滚动版本的tab,以及实现虹吸效果. 2.方案:自己动手写了一个Demo,用于测试实现如上的效果.其代码有做参考,在这里先声明.具体的参照如下:重庆大学二 ...
几种高效的Java工具类推荐
本文将介绍了十二种常用的.高效的Java工具类在Java中,工具类定义了一组公共方法,这篇文章将介绍Java中使用最频繁及最通用的Java工具类. 在开发中,使用这些工具类,不仅可以提高编码效率,还 ...
odoo10学习笔记十：Actions
转载请注明原文地址:https://www.cnblogs.com/ygj0930/p/11189319.html actions定义了系统对于用户的操作的响应:登录.按钮.选择项目等. 一:窗口ac ...
net core 2 读取appsettings.json
问: .Net Core: Application startup exception: System.IO.FileNotFoundException: The configuration file ...
2019 China Collegiate Programming Contest Qinhuangdao Onsite
传送门 D - Decimal 题意: 询问$\frac{1}{n}$是否为有限小数. 思路: 拆质因子,看是不是只包含2和5即可,否则除不尽. Code #include <bits/st ...
XSS-Stored
存储型XSS (持久性XSS) 将恶意JavaScript代码存储在数据库,当下次用户浏览的时候执行 Low <?php if( isset( $_POST[ 'btnSign' ] ) ) { ...
zz视频分割在移动端的算法进展综述
视频分割在移动端的算法进展综述语义分割任务要求给图像上的每一个像素赋予一个带有语义的标签,视频语义分割任务是要求给视频中的每一帧图像上的每一个像素赋予一个带有语义的标签. 视频分割是一项广泛使用的技 ...
NOIP 2004 合并果子
洛谷P1090 https://www.luogu.org/problemnew/show/P1090 JDOJ 1270 题目描述在一个果园里,多多已经将所有的果子打了下来,而且按果子的不同种类分 ...

[Python学习笔记-008] 使用双向链表去掉重复的文本行

[Python学习笔记-008] 使用双向链表去掉重复的文本行的更多相关文章

随机推荐

热门专题