Writing a file-content similarity comparison in Ruby
1. Definition of similarity
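Let $T_1$ and $T_2$ be the two texts, each split into a collection of fixed-length strings, with $|T_1| = n$ and $|T_2| = m$. Let $C$ be the strings that $T_1$ and $T_2$ have in common, with $|C| = s$. The similarity is then $p = s / \max(n, m)$, and $p$ lies between 0 and 1.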
2. Design of the similarity detection algorithm
Algorithm design:
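Treat every 4 characters as one string and split T1 and T2 into such strings; if fewer than 4 characters remain at the end, pad them with spaces. Count the resulting strings, recording |T1| = n and |T2| = m, and set s = 0. Take the first string from T1 and check whether it occurs in T2. If it does, add 1 to s and delete the matching string from T2; repeat the check against T2 until T2 no longer contains the string being checked. Then go back to T1, take the next string, and check it against T2 in the same way. Continue until every string in T1 has been checked or every string in T2 has been deleted, then stop and record s. Divide s by the larger of n and m; the result is the similarity of T1 and T2. Run the check first with T1 as the template and then with T2 as the template, and take the smaller of the two similarity values.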
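The splitting and counting steps can be sketched on two short in-memory strings. This is only an illustration under assumptions: `chunks` and `matched` are hypothetical helper names, not part of the script in this post, and the inputs are made up.

```ruby
# Split text into fixed-width chunks, padding the tail with spaces.
def chunks(text, width = 4)
  pad = (width - text.size % width) % width
  (text + " " * pad).scan(/.{#{width}}/m)
end

# For each chunk of t1, count its copies in t2 and delete them from a
# working copy of t2, so that no chunk of t2 is counted twice.
def matched(t1, t2)
  rest = t2.dup
  t1.sum do |chunk|
    hits = rest.count(chunk)
    rest.delete(chunk)
    hits
  end
end

t1 = chunks("abcdefghij")    # ["abcd", "efgh", "ij  "]
t2 = chunks("abcdXXXXabcd")  # ["abcd", "XXXX", "abcd"]
s = matched(t1, t2)          # 2
puts s / [t1.size, t2.size].max.to_f
```

Here T2 contains "abcd" twice, so a single template string from T1 accounts for two matches, giving s = 2 and a similarity of 2/3.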
The Ruby implementation is as follows:
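# Pad str with spaces so that its length is a multiple of i.
def fill_str(str, i = 4)
  return str if str.size % i == 0
  str + " " * (i - str.size % i)
end

# Similarity of files f0 and f1: matched 4-character chunks / max(n, m).
# Returns nil if an error occurred.
def txt_cmp(f0, f1)
  str_f0, str_f1 = fill_str(File.read(f0)), fill_str(File.read(f1))
  a0, a1 = str_f0.scan(/.{4}/m), str_f1.scan(/.{4}/m)
  n, m, s = a0.size, a1.size, 0
  a0.each do |txt|
    if a1.include?(txt)
      size = a1.size
      # Delete every copy of txt from a1 and count how many were removed.
      s += size - a1.keep_if { |item| item != txt }.size
    end
    break if a1.empty?
  end
  s / [n, m].max.to_f
rescue => e
  puts "error : #{e.message}\n" << e.backtrace[0..2].join("\n")
end

(puts "you must cmp 2 txt file"; exit) if ARGV.size != 2
r = txt_cmp(f0 = ARGV[0], f1 = ARGV[1])
# r is nil when txt_cmp rescued an error, so only report a real result.
puts "#{f0} and #{f1} semblance is #{r * 100}%" if r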
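The algorithm description ends with a step that the script above does not perform: run the check with T1 as the template, then with T2 as the template, and keep the smaller value. A minimal sketch of that step on in-memory strings follows; `one_way` and `symmetric_similarity` are hypothetical names introduced for illustration, and returning 0 for two empty inputs is an assumption.

```ruby
# One-directional similarity: chunks of `a` are the templates, checked against `b`.
def one_way(a, b, width = 4)
  split = lambda do |text|
    pad = (width - text.size % width) % width
    (text + " " * pad).scan(/.{#{width}}/m)
  end
  ca, cb = split.call(a), split.call(b)
  n, m, s = ca.size, cb.size, 0
  return 0.0 if [n, m].max.zero?  # assumption: two empty inputs score 0
  ca.each do |chunk|
    s += cb.count(chunk)  # every copy in b counts as a match
    cb.delete(chunk)      # and is removed so it cannot match again
    break if cb.empty?
  end
  s / [n, m].max.to_f
end

# Check in both directions and keep the smaller value.
def symmetric_similarity(a, b)
  [one_way(a, b), one_way(b, a)].min
end

a = "abcdabcdabcdefgh"  # chunks: abcd, abcd, abcd, efgh
b = "abcdefghijkl"      # chunks: abcd, efgh, ijkl
puts one_way(a, b)               # 0.5
puts one_way(b, a)               # 1.0
puts symmetric_similarity(a, b)  # 0.5
```

The two directions can differ because a repeated string in the checked text is counted once per copy, which is why taking the minimum gives the more conservative figure.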
Below are four files, 1.txt, 2.txt, a.txt and b.txt, with the following contents:
1.txt
NFC East rival quarterbacks Tony Romo(notes) of the Dallas Cowboys and Eli Manning(notes) of the New York Giants now have something else in common ḂẂ they've used the same wedding planner to help them tie the knot. Todd Fiscus, the man with the plan, set
up what he called "man food" at Dallas' Arlington Hall on Saturday, when Romo married former Miss Missouri Candace Crawford. "I have a lot of football players to feed," said Fiscus, who had pizza and short ribs on the menu.
However, Romo apparently put all the tunes together. "Tony picked out every song, and when it plays, and what the keynote things are," Fiscus said.
Sounds like a very orderly occasion, but there was one wild card ḂẂ whether Cowboys owner Jerry Jones would be able to attend. With the continued lockout, owners and players are not supposed to have any contact away from the negotiating table. But Jones received
special dispensation from the NFL to attend, just as the Green Bay Packers recently were informed that they will, in fact, receive their Super Bowl rings in a June 16 ceremony no matter what the labor situation is at that time. Jones was there along with virtually
all of Romo's teammates.
It is unknown whether Jones and Romo actually discussed any labor issues at the wedding ḂẂ we're guessing this was more of a "friendly", though Jones is one of the most powerful owners on the NFL's side of things and Romo's marquee value gives him a lot of
play on the other side.
"I've gotten special permission," Jones recently told ESPN's Ed Werder. "But more than anything, (I got the) right ticket from him and his fianceẀḊ ḂẂ Romo's wife-to-be. (It's) one of prettiest invitations I've ever seen.
"So, yes, I will be there and (I'm) proud for him. He's got the best end of this deal."
Romo, who had been linked romantically before with Jessica Simpson and Carrie Underwood, proposed to Crawford last December. Crawford's brother Chace is known for his role on the TV show "Gossip Girl' and has also been linked romantically with Underwood.
According to the new Mrs. Romo, the lockout may play a part in the couple's plans for a honeymoon; usually around this time of year, her husband would be participating in minicamps and other off-season workouts.
"This lockout has been quite a dent in the honeymoon idea," she told WFAA-TV. "We'll see. We haven't really gotten there yet. We're taking a day at a time with the lockout. We (are not) even sure if we're gonna get to go (on) one."
2.txt
Officially, Memorial Day, observed on the last Monday of May (this year it's May 30), honors the war dead. Unofficially, the day honors the start of summer. (More on that in a moment.)
The upcoming three-day weekend has prompted searches on Yahoo! for "when is memorial day," "what is memorial day," and "memorial day history." The day was originally known as "Decoration Day" because the day was dedicated to the Civil War dead, when mourners
would decorate gravesites as a remembrance.
The holiday was first widely observed on May 30, 1868, when 5,000 people helped decorate the gravesites of 20,000 Union and Confederate soldiers buried at Arlington National Cemetery. (Some parts of the South still remember members of the Confederate Army with
Confederate Memorial Day.)
After World War I, the observances were widened to honor the fallen from all American wars--and in 1971, Congress declared Memorial Day a national holiday.
Towns across the country now honor military personnel with services, parades, and fireworks. A national moment of remembrance takes place at 3 p.m. At Arlington National Cemetery, headstones are graced with small American flags.
This day is not to be confused with Veterans Day, which is observed on November 11 to honor military veterans, both alive and dead.
However, confusion abounds anyway, with the weekend marking for many the kickoff of summer, and it is reserved for weekend getaways, picnics, and sales. Searches on "memorial day sales," "memorial day recipes," and "memorial day weekend" are just some of the
lookups related to the festivities.
a.txt
23l4kj23 klgjdlskgj235 3lkj 0952ru lkfj lkqejfg
2t34lktj3409t uj34gjklejeglekjfdklsafjalsfj
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
sdgakdgjsdalgjaslfjsalkfjsadlf
b.txt
23l4kj23 klgjdlskgj235 3lkj 0952ru lkfj lkqejfg
2t34lktj3409t uj34gjklejeglekjfdklsafjalsfj
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
sdgakdgjsdalgjaslfjsalkfjsadlf
The test runs are as follows:
ruby -EISO-8859-14 txtcmp.rb 1.txt 2.txt
1.txt and 2.txt semblance is 8.653846153846153%
ruby txtcmp.rb a.txt b.txt
a.txt and b.txt semblance is 79.54545454545455%
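Because 1.txt contains non-UTF-8 characters, comparing it with the default encoding raises an error, so an external encoding is specified for that comparison.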
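As an alternative to the -E switch, the encoding could be given when the file is read. The sketch below assumes ISO-8859-14, as in the command above; `read_as` is a hypothetical helper and the sample bytes are made up for the demonstration.

```ruby
require "tempfile"

# Read a file and tag its contents with the given external encoding.
def read_as(path, encoding = "ISO-8859-14")
  File.read(path, encoding: encoding)
end

Tempfile.create("sample") do |f|
  f.binmode
  f.write("abc\xA1\xAAdef".b)  # two bytes that are not valid UTF-8
  f.flush
  utf8  = File.read(f.path, encoding: "UTF-8")
  latin = read_as(f.path)
  puts utf8.valid_encoding?    # false
  puts latin.valid_encoding?   # true
  p latin.scan(/.{4}/m).size   # 2
end
```

With a single-byte encoding every byte is a valid character, so the 4-character split works on the whole file.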