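Let's run it and look at the result.

Install the sklearn package (published on PyPI as scikit-learn).

Install numpy.

Install scipy.

Now it finally works.

Let's record all the packages we installed in a file.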

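The result is a matrix with 4 rows and 4 columns. It is symmetric about the diagonal, so you only need to look at one half, and reading by rows or by columns gives the same values. The closer a value is to 1, the more similar the two documents are.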
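To make the shape of that matrix concrete, here is a minimal sketch with four made-up sentences (they are not the documents from the run above, so the numbers will differ). It only checks the properties just described: a 4 by 4 result, symmetry, and a diagonal of 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up documents, used only for illustration.
documents = [
    "I like cats",
    "I like cats and dogs",
    "the stock market fell today",
    "the stock market rose today",
]

tfidf = TfidfVectorizer().fit_transform(documents)
# Entry [i, j] is the similarity between document i and document j.
pairwise_sim = tfidf.dot(tfidf.T).toarray()

print(pairwise_sim.shape)
# Symmetric: entry [i, j] equals entry [j, i], so half the matrix is enough.
print(np.allclose(pairwise_sim, pairwise_sim.T))
# Each document compared with itself gives 1 on the diagonal.
print(np.allclose(np.diag(pairwise_sim), 1.0))
```

Documents 0 and 1 share words, so `pairwise_sim[0, 1]` is well above `pairwise_sim[0, 2]`, where the two documents have no words in common.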

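We can use this as follows: put a new article (one that has not been added to the database yet) in the top-left corner, that is, as document 0, and compare it against the older articles stored in MongoDB.

If a similarity exceeds some threshold, say 0.8, the two are highly similar and we treat them as the same story, so the new article in the top-left corner is dropped.

If the similarity is low, the article really is new, so we run the command that inserts it into MongoDB. That is the general idea.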
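The decision rule can be sketched roughly as below. This is only an illustration under assumptions: `is_duplicate`, `SAME_NEWS_THRESHOLD` and the sample texts are hypothetical names and data, and a plain Python list stands in for the articles that would actually be read from MongoDB.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

SAME_NEWS_THRESHOLD = 0.8  # the example cutoff mentioned above


def is_duplicate(new_text, stored_texts, threshold=SAME_NEWS_THRESHOLD):
    """Return True if new_text is too similar to any stored text.

    Hypothetical helper: stored_texts is a plain list standing in for
    the older articles that would come from the database.
    """
    if not stored_texts:
        return False
    # Put the new article first so it occupies row 0 (the top-left corner).
    documents = [new_text] + list(stored_texts)
    tfidf = TfidfVectorizer().fit_transform(documents)
    pairwise_sim = tfidf.dot(tfidf.T).toarray()
    # Row 0 holds the new article's similarity to every document;
    # column 0 is its similarity to itself, so skip it.
    return bool((pairwise_sim[0, 1:] > threshold).any())


# Made-up stored articles, for illustration only.
stored = [
    "the judge halted the travel ban on friday night",
    "the stock market rose sharply today",
]
repeat = is_duplicate("the judge halted the travel ban on friday night", stored)
fresh = is_duplicate("a new panda was born at the zoo", stored)
print(repeat, fresh)
```

When the function returns `False`, the caller would go on to insert the article into the database. When it returns `True`, the article is simply dropped.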

0.12693309 is the similarity between document 0 and document 1.

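Let's try a longer example.

The text contains special characters. Without the encoding comment on the first line, Python 2 reports a non-ASCII character error.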

# -*- coding: utf-8 -*-

from sklearn.feature_extraction.text import TfidfVectorizer

news1 = """
(CNN)President Donald Trump on Saturday again attacked a federal judge whose decision he disliked, blasting Judge James Robart, a George W. Bush appointee who temporarily stopped his controversial travel ban Friday night.
Trump's increasingly heated responses quickly drew objections from Democrats, who said he was improperly attacking an independent judiciary. By Saturday afternoon, Trump had stepped up his criticism: "Because the ban was lifted by a judge, many very bad and dangerous people may be pouring into our country. A terrible decision."
Shortly after a.m. ET, the President tweeted, "The opinion of this so-called judge, which essentially takes law-enforcement away from our country, is ridiculous and will be overturned."
The opinion of this so-called judge, which essentially takes law-enforcement away from our country, is ridiculous and will be overturned!
— Donald J. Trump (@realDonaldTrump) February ,
That tweet was one of several Trump issued Saturday morning in which he defended his executive order on immigration, which bars citizens of seven Muslim-majority countries from entering the US for days, all refugees for days and indefinitely halts refugees from Syria.
RELATED: James Robart: things to know about judge who blocked travel ban
"When a country is no longer able to say who can, and who cannot , come in & out, especially for reasons of safety &.security - big trouble," Trump next tweeted.
When a country is no longer able to say who can, and who cannot , come in & out, especially for reasons of safety &.security - big trouble!
— Donald J. Trump (@realDonaldTrump) February ,
"Interesting that certain Middle-Eastern countries agree with the ban. They know if certain people are allowed in it's death & destruction," he added, though he didn't name any countries.
Interesting that certain Middle-Eastern countries agree with the ban. They know if certain people are allowed in it's death & destruction!
— Donald J. Trump (@realDonaldTrump) February ,
Saturday afternoon, Trump resumed his criticism, tweeting: "What is our country coming to when a judge can halt a Homeland Security travel ban and anyone, even with bad intentions, can come into U.S.?"
What is our country coming to when a judge can halt a Homeland Security travel ban and anyone, even with bad intentions, can come into U.S.?
— Donald J. Trump (@realDonaldTrump) February ,
He followed up with, "Because the ban was lifted by a judge, many very bad and dangerous people may be pouring into our country. A terrible decision."
Because the ban was lifted by a judge, many very bad and dangerous people may be pouring into our country. A terrible decision
— Donald J. Trump (@realDonaldTrump) February ,
And he was still tweeting about it early Saturday evening: "Why aren't the lawyers looking at and using the Federal Court decision in Boston, which is at conflict with ridiculous lift ban decision?"
Why aren't the lawyers looking at and using the Federal Court decision in Boston, which is at conflict with ridiculous lift ban decision?
— Donald J. Trump (@realDonaldTrump) February ,
Trump was referring to a decision by a federal judge in Boston earlier Friday, a more limited ruling that declined to renew a temporary restraining order in Massachusetts. It would have prohibited the detention or removal of foreign travelers legally authorized to come to the Boston area, and the decision represented the Trump administration's first court victory regarding the order.
Unusual criticism
It is highly unusual for a President to publicly criticize a federal judge, but during the campaign, Trump memorably railed against Judge Gonzalo Curiel, who was overseeing a lawsuit against Trump University. Trump said Curiel, who was born in Indiana, was unable to fairly preside over the lawsuit because of his "Mexican heritage." Trump had introduced plans to build a wall along the Mexican border and take a hard stance on immigration.
Vice President Mike Pence later defended Trump in an interview with ABC News' George Stephanopoulos.
"Is it right for the President to say 'so-called' judge'? Doesn't that undermine the separation of powers in the Constitution?" Stephanopoulos asked Pence on "This Week" in a clip released Saturday afternoon.
"I don't think it does," Pence replied. "I think the American people are very accustomed to this president speaking his mind and speaking very straight with them."
ABC Breaking News | Latest News Videos
But Democrats pounced on Trump's criticism of Robart, with Democratic senators flatly saying the President's comments will factor into the confirmation hearings for Supreme Court nominee Neil Gorsuch.
"Attack on federal judge from POTUS is beneath the dignity of that office. That attitude can lead America to calamity," Washington Gov. Jay Inslee tweeted Saturday.
Attack on federal judge from POTUS is beneath the dignity of that office. That attitude can lead America to calamity.
— Governor Jay Inslee (@GovInslee) February ,
"The President's attack on Judge James Robart, a Bush appointee who passed with 99 votes, shows a disdain for an independent judiciary that doesn't always bend to his wishes and a continued lack of respect for the Constitution, making it more important that the Supreme Court serve as an independent check on the administration," Senate Minority Leader Chuck Schumer said in a statement.
"With each action testing the Constitution, and each personal attack on a judge, President Trump raises the bar even higher for Judge Gorsuch's nomination to serve on the Supreme Court. His ability to be an independent check will be front and center throughout the confirmation process."
Vermont. Sen. Patrick Leahy, the ranking member of the Judiciary Committee, said Trump's "hostility toward the rule of law is not just embarrassing, it is dangerous."
"We need a nominee for the Supreme Court willing to demonstrate he or she will not cower to an overreaching executive. This makes it even more important that Judge Gorsuch, and every other judge this president may nominate, demonstrates the ability to be an independent check and balance on an administration that shamefully and harmfully seems to reject the very concept."
Robart's order on Friday was a significant setback to Trump's ban and set up the nation for a second straight weekend of confusion about the policy's legality.
The White House said Friday the Department of Justice will challenge the decision. In a statement, White House press secretary Sean Spicer initially called Robart's order "outrageous" before quickly issuing another statement that dropped that word.
Robart has presided in the US District Court for the Western District of Washington state since . He assumed senior status in .
"""
news2 = """
President Donald Trump on Saturday again attacked a federal judge whose decision he disliked, criticizing Judge James Robart, a George W. Bush appointee who temporarily stopped his controversial travel ban Friday night.
President Trump’s attacks quickly drew objections from Democrats, who said he was attacking an independent judiciary. And by Saturday afternoon, President Trump was openly accusing Robart of potentially allowing “many very bad and dangerous people” to flow into the US and warning of dire consequences if the executive order is not enforced.
He also said, “What is our country coming to when a judge can halt a Homeland Security ban and anyone, even with bad intentions, can come into the U.S.?”
What is our country coming to when a judge can halt a Homeland Security travel ban and anyone, even with bad intentions, can come into U.S.?
— Donald J. Trump (@realDonaldTrump) February ,
Shortly after a.m. ET, the President tweeted, “The opinion of this so-called judge, which essentially takes law-enforcement away from our country, is ridiculous and will be overturned.”
The opinion of this so-called judge, which essentially takes law-enforcement away from our country, is ridiculous and will be overturned!
— Donald J. Trump (@realDonaldTrump) February ,
The tweet was one of several President Trump issued Saturday morning in which he defended his executive order on immigration, which bars citizens of seven Muslim-majority countries from entering the US for days, all refugees for days and indefinitely halts refugees from Syria.
“When a country is no longer able to say who can, and who cannot , come in & out, especially for reasons of safety &.security – big trouble,” President Trump next tweeted.
When a country is no longer able to say who can, and who cannot , come in & out, especially for reasons of safety &.security – big trouble!
— Donald J. Trump (@realDonaldTrump) February ,
“Interesting that certain Middle-Eastern countries agree with the ban. They know if certain people are allowed in it’s death & destruction,” he added, though he didn’t name any countries.
Interesting that certain Middle-Eastern countries agree with the ban. They know if certain people are allowed in it's death & destruction!
— Donald J. Trump (@realDonaldTrump) February ,
Saturday afternoon, President Trump resumed his criticism, tweeting: “What is our country coming to when a judge can halt a Homeland Security travel ban and anyone, even with bad intentions, can come into U.S.?”
He followed up with, “Because the ban was lifted by a judge, many very bad and dangerous people may be pouring into our country. A terrible decision.”
Because the ban was lifted by a judge, many very bad and dangerous people may be pouring into our country. A terrible decision
— Donald J. Trump (@realDonaldTrump) February ,
It is highly unusual for a President to publicly criticize a federal judge, but during the campaign, President Trump memorably railed against Judge Gonzalo Curiel, who was overseeing a lawsuit against Trump University. President Trump said Curiel, who was born in Indiana, was unable to fairly preside over the lawsuit because of his “Mexican heritage.” President Trump had introduced plans to build a wall along the Mexican border and take a hard stance on immigration.
Democrats pounced on President Trump’s criticism of Robart, with Democratic senators flatly saying the President’s comments will factor into the confirmation hearings for Supreme Court nominee Neil Gorsuch.
“Attack on federal judge from POTUS is beneath the dignity of that office. That attitude can lead America to calamity,” Washington Gov. Jay Inslee tweeted Saturday.
Attack on federal judge from POTUS is beneath the dignity of that office. That attitude can lead America to calamity.
— Governor Jay Inslee (@GovInslee) February ,
“The President’s attack on Judge James Robart, a Bush appointee who passed with votes, shows a disdain for an independent judiciary that doesn’t always bend to his wishes and a continued lack of respect for the Constitution, making it more important that the Supreme Court serve as an independent check on the administration,” Senate Minority Leader Chuck Schumer said in a statement.
“With each action testing the Constitution, and each personal attack on a judge, President Trump raises the bar even higher for Judge Gorsuch’s nomination to serve on the Supreme Court. His ability to be an independent check will be front and center throughout the confirmation process.”
Vermont. Sen. Patrick Leahy, the ranking member of the Judiciary Committee, said President Trump’s “hostility toward the rule of law is not just embarrassing, it is dangerous.”
“We need a nominee for the Supreme Court willing to demonstrate he or she will not cower to an overreaching executive. This makes it even more important that Judge Gorsuch, and every other judge this president may nominate, demonstrates the ability to be an independent check and balance on an administration that shamefully and harmfully seems to reject the very concept.”
Robart’s order on Friday was a significant setback to President Trump’s ban and set up the nation for a second straight weekend of confusion about the policy’s legality.
The White House said Friday the Department of Justice will challenge the decision. In a statement, White House press secretary Sean Spicer initially called Robart’s order “outrageous” before quickly issuing another statement that dropped that word.
Robart has presided in the US District Court for the Western District of Washington state since . He assumed senior status in .
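"""

documents = [news1, news2]
# Build the TF-IDF matrix: one row per document.
tfidf = TfidfVectorizer().fit_transform(documents)
# Multiplying by the transpose gives the pairwise similarity matrix.
pairwise_sim = tfidf * tfidf.T
print(pairwise_sim.toarray())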
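Why does multiplying the TF-IDF matrix by its transpose give a similarity score? By default `TfidfVectorizer` scales every row to unit length (`norm="l2"`), so the dot product of two rows is their cosine similarity. The sketch below checks this against scikit-learn's `cosine_similarity` on three made-up sentences. It is a sanity check, not part of the original script.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sentences, for illustration only.
docs = [
    "the judge stopped the travel ban",
    "a judge temporarily stopped the ban",
    "the weather is sunny this weekend",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default
product = tfidf.dot(tfidf.T).toarray()         # same product as in the script above
cosine = cosine_similarity(tfidf)              # reference cosine similarity

same = np.allclose(product, cosine)
print(same)
```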

test02

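The two articles above are both about Trump: the first is from CNN, the second from another site.

The similarity turns out to be very high, about 0.96.

So we can treat the two as the same story.

Next, we will use this algorithm for deduplication.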
