量化Hacker News 中50天的数据 Quantifying Hacker News with 50 days of data

GarfieldEr007 2024-08-15 03:07:34 原文

Quantifying Hacker News

I thought it would be fun to analyze the activity on one of my favorite sources of interesting links and information, Hacker News. My source of data is a script I've set up some time in August that downloads HN (the Front page and the New stories page) every minute. We will be interested in visualizing the stories as they get upvoted during the day, figuring out which domains/users are most popular, what topics are most popular, and the best time to post a story. I'm making all my data and code (Python data collection scripts + IPython Notebook for analysis) available in case you'd like to carry out a similar analysis.

Data collection protocol

I set up a very simple python script that scrapes the HN front page and the new stories page every minute. A single day of data begins at 4am (PST) and ends at 4am the next day. The .html files are saved compressed as gzipped pickles and one day occupies roughly 10mb in this format. I had bring down my machine for a few days a few times so there are some gaps in the data, but in the end we get 47 days of data from period between August 22 and October 30.

Raw HTML data parsing

The parsing Python script uses BeautifulSoup to convert the raw HTML into a more structured JSON. This script was by the way by no means simple to write -- HN is based on unstructured tables and I had to discover many strange edge cases in its behavior along the way. At the end I ended up with a 100-line ugliest-parsing-function-ever (really, I'm not proud of it) but it works and outputs something like the following for a single story at a specific snapshot:

{

'domain': u'play.google.com', 'title': u'Nexus 5',

'url': u'https://play.google.com/store/devices/details?id=nexus_5_black_16gb',

'num_comments': 42, 'rank': 1, 'points': 65,

'user': u'sonier', 'minutes_ago': 39, 'id': u'6648519'

}

We get 60 such entries every minute (30 for front page and 30 for new page) and these are again all saved to disk. We are now ready to bring out the IPython Notebook and get to the juicy analysis!

The Analysis: Detailed analysis

Head over to the IPython Notebook rendered as HTML for the analysis:

Note: I had the entire dataset and .ipynb Ipython Notebook source available for download but recently took it down to save space on my host (sorry).

from: http://karpathy.github.io/2013/11/27/quantifying-hacker-news/

量化Hacker News 中50天的数据 Quantifying Hacker News with 50 days of data的更多相关文章

Hi3559AV100 NNIE开发（5）mobilefacenet.wk仿真成功量化及与CNN_convert_bin_and_print_featuremap.py输出中间层数据对比过程
前面随笔给出了NNIE开发的基本知识,下面几篇随笔将着重于Mobilefacenet NNIE开发,实现mobilefacenet.wk的chip版本,并在Hi3559AV100上实现mobilefa ...
分享一个SQLSERVER脚本（计算数据库中各个表的数据量和每行记录所占用空间）
分享一个SQLSERVER脚本(计算数据库中各个表的数据量和每行记录所占用空间) 很多时候我们都需要计算数据库中各个表的数据量和每行记录所占用空间这里共享一个脚本 CREATE TABLE #tab ...
for循环往Oracle中插入n条数据，主键自增
1.主键自增实现方法:http://www.cnblogs.com/Donnnnnn/p/5959871.html 2.for循环往Oracle中插入n条数据 BEGIN .. loop insert ...
转：SQL SERVER数据库中实现快速的数据提取和数据分页
探讨如何在有着1000万条数据的MS SQL SERVER数据库中实现快速的数据提取和数据分页.以下代码说明了我们实例中数据库的“红头文件”一表的部分数据结构: CREATE TABLE [dbo]. ...
SQL Server 2008中新增的变更数据捕获（CDC）和更改跟踪
来源:http://www.cnblogs.com/downmoon/archive/2012/04/10/2439462.html 本文主要介绍SQL Server中记录数据变更的四个方法:触发器 ...
在JSP页面中输出JSON格式数据
JSON-taglib是一套使在JSP页面中输出JSON格式数据的标签库. JSON-taglib主页: http://json-taglib.sourceforge.net/index.html J ...
（转）分享一个SQLSERVER脚本（计算数据库中各个表的数据量和每行记录所占用空间）
分享一个SQLSERVER脚本(计算数据库中各个表的数据量和每行记录所占用空间) 很多时候我们都需要计算数据库中各个表的数据量和每行记录所占用空间这里共享一个脚本 CREATE TABLE #tab ...
Android编程中的5种数据存储方式
Android编程中的5种数据存储方式作者:牛奶.不加糖字体:[增加减小] 类型:转载时间:2015-12-03我要评论这篇文章主要介绍了Android编程中的5种数据存储方式,结合实例形式 ...
另类爬虫：从PDF文件中爬取表格数据
简介本文将展示一个稍微不一样点的爬虫. 以往我们的爬虫都是从网络上爬取数据,因为网页一般用HTML,CSS,JavaScript代码写成,因此,有大量成熟的技术来爬取网页中的各种数据.这次, ...

随机推荐

Azure IaaS for IT Pros Online Event 总结
微软一个为期4天的一个有关于Azure的介绍,主要总结了些Azure现有的技术以及将会推出东西主题链接 http://channel9.msdn.com/Events/Microsoft-Azure ...
java之classpath到底是什么
如果你输入一个命令,比如java那么系统是如何找到这个命令的呢?按照顺序,系统先在当前目录搜索是否有java.exe, java.bat 等. 如果没有,就得到系统的PATH(不区分大小写)里面查找. ...
常见前端面试题之HTML/CSS部分
转自http://www.cnblogs.com/jscode/archive/2012/07/10/2583856.html Doctype是什么?如何触发严格模式与混杂模式模式?区分它们有何意义? ...
【笔记】UML核心元素
1.参与者定义:在系统之外与系统交互的某人或某物. 特点:1.可以非人:2.与系统直接交互:3.主动发出动作并获得反馈:4.涉众(stakerholder)的代表具有两个版型: 1.业务主角(bu ...
简单制作mib表
今天放假后第一天上班,将假前自学制作mib表的东西说一下. 在这里呢,我以世界-中国-上海-闵行这种包含关系介绍,感觉更容易理解. MIB file的开始和结束所有的MIB file的都以DEFIN ...
在ubuntu16.04 下安装haproxy 1.5.11 做tcp负载均衡
由于haproxy需要FQ下载,所以从csdn下载了较为新版的haproxy1.5.11,安装过程如下: 1. 解压haproxy-1.5.11.tar.gz : tar xzvf haproxy-1 ...
被git extensions给坑了，Not owner 解决办法
我只遇到这一种情况,不能访问git文件主库, 清空以前的历史文件,包括默认配置文件,重新安装一遍git extension,然后设置账号和密码, 在打开bash设置私钥和公钥把公钥添加到你的git的 ...
2208: [Jsoi2010]连通数 - BZOJ
Description Input 输入数据第一行是图顶点的数量,一个正整数N. 接下来N行,每行N个字符.第i行第j列的1表示顶点i到j有边,0则表示无边. Output 输出一行一个整数,表示该图 ...
QT windows msvc下使用boost库（备忘）
win32-msvc2015: { contains(QMAKE_HOST.arch, x86):{ INCLUDEPATH += D:\3SDK\boost_1_61_0 LIBS += -LD:\ ...
ios map 显示用户位置
昨天遇到个奇怪的问题,用户的位置在地图中死活不显示,showUserLocation也设置了,最后发现是因为实现了 mapView protocol中的一个方法: -(MKAnnotationView ...