Datasets for Data Mining and Data Science
https://github.com/mattbane/RecommenderSystem
http://grouplens.org/datasets/movielens/
From KDnuggets
Data repositories
- AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
- BigML big list of public data sources.
- Bioassay data, described in Virtual screening of bioassay data, by Amanda Schierz, J. of Cheminformatics, with 21 Bioassay datasets (Active / Inactive compounds) available for download.
- Bitly 1.usa.gov data, anonymized clicks on gov links.
- Canada Open Data, pilot project with many government and geospatial datasets.
- Causality Workbench data repository.
- Corral Big Data repository at Texas Advanced Computing Center, supporting data-centric science.
- Data Source Handbook, A Guide to Public Data, by Pete Warden, O'Reilly (Jan 2011).
- Datacatalogs.org, open government data from US, EU, Canada, CKAN, and more.
- Data.gov.uk, publicly available data from the UK (see also the London Datastore).
- Data.gov/Education, central guide for education data resources including high-value data sets, data visualization tools, resources for the classroom, applications created from open data and more.
- DataMarket, visualize the world's economy, societies, nature, and industries, with 100 million time series from UN, World Bank, Eurostat and other important data providers.
- Datamob, public data put to good use.
- DataSF.org, a clearinghouse of datasets available from the City & County of San Francisco, CA.
- DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets.
- Delve, Data for Evaluating Learning in Valid Experiments
- EconData, thousands of economic time series, produced by a number of US Government agencies.
- Enron Email Dataset, data from about 150 users, mostly senior management of Enron.
- Europeana Data, contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana - the trusted and comprehensive resource for European cultural heritage content.
- FEDSTATS, a comprehensive source of US statistics and more
- FIMI repository for frequent itemset mining, implementations and datasets.
- Financial Data Finder at OSU, a large catalog of financial data sets.
- GDELT: The Global Database of Events, Language, and Tone, described by the Guardian as "a big data history of life, the universe and everything."
- GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated online resource for gene expression data browsing, query, and retrieval.
- GeoDa Center, geographical and spatial data.
- Google ngrams datasets, text from millions of books scanned by Google.
- Grain Market Research, financial data including stocks, futures, etc.
- Hilary Mason research-quality Big Data sets collection - many text and image datasets.
- HitCompanies Datasets, comprehensive data on a random sample of 10,000 UK companies from HitCompanies, updated automatically using AI/Machine Learning.
- ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.
- Infochimps, an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.
- Investor Links, includes financial data
- KDD Cup center, with all data, tasks, and results.
- Kevin Chai list of datasets, for text, SNA, and other fields.
- KONECT, the Koblenz Network Collection, with large network datasets of all types in order to perform research in the area of network mining.
- Linking Open Data project, aimed at making data freely available to everyone.
- Million Song Dataset
- MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
- ML Data, the data repository of the EU PASCAL2 network.
- NASDAQ Data Store, provides access to market data.
- National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
- National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
- NetworkRepository: Interactive Data Repository, has many collections of graph and networks from social science, machine learning, scientific computing, and other areas.
- Open Data Census, assesses the state of open data around the world.
- OpenData from Socrata, access to over 10,000 datasets including business, education, government, and fun.
- Open Source Sports, many sports databases, including Baseball, Football, Basketball, and Hockey.
- Peter Skomoroch dataset Bookmarks
- PubGene(TM) Gene Database and Tools, genomic-related publications database
- Quandl, a collaboratively curated portal to millions of financial and economic time-series datasets.
- qunb, a platform to find and visualize quantitative data.
- Robert Shiller data on housing, stock market, and more from his book Irrational Exuberance.
- SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
- Jerry Smith dataset collection, with Finance, Government, Machine Learning, Science, and other data.
- SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users' activities at the project management web site.
- StatLib, CMU Datasets Archive.
- STATOO Datasets part 1 and STATOO Datasets part 2
- Time Series Data Library
- Visual Analytics Benchmark Repository.
- UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
- UCI Machine Learning Repository.
- UCR Time Series Data Archive, offering datasets, papers, links, and code.
- United States Census Bureau.
- Wikiposit, a (virtual) amalgamation of (mostly financial) data from many different sites, allowing users to merge data from different sources
- Wolfram Alpha disease and patient level data.
- Yahoo Sandbox datasets: Language, Graph, Ratings, Advertising and Marketing, and Competition.
- Yelp Academic Dataset, all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research.
Related
- Data Mining Competitions
- KDD Cup results summary