[Machine Learning with Python] How to get your data?
Using Pandas Library
The simplest way is to read data from .csv files and store it as a data frame object:
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
You can also read .xsl files and directly select the rows and columns you are interested in by setting parameters skiprows, usecols. Also, you can indicate index column by parameter index_col.
energy=pd.read_excel('Energy Indicators.xls', sheet_name='Energy',skiprows=8,usecols='E,G', index_col=None, na_values=['NA'])
For .txt files, you can also use read_csv function by defining the separation symbol:
university_towns=pd.read_csv('university_towns.txt',sep='\n',header=None)
See more about pandas io operations in http://pandas.pydata.org/pandas-docs/stable/io.html
Using os Module
Read .csv files:
import os
import csv
for file in os.listdir("objective_folder"):
with open('objective_folder/'+file, newline='') as csvfile:
rows = csv.reader(csvfile) # read csc file
for row in rows: # print each line in the file
print(row)
Read .xsl files:
import os
import xlrd
for file in os.listdir("objective_folder/"):
data = xlrd.open_workbook('objective_folder/'+file)
table = sheel_1 = data.sheet_by_index(0)#the first sheet in Excel
nrows = table.nrows #row number
for i in range(nrows):
if i == 0: # skip the first row if it defines variable names
continue
row_values = table.row_values(i) #read each row value
print(row_values)
Download from Website Automatically
We can also try to read data directly from url link. This time, the .csv file is compressed as housing.tgz. We need to download the file and then decompress it. So you can write a small function as below to realize it. It is a worthy effort because you can get the most recent data every time you run the function.
import os
import tarfile
from six.moves import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
if not os.path.isdir(housing_path):
os.makedirs(housing_path)
tgz_path = os.path.join(housing_path, "housing.tgz")
urllib.request.urlretrieve(housing_url, tgz_path)
housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=housing_path)
housing_tgz.close()
when you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts the housing.csv from it in this directory.
Now let’s load the data using Pandas. Once again you should write a small function to load the data:
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
csv_path = os.path.join(housing_path, "housing.csv")
return pd.read_csv(csv_path)
What’s more?
These methods are what I have met so far. In typical environments your data would be available in a relational database (or some other common datastore) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations, and familiarize yourself with the data schema. I will supplement more methods if I encounter in the future.
[Machine Learning with Python] How to get your data?的更多相关文章
- 【Machine Learning】Python开发工具:Anaconda+Sublime
Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...
- Python (1) - 7 Steps to Mastering Machine Learning With Python
Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...
- Getting started with machine learning in Python
Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...
- 《Learning scikit-learn Machine Learning in Python》chapter1
前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...
- Machine Learning的Python环境设置
Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...
- [Machine Learning with Python] Familiar with Your Data
Here I list some useful functions in Python to get familiar with your data. As an example, we load a ...
- [Machine Learning with Python] My First Data Preprocessing Pipeline with Titanic Dataset
The Dataset was acquired from https://www.kaggle.com/c/titanic For data preprocessing, I firstly def ...
- [Machine Learning with Python] Data Preparation through Transformation Pipeline
In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...
- [Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn
In this article, we dicuss some main steps in data preparation. Drop Labels Firstly, we drop labels ...
随机推荐
- vim小操作
初时,先有ed,ed为ex之父,ex为vi之父,而vi为vim之父 c 修改 d 删除 y 复制到寄存器 g~ 反转大小写 gu 反转为小写 gU 反转为大写 > 增加缩进 < 减小缩进 ...
- 第一课 项目的介绍 Thinkphp5第四季
学习地址: https://study.163.com/course/courseLearn.htm?courseId=1004887012#/learn/video?lessonId=1050543 ...
- guava笔记
guava是在原先google-collection 的基础上发展过来的,是一个比较优秀的外部开源包,最近项目中使用的比较多,列举一些点.刚刚接触就被guava吸引了... 这个是gua ...
- Eclipse设置C++自动补全变量名快捷键
用快捷键:Alt+/ 要是还是有些场合不能提示,按照下列步骤 Window-Preferences-c/c++-Editor-Content Assist-Advanced 将未勾选的全部勾选
- centos7 killall 命令
centos7精简安装后,使用中发现没有killall命令. 安装这个包即可: yum install psmisc
- Pond Cascade Gym - 101670B 贪心+数学
题目:题目链接 思路:题目让求最下面池子满的时间和所有池子满的时间,首先我们考虑所有池子满的时间,我们从上到下考虑,因为某些池子满了之后溢出只能往下溢水,考虑当前池子如果注满时间最长,那么从第一个池子 ...
- BZOJ 4896: [Thu Summer Camp2016]补退选
trie树+vector+二分 别忘了abs(ans) #include<cstdio> #include<algorithm> #include<vector> ...
- UVa 10118 记忆化搜索 Free Candies
假设在当前状态我们第i堆糖果分别取了cnt[i]个,那么篮子里以及口袋里糖果的个数都是可以确定下来的. 所以就可以使用记忆化搜索. #include <cstdio> #include & ...
- Linux inode 之我见
Linux硬盘组织方式为:引导区.超级块(superblock),索引结点(inode),数据块(datablock),目录块(diredtory block).其中超级块中包含了关于该硬盘或分区上的 ...
- MongoDB 3.6 安装详解
在ubuntu和多数linux发行版的包安装源中MongoDB默认的版本是2.4,但2.4所使用的存储引擎不支持collecitons级别的锁,只支持database级别的,所以在开发中2.4版本的m ...