Using Pandas Library

The simplest way is to read data from .csv files and store it as a data frame object:

import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
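
To see what index_col and skiprows do, here is a minimal sketch that parses a small in-memory table instead of olympics.csv. The sample text and its column names are made up for illustration; only the read_csv parameters come from the example above.

```python
import io
import pandas as pd

# Hypothetical sample data: the first line is a title row we want to skip.
raw = io.StringIO(
    "Medal table (illustrative data)\n"
    "country,gold,silver\n"
    "Norway,2,1\n"
    "Japan,1,3\n"
)

# skiprows=1 drops the title line; index_col=0 uses the first column as the row index.
df = pd.read_csv(raw, index_col=0, skiprows=1)
print(df.loc["Japan", "silver"])
```

With the country names as the index, rows can be looked up by label through df.loc.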

You can also read .xls files and select the rows and columns you are interested in directly by setting the skiprows and usecols parameters. You can also set the index column with the index_col parameter.

energy = pd.read_excel('Energy Indicators.xls', sheet_name='Energy', skiprows=8, usecols='E,G', index_col=None, na_values=['NA'])
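
The usecols and na_values parameters are also accepted by read_csv, which makes them easy to try without an Excel file. The sketch below uses invented sample data. Note that Excel column letters such as 'E,G' are specific to read_excel, so read_csv takes column names or positions instead.

```python
import io
import pandas as pd

# Hypothetical sample data with two different markers for missing values.
raw = io.StringIO(
    "country,region,supply\n"
    "A,north,10\n"
    "B,south,NA\n"
    "C,north,...\n"
)

# Keep only two columns and treat both "NA" and "..." as missing values.
energy_demo = pd.read_csv(raw, usecols=["country", "supply"], na_values=["NA", "..."])
print(energy_demo["supply"].isna().sum())
```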

For .txt files, you can also use the read_csv function by specifying the separator:

university_towns=pd.read_csv('university_towns.txt',sep='\n',header=None)
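
Some recent pandas releases refuse '\n' as a separator, so if the call above raises an error on your installation, one alternative is to read the lines yourself and build the data frame from them. This is a sketch of that alternative, not the approach used above; the file name and contents are placeholders written to a temporary directory.

```python
import os
import tempfile
import pandas as pd

# Write a small placeholder file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, "towns_demo.txt")
with open(txt_path, "w", encoding="utf-8") as f:
    f.write("Alabama[edit]\nAuburn (Auburn University)\nAlaska[edit]\n")

# One row per line, in a single unnamed column (column label 0).
with open(txt_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
towns = pd.DataFrame(lines)
print(towns.shape)
```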

See more about pandas IO operations at http://pandas.pydata.org/pandas-docs/stable/io.html

Using os Module

Read .csv files:

import os
import csv
for file in os.listdir("objective_folder"):
    with open('objective_folder/'+file, newline='') as csvfile:
        rows = csv.reader(csvfile)  # read the csv file
        for row in rows:  # print each line in the file
            print(row)
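
If the files have a header row, csv.DictReader lets you address fields by name, and filtering on the file extension keeps non-CSV files in the folder from being parsed. A minimal sketch, using a temporary folder and made-up file contents in place of objective_folder:

```python
import csv
import os
import tempfile

# Build a throwaway folder with one CSV file and one unrelated file.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "scores.csv"), "w", newline="") as f:
    csv.writer(f).writerows([["name", "score"], ["Ann", "90"], ["Bob", "85"]])
with open(os.path.join(folder, "notes.md"), "w") as f:
    f.write("not a csv file\n")

records = []
for file_name in sorted(os.listdir(folder)):
    if not file_name.lower().endswith(".csv"):
        continue  # skip anything that is not a .csv file
    with open(os.path.join(folder, file_name), newline="") as csvfile:
        for row in csv.DictReader(csvfile):  # the header row supplies the keys
            records.append(row)

print(records)
```

Every value comes back as a string, so numeric fields such as score still need to be converted before use.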

Read .xls files:

import os
import xlrd
for file in os.listdir("objective_folder/"):
    data = xlrd.open_workbook('objective_folder/'+file)
    table = data.sheet_by_index(0)  # the first sheet in the Excel file
    nrows = table.nrows  # number of rows
    for i in range(nrows):
        if i == 0:  # skip the first row if it defines variable names
            continue
        row_values = table.row_values(i)  # read the values in each row
        print(row_values)

Download from Website Automatically

We can also read data directly from a URL. This time, the .csv file is compressed as housing.tgz, so we need to download the file and then decompress it. You can write a small function like the one below to do this. It is worth the effort because you get the most recent data every time you run the function.

import os
import tarfile
from six.moves import urllib  # six exposes urllib under the same name on Python 2 and 3
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # create the target directory if it does not exist yet
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)  # unpack housing.csv
    housing_tgz.close()
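
If you would rather not download the archive on every run, a variant can skip the download when the archive already exists locally. The sketch below is one possible variant, not the function above: fetch_archive, its force flag, and the demo archive are hypothetical names, and the demo uses a local file:// URL so that it runs without network access.

```python
import os
import pathlib
import tarfile
import tempfile
import urllib.request

def fetch_archive(url, dest_dir, archive_name="data.tgz", force=False):
    """Download a .tgz archive into dest_dir (unless already present) and extract it."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, archive_name)
    if force or not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=dest_dir)
    return archive_path

# Demo: build a tiny archive locally and fetch it through a file:// URL.
work = tempfile.mkdtemp()
src_csv = os.path.join(work, "demo.csv")
with open(src_csv, "w") as f:
    f.write("a,b\n1,2\n")
src_tgz = os.path.join(work, "source.tgz")
with tarfile.open(src_tgz, "w:gz") as archive:
    archive.add(src_csv, arcname="demo.csv")

dest = os.path.join(work, "datasets")
fetch_archive(pathlib.Path(src_tgz).as_uri(), dest)
print(sorted(os.listdir(dest)))
```

Passing force=True restores the download-every-time behaviour. Because extractall writes whatever paths the archive contains, use it only with archives from sources you trust.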

When you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts housing.csv from it into this directory.
Now let's load the data using Pandas. Once again, you should write a small function to load the data:

import os
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)  # returns the data as a data frame

What’s more?

These are the methods I have come across so far. In typical environments, your data would be available in a relational database (or some other common datastore) and spread across multiple tables, documents, or files. To access it, you would first need to get your credentials and access authorizations and familiarize yourself with the data schema. I will add more methods as I encounter them.
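
As a small illustration of the database case, the sketch below loads a query result into a data frame with pandas.read_sql_query. It uses an in-memory SQLite database so that no credentials are needed; the table and column names are invented, and a real database would require its own driver and connection details.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for a real relational datastore.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE districts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO districts VALUES (?, ?)",
    [("north", 1200), ("south", 800)],
)
conn.commit()

# Run a SQL query and get the result back as a data frame.
df = pd.read_sql_query(
    "SELECT name, population FROM districts ORDER BY population DESC", conn
)
conn.close()
print(df)
```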
