Using Pandas Library

The simplest way is to read data from .csv files and store it as a data frame object:

import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

You can also read .xls files and select only the rows and columns you are interested in by setting the skiprows and usecols parameters. The index_col parameter specifies which column to use as the index.

energy = pd.read_excel('Energy Indicators.xls', sheet_name='Energy', skiprows=8, usecols='E,G', index_col=None, na_values=['NA'])
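
As a minimal sketch of how skiprows, usecols, and index_col work together, the example below applies them to a small in-memory CSV with read_csv, which accepts the same three parameters. The sample data and column names are made up for illustration and do not come from the files used above.

```python
import io

import pandas as pd

# Hypothetical sample data: one junk line followed by a header and two rows.
raw = io.StringIO(
    "Exported report, ignore this line\n"
    "country,gold,silver,bronze\n"
    "Norway,2,1,0\n"
    "Japan,1,0,3\n"
)

medals = pd.read_csv(
    raw,
    skiprows=1,                  # skip the junk line before the header
    usecols=["country", "gold"], # keep only these two columns
    index_col="country",         # use the country column as the index
)
print(medals)
```

Only the gold column remains as data, and the country names become the row labels.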

For .txt files, you can also use the read_csv function and specify the separator:

university_towns=pd.read_csv('university_towns.txt',sep='\n',header=None)
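
Some recent pandas versions may raise a ValueError when '\n' is passed as the separator. A possible alternative, sketched here under that assumption, reads the text line by line in plain Python and wraps the lines in a single-column data frame. The sample text is hypothetical and only stands in for a file such as university_towns.txt.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the contents of a text file.
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)\n"
)

# Keep one entry per non-empty line, without the trailing newline.
lines = [line.rstrip("\n") for line in sample if line.strip()]

# One row per line; the single column is labeled 0, as with header=None.
towns = pd.DataFrame(lines)
print(towns)
```

For a real file, replace the StringIO object with open('university_towns.txt').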

See more about pandas IO operations at http://pandas.pydata.org/pandas-docs/stable/io.html

Using os Module

Read .csv files:

import os
import csv

for file in os.listdir("objective_folder"):
    with open(os.path.join("objective_folder", file), newline='') as csvfile:
        rows = csv.reader(csvfile)  # read the csv file
        for row in rows:  # print each line in the file
            print(row)
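
os.listdir returns every entry in the folder, so the loop above would also try to parse files that are not CSV. The sketch below is one way to filter by extension and collect the rows instead of printing them. The function name read_csv_folder and the demo files are hypothetical, and the demo runs in a temporary folder so it does not depend on objective_folder existing.

```python
import csv
import os
import tempfile


def read_csv_folder(folder):
    """Return {filename: list of rows} for every .csv file in folder."""
    tables = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".csv"):
            continue  # ignore anything that is not a CSV file
        path = os.path.join(folder, name)
        with open(path, newline="") as csvfile:
            tables[name] = list(csv.reader(csvfile))
    return tables


# Demo on a throwaway folder holding one CSV file and one unrelated file.
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, "a.csv"), "w", newline="") as f:
        csv.writer(f).writerows([["id", "value"], ["1", "10"]])
    with open(os.path.join(demo_folder, "notes.txt"), "w") as f:
        f.write("not a csv\n")
    demo_tables = read_csv_folder(demo_folder)

print(demo_tables)
```

Note that csv.reader yields every field as a string, so numeric columns still need converting.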

Read .xls files:

import os
import xlrd

for file in os.listdir("objective_folder/"):
    data = xlrd.open_workbook(os.path.join("objective_folder", file))
    table = data.sheet_by_index(0)  # the first sheet in the workbook
    nrows = table.nrows  # number of rows
    for i in range(nrows):
        if i == 0:  # skip the first row if it defines variable names
            continue
        row_values = table.row_values(i)  # read the values in each row
        print(row_values)

Download from Website Automatically

We can also read data directly from a URL. In this example, the .csv file is compressed as housing.tgz, so we need to download the file and then decompress it. A small function like the one below does both. It is worth the effort because you get the most recent data every time you run the function.

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # create the target directory only if it does not exist yet
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    # download and extract on every call so the data stays current
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

When you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts housing.csv from it into this directory.
Now let’s load the data using Pandas. Once again you should write a small function to load the data:

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):

    csv_path = os.path.join(housing_path, "housing.csv")

    return pd.read_csv(csv_path)
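
To check the extract-then-load steps without a network connection, one option is to build a small archive locally and run the same tarfile and read_csv calls on it. This is only a sketch: the function name extract_and_load, the file sample.csv, and its contents are hypothetical and are not part of the housing dataset.

```python
import os
import tarfile
import tempfile

import pandas as pd


def extract_and_load(tgz_path, target_dir, csv_name):
    """Extract a .tgz archive into target_dir and load one CSV from it."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(tgz_path) as archive:
        archive.extractall(path=target_dir)
    return pd.read_csv(os.path.join(target_dir, csv_name))


with tempfile.TemporaryDirectory() as workdir:
    # Write a tiny CSV file and pack it into a gzip-compressed tar archive.
    csv_path = os.path.join(workdir, "sample.csv")
    with open(csv_path, "w") as f:
        f.write("rooms,price\n3,100\n4,150\n")
    tgz_path = os.path.join(workdir, "sample.tgz")
    with tarfile.open(tgz_path, "w:gz") as archive:
        archive.add(csv_path, arcname="sample.csv")

    # Extract into a separate directory and load the result.
    demo_df = extract_and_load(tgz_path, os.path.join(workdir, "out"), "sample.csv")

print(demo_df)
```

Opening the archive in a with block closes it even if extraction fails, which the explicit close() call does not guarantee.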

What’s more?

These are the methods I have come across so far. In typical environments, your data would be available in a relational database (or some other common datastore) and spread across multiple tables, documents, or files. To access it, you would first need to get your credentials and access authorizations, and familiarize yourself with the data schema. I will add more methods as I encounter them.
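
As a rough illustration of the relational-database case, the sketch below uses an in-memory SQLite database and pandas' read_sql_query to join two tables into one data frame. The table names, columns, and values are invented for the example. A real database would need its own connection details and credentials.

```python
import sqlite3

import pandas as pd

# Hypothetical schema: data about districts spread across two tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE districts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prices (district_id INTEGER, median_price REAL);
    INSERT INTO districts VALUES (1, 'North'), (2, 'South');
    INSERT INTO prices VALUES (1, 250000.0), (2, 180000.0);
    """
)

# Join the tables in SQL and load the result directly into a data frame.
query = """
    SELECT d.name, p.median_price
    FROM districts AS d
    JOIN prices AS p ON p.district_id = d.id
    ORDER BY d.id
"""
db_df = pd.read_sql_query(query, conn)
conn.close()

print(db_df)
```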
