Python Machine Learning: N Ways to Import All Kinds of Data

Reading data with pandas

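I. Importing ordinary files

1. read_csv(), for reading CSV files

The official documentation describes it as: Read CSV (comma-separated) file into DataFrame

Before reading a CSV file you first need to know what one is: the first line holds the column names, every line after it is data, and columns are separated by commas. The line of column names is sometimes omitted, as in the sample below.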

AAPL,28-01-2011, ,344.17,344.4,333.53,336.1,21144800
AAPL,31-01-2011, ,335.8,340.04,334.3,339.32,13473000
AAPL,01-02-2011, ,341.3,345.65,340.98,345.03,15236800
AAPL,02-02-2011, ,344.45,345.25,343.55,344.32,9242600
AAPL,03-02-2011, ,343.8,344.24,338.55,343.44,14064100
AAPL,04-02-2011, ,343.61,346.7,343.51,346.5,11494200
AAPL,07-02-2011, ,347.89,353.25,347.64,351.88,17322100
AAPL,08-02-2011, ,353.68,355.52,352.15,355.2,13608500
AAPL,09-02-2011, ,355.19,359,354.87,358.16,17240800
AAPL,10-02-2011, ,357.39,360,348,354.54,33162400
AAPL,11-02-2011, ,354.75,357.8,353.54,356.85,13127500
AAPL,14-02-2011, ,356.79,359.48,356.71,359.18,11086200
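
To make the layout concrete, here is a minimal sketch that parses the first three sample rows from an in-memory string. The column names are my own guesses (the sample has no header line and the post does not say what each field means), so treat them as hypothetical labels only.

```python
import io

import pandas as pd

# First three rows of the sample above, kept verbatim.
sample = (
    "AAPL,28-01-2011, ,344.17,344.4,333.53,336.1,21144800\n"
    "AAPL,31-01-2011, ,335.8,340.04,334.3,339.32,13473000\n"
    "AAPL,01-02-2011, ,341.3,345.65,340.98,345.03,15236800\n"
)

# Hypothetical column names: the file itself carries none.
columns = ["symbol", "date", "blank", "open", "high", "low", "close", "volume"]

# header=None tells pandas that the first line is data, not column names.
prices = pd.read_csv(io.StringIO(sample), header=None, names=columns)

# The dates are written day-month-year, so state the format explicitly.
prices["date"] = pd.to_datetime(prices["date"], format="%d-%m-%Y")

print(prices[["symbol", "date", "close", "volume"]])
```

Each comma-separated field becomes one column, and the numeric fields are converted to numbers automatically.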

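Importing a CSV file with the read_csv function

Syntax: read_csv(file, encoding)

file: the path to the CSV file. If the file is in the working directory, the file name alone is enough; otherwise include its path as well.

encoding: the file's character encoding. It has to match the encoding the file was actually saved in; for Chinese text saved as UTF-8, set it to utf-8.

%cd "E:\WorkSpace\Python" --------------> sets the working directory

Example: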

<pre name="code" class="python">In[35]:%cd "E:\WorkSpace\Python"
from pandas import read_csv
cs = read_csv("student.csv",encoding='utf-8')
cs
E:\WorkSpace\Python

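Running the code above raised an error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xb8 in position 0: invalid start byte
So I changed cs = read_csv("student.csv",encoding='utf-8') to cs = read_csv("student.csv"). The error went away and the English text displayed correctly, but the Chinese text came out garbled.

Then I wondered whether the file's own encoding was the problem, so I checked it in Notepad. Sure enough, the file was saved in the default ANSI encoding. I changed the file's encoding to UTF-8 in Notepad and the read succeeded. Mine cleared. The result: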

In[35]:%cd "E:\WorkSpace\Python"
from pandas import read_csv
cs = read_csv("student.csv",encoding='utf-8')
cs
E:\WorkSpace\Python
Out[35]:
付靖玲 23 女
0 Jeny 24 女
1 Tom 25 男
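
Re-saving the file is one fix; another is to tell read_csv which encoding the file really uses. The sketch below assumes that "ANSI" on a Chinese-locale Windows machine means GBK, which is typical but not guaranteed, and it uses a temporary file as a stand-in for student.csv.

```python
import os
import tempfile

import pandas as pd

# Stand-in for student.csv, written in GBK (assumed meaning of "ANSI" here).
rows = "付靖玲,23,女\nJeny,24,女\nTom,25,男\n"
path = os.path.join(tempfile.mkdtemp(), "student_gbk.csv")
with open(path, "w", encoding="gbk", newline="") as f:
    f.write(rows)

# Declaring the wrong encoding reproduces the UnicodeDecodeError.
try:
    pd.read_csv(path, encoding="utf-8", header=None)
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Declaring the encoding the file was saved in reads it correctly.
cs = pd.read_csv(path, encoding="gbk", header=None)

print(utf8_failed)
print(cs)
```

Either approach works; the point is that the encoding argument and the file's real encoding have to agree.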

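But then a new problem appeared: my data has no header row, and by default read_csv took my first data row as the column names. So the mine-clearing continued. A look at the pandas API made everything clear: read_csv takes a lot of parameters. Its declaration is as follows: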

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

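I won't go through every parameter here; when you need one, look it up in the API documentation, which explains them all clearly. Back to my mine: according to the API, the key to this problem is the header parameter, which is documented as follows: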

header : int or list of ints, default ‘infer’

Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

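In other words, this parameter says which row supplies the column names. By default row 0 is taken as the column names; if the file has no header row, set the parameter to None. So I changed the code to the following: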

cs = read_csv("student.csv",encoding='utf-8',header=None)
cs
Out[45]:
0 1 2
0 付靖玲 23 女
1 Jeny 24 女
2 Tom 25 男

Another mine successfully cleared.
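
The effect of header is easy to compare side by side. This is a small sketch that uses an in-memory string in place of student.csv; the names name, age and gender are my own labels, not something from the original file.

```python
import io

import pandas as pd

text = "付靖玲,23,女\nJeny,24,女\nTom,25,男\n"

# Default (header='infer'): the first line is consumed as column names.
inferred = pd.read_csv(io.StringIO(text))

# header=None: every line is data and the columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(text), header=None)

# Passing names without header also treats every line as data.
named = pd.read_csv(io.StringIO(text), names=["name", "age", "gender"])

print(list(inferred.columns), len(inferred))
print(list(no_header.columns), len(no_header))
print(list(named.columns), len(named))
```

The last call relies on the behaviour quoted above: when names is passed and header is left alone, pandas behaves as if header=None.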

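2. read_table()

Reads an ordinary text file. Text files are freer than CSV files: they may have no column names, the column separator is not fixed, and the file extension can be anything.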

The API documentation describes it as: Read general delimited file into DataFrame

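The commonly used parameters are:

filepath_or_buffer: the file path

sep: the delimiter. The default is a tab ('\t'), so a file that contains no tabs is imported as a single column. The delimiter can also be a multi-character string, as in the example below.

names: the column names to use. If it is not given, the first row is taken as the column names.

encoding: the file encoding. It has to match the encoding the file was saved in (utf-8 for Chinese text saved as UTF-8).

Example: my file contains the lines shown below, and I read it using love as the delimiter.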

12love34
34love545
343love455
767love545

>>> import pandas as pd
>>> ca = pd.read_table("a.txt",sep='love',header=None)
>>> ca
0 1
0 12 34
1 34 545
2 343 455
3 767 545

The result above uses the default column names 0 and 1. What if I want to give the columns names of my own?

>>> ca = pd.read_table("a.txt",sep='love',names=['start','end'],header=None)
>>> ca
start end
0 12 34
1 34 545
2 343 455
3 767 545
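
For comparison, here is a sketch of what happens with and without sep, using an in-memory string in place of a.txt. A delimiter longer than one character is interpreted as a regular expression and handled by the slower Python parsing engine, so I pass engine="python" explicitly; that choice is mine rather than something the original example required.

```python
import io

import pandas as pd

text = "12love34\n34love545\n343love455\n767love545\n"

# With the default tab delimiter there is nothing to split on,
# so each line lands in a single column.
one_col = pd.read_table(io.StringIO(text), header=None)

# A multi-character delimiter is treated as a regular expression.
two_cols = pd.read_table(
    io.StringIO(text),
    sep="love",
    header=None,
    names=["start", "end"],
    engine="python",
)

print(one_col.shape)
print(two_cols)
```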

3. read_fwf()

The official documentation describes it as: Read a table of fixed-width formatted lines into DataFrame
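
The post gives no example for this function, so here is a minimal sketch of my own. The data and the column names are made up for illustration; colspecs lists the half-open character ranges that each column occupies.

```python
import io

import pandas as pd

# Hypothetical fixed-width data: every field starts at the same character
# position on every line, padded with spaces.
text = (
    "AAPL  28-01-2011  336.10\n"
    "AAPL  31-01-2011  339.32\n"
)

fwf = pd.read_fwf(
    io.StringIO(text),
    colspecs=[(0, 4), (6, 16), (18, 24)],  # [start, end) of each field
    header=None,
    names=["symbol", "date", "close"],
)

print(fwf)
```

No separator character is needed here, because the column boundaries come from positions alone.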

II. Importing Excel files

4. read_excel()

This function imports Excel files.

The function prototype is declared as follows:

pandas.read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, engine=None, squeeze=False, **kwds)

The commonly used parameters are:

io: the file path. string, path object (pathlib.Path or py._path.local.LocalPath), file-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx

names: the column names. By default the first row is used as the column names, but you can also specify them yourself.

sheetname: the sheet to import. By default the first sheet of the workbook (index 0) is imported.
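
As a rough sketch of these parameters in use, the helper below writes a tiny workbook to memory and reads one sheet back. Two assumptions to flag: newer pandas releases spell the parameter sheet_name rather than the sheetname shown in the older signature above, and reading or writing .xlsx needs an optional engine such as openpyxl to be installed. The function name and the data are made up for illustration.

```python
import io

import pandas as pd


def excel_round_trip():
    """Write a tiny workbook to memory, then read one sheet back."""
    students = pd.DataFrame({"name": ["Jeny", "Tom"], "age": [24, 25]})
    buffer = io.BytesIO()
    # Writing .xlsx needs an optional engine such as openpyxl.
    students.to_excel(buffer, sheet_name="students", index=False)
    buffer.seek(0)
    # header=0: the first row of the sheet supplies the column names.
    return pd.read_excel(buffer, sheet_name="students", header=0)
```

Calling excel_round_trip() returns a DataFrame with the columns name and age, or raises ImportError if no Excel engine is available.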