scikit-learn数据集

我们将介绍sklearn中的数据集类，模块包括用于加载数据集的实用程序，包括加载和获取流行参考数据集的方法。它还具有一些人工数据生成器。

sklearn.datasets

（1）datasets.load_*()

获取小规模数据集，数据包含在datasets里

（2）datasets.fetch_*()

获取大规模数据集，需要从网络上下载，函数的第一个参数是data_home，表示数据集下载的目录，默认是 ~/scikit_learn_data/，要修改默认目录，可以修改环境变量SCIKIT_LEARN_DATA

（3）datasets.make_*()

本地生成数据集

load*和 fetch* 函数返回的数据类型是 datasets.base.Bunch，本质上是一个 dict，它的键值对可用通过对象的属性方式访问。主要包含以下属性：

data：特征数据数组，是 n_samples * n_features 的二维 numpy.ndarray 数组
target：标签数组，是 n_samples 的一维 numpy.ndarray 数组
DESCR：数据描述
feature_names：特征名
target_names：标签名

数据集目录可以通过datasets.get_data_home()获取，clear_data_home(data_home=None)删除所有下载数据

datasets.get_data_home(data_home=None)

返回scikit学习数据目录的路径。这个文件夹被一些大的数据集装载器使用，以避免下载数据。默认情况下，数据目录设置为用户主文件夹中名为“scikit_learn_data”的文件夹。或者，可以通过“SCIKIT_LEARN_DATA”环境变量或通过给出显式的文件夹路径以编程方式设置它。'〜'符号扩展到用户主文件夹。如果文件夹不存在，则会自动创建。

sklearn.datasets.clear_data_home(data_home=None)

删除存储目录中的数据

获取小数据集

用于分类

sklearn.datasets.load_iris

class sklearn.datasets.load_iris(return_X_y=False)

  """

  加载并返回虹膜数据集

  :param return_X_y: 如果为True，则返回而不是Bunch对象，默认为False

  :return: Bunch对象，如果return_X_y为True，那么返回tuple，（data,target）

  """

In [12]: from sklearn.datasets import load_iris

    ...: data = load_iris()

    ...:

In [13]: data.target

Out[13]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [14]: data.feature_names

Out[14]:

['sepal length (cm)',

 'sepal width (cm)',

 'petal length (cm)',

 'petal width (cm)']

In [15]: data.target_names

Out[15]:

array(['setosa', 'versicolor', 'virginica'],

      dtype='|S10')

In [17]: data.target[[1,10, 100]]

Out[17]: array([0, 0, 2])

名称	数量
类别	3
特征	4
样本数量	150
每个类别数量	50

sklearn.datasets.load_digits

class sklearn.datasets.load_digits(n_class=10, return_X_y=False)

    """

    加载并返回数字数据集

    :param n_class: 整数，介于0和10之间，可选（默认= 10，要返回的类的数量

    :param return_X_y: 如果为True，则返回而不是Bunch对象，默认为False

    :return: Bunch对象，如果return_X_y为True，那么返回tuple，（data,target）

    """

In [20]: from sklearn.datasets import load_digits

In [21]: digits = load_digits()

In [22]: print(digits.data.shape)

(1797, 64)

In [23]: digits.target

Out[23]: array([0, 1, 2, ..., 8, 9, 8])

In [24]: digits.target_names

Out[24]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [25]: digits.images

Out[25]:

array([[[  0.,   0.,   5., ...,   1.,   0.,   0.],

        [  0.,   0.,  13., ...,  15.,   5.,   0.],

        [  0.,   3.,  15., ...,  11.,   8.,   0.],

        ...,

        [  0.,   4.,  11., ...,  12.,   7.,   0.],

        [  0.,   2.,  14., ...,  12.,   0.,   0.],

        [  0.,   0.,   6., ...,   0.,   0.,   0.]],

        [[  0.,   0.,  10., ...,   1.,   0.,   0.],

        [  0.,   2.,  16., ...,   1.,   0.,   0.],

        [  0.,   0.,  15., ...,  15.,   0.,   0.],

        ...,

        [  0.,   4.,  16., ...,  16.,   6.,   0.],

        [  0.,   8.,  16., ...,  16.,   8.,   0.],

        [  0.,   1.,   8., ...,  12.,   1.,   0.]]])

名称	数量
类别	10
特征	64
样本数量	1797

用于回归

sklearn.datasets.load_boston

class  sklearn.datasets.load_boston(return_X_y=False)

  """

  加载并返回波士顿房价数据集

  :param return_X_y: 如果为True，则返回而不是Bunch对象，默认为False

  :return: Bunch对象，如果return_X_y为True，那么返回tuple，（data,target）

  """

In [34]: from sklearn.datasets import load_boston

In [35]: boston = load_boston()

In [36]: boston.data.shape

Out[36]: (506, 13)

In [37]: boston.feature_names

Out[37]:

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',

       'TAX', 'PTRATIO', 'B', 'LSTAT'],

      dtype='|S7')

In [38]:

名称	数量
目标类别	5-50
特征	13
样本数量	506

sklearn.datasets.load_diabetes

class sklearn.datasets.load_diabetes(return_X_y=False)

  """

  加载和返回糖尿病数据集

  :param return_X_y: 如果为True，则返回而不是Bunch对象，默认为False

  :return: Bunch对象，如果return_X_y为True，那么返回tuple，（data,target）

  """

In [13]:  from sklearn.datasets import load_diabetes

In [14]: diabetes = load_diabetes()

In [15]: diabetes.data

Out[15]:

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,

         0.01990842, -0.01764613],

       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,

        -0.06832974, -0.09220405],

       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,

         0.00286377, -0.02593034],

       ...,

       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,

        -0.04687948,  0.01549073],

       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,

         0.04452837, -0.02593034],

       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,

        -0.00421986,  0.00306441]])

名称	数量
目标范围	25-346
特征	10
样本数量	442

获取大数据集

sklearn.datasets.fetch_20newsgroups

class sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)

  """

  加载20个新闻组数据集中的文件名和数据

  :param subset: 'train'或者'test','all'，可选，选择要加载的数据集：训练集的“训练”，测试集的“测试”，两者的“全部”，具有洗牌顺序

  :param data_home: 可选，默认值：无，指定数据集的下载和缓存文件夹。如果没有，所有scikit学习数据都存储在'〜/ scikit_learn_data'子文件夹中

  :param categories: 无或字符串或Unicode的集合，如果没有（默认），加载所有类别。如果不是无，要加载的类别名称列表（忽略其他类别）

  :param shuffle: 是否对数据进行洗牌

  :param random_state: numpy随机数生成器或种子整数

  :param download_if_missing: 可选，默认为True，如果False，如果数据不在本地可用而不是尝试从源站点下载数据，则引发IOError

  :param remove: 元组

  """

In [29]: from sklearn.datasets import fetch_20newsgroups

In [30]: data_test = fetch_20newsgroups(subset='test',shuffle=True, random_sta

    ...: te=42)

In [31]: data_train = fetch_20newsgroups(subset='train',shuffle=True, random_s

    ...: tate=42)

sklearn.datasets.fetch_20newsgroups_vectorized

class sklearn.datasets.fetch_20newsgroups_vectorized(subset='train', remove=(), data_home=None)

  """

  加载20个新闻组数据集并将其转换为tf-idf向量，这是一个方便的功能; 使用sklearn.feature_extraction.text.Vectorizer的默认设置完成tf-idf 转换。对于更高级的使用（停止词过滤，n-gram提取等），将fetch_20newsgroup与自定义Vectorizer或CountVectorizer组合在一起

  :param subset: 'train'或者'test','all'，可选，选择要加载的数据集：训练集的“训练”，测试集的“测试”，两者的“全部”，具有洗牌顺序

  :param data_home: 可选，默认值：无，指定数据集的下载和缓存文件夹。如果没有，所有scikit学习数据都存储在'〜/ scikit_learn_data'子文件夹中

  :param remove: 元组

  """

In [57]: from sklearn.datasets import fetch_20newsgroups_vectorized

In [58]: bunch = fetch_20newsgroups_vectorized(subset='all')

In [59]: from sklearn.utils import shuffle

In [60]: X, y = shuffle(bunch.data, bunch.target)

    ...: offset = int(X.shape[0] * 0.8)

    ...: X_train, y_train = X[:offset], y[:offset]

    ...: X_test, y_test = X[offset:], y[offset:]

    ...:

获取本地生成数据

生成本地分类数据：

sklearn.datasets.make_classification

class make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

"""

生成用于分类的数据集

:param n_samples:int，optional（default = 100)，样本数量

:param n_features:int，可选（默认= 20），特征总数

:param n_classes:int，可选（default = 2),类（或标签）的分类问题的数量

:param random_state:int，RandomState实例或无，可选（默认=无）

  如果int，random_state是随机数生成器使用的种子; 如果RandomState的实例，random_state是随机数生成器; 如果没有，随机数生成器所使用的RandomState实例np.random

:return :X,特征数据集；y,目标分类值

"""

from sklearn.datasets.samples_generator import make_classification

X,y= datasets.make_classification(n_samples=100000, n_features=20,n_informative=2, n_redundant=10,random_state=42)

生成本地回归数据：

sklearn.datasets.make_regression

class make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)

  """

  生成用于回归的数据集

  :param n_samples:int，optional（default = 100)，样本数量

  :param  n_features:int,optional（default = 100)，特征数量

  :param  coef:boolean，optional（default = False），如果为True，则返回底层线性模型的系数

  :param random_state:int，RandomState实例或无，可选（默认=无）

    如果int，random_state是随机数生成器使用的种子; 如果RandomState的实例，random_state是随机数生成器; 如果没有，随机数生成器所使用的RandomState实例np.random

  :return :X,特征数据集；y,目标值

  """

from sklearn.datasets.samples_generator import make_regression

X, y = make_regression(n_samples=200, n_features=5000, random_state=42)

数据的分类与划分

数据集返回的类型

load和fetch返回的数据类型datasets.base.Bunch（字典格式）
data: 特征数据数组，是[n_samples * n_features]的二维numpy.ndarray数组
target:标签数组，是n_samples的一维numpy.ndarray数组
DESCR:数据描述
feature_names: 特征名,新闻数据，手写数字、回归数据集没有
target_names:标签名,回归数据集没有

分类数据集

鸢尾花数据集

In [ ]:

from sklearn.datasets import load_iris

# 实例化数据集

iris = load_iris()

#查看特征值

iris.data

Out[ ]:

array([[5.1, 3.5, 1.4, 0.2],

       [4.9, 3. , 1.4, 0.2],

       [4.7, 3.2, 1.3, 0.2],

       [4.6, 3.1, 1.5, 0.2],

       [5. , 3.6, 1.4, 0.2],

       [5.4, 3.9, 1.7, 0.4],

       [4.6, 3.4, 1.4, 0.3],

       [5. , 3.4, 1.5, 0.2],

       [4.4, 2.9, 1.4, 0.2],

       [4.9, 3.1, 1.5, 0.1],

       [5.4, 3.7, 1.5, 0.2],

       [4.8, 3.4, 1.6, 0.2],

       [4.8, 3. , 1.4, 0.1],

       [4.3, 3. , 1.1, 0.1],

       [5.8, 4. , 1.2, 0.2],

       [5.7, 4.4, 1.5, 0.4],

       [5.4, 3.9, 1.3, 0.4],

       [5.1, 3.5, 1.4, 0.3],

       [5.7, 3.8, 1.7, 0.3],

       [5.1, 3.8, 1.5, 0.3],

       [5.4, 3.4, 1.7, 0.2],

       [5.1, 3.7, 1.5, 0.4],

       [4.6, 3.6, 1. , 0.2],

       [5.1, 3.3, 1.7, 0.5],

       [4.8, 3.4, 1.9, 0.2],

       [5. , 3. , 1.6, 0.2],

       [5. , 3.4, 1.6, 0.4],

       [5.2, 3.5, 1.5, 0.2],

       [5.2, 3.4, 1.4, 0.2],

       [4.7, 3.2, 1.6, 0.2],

       [4.8, 3.1, 1.6, 0.2],

       [5.4, 3.4, 1.5, 0.4],

       [5.2, 4.1, 1.5, 0.1],

       [5.5, 4.2, 1.4, 0.2],

       [4.9, 3.1, 1.5, 0.2],

       [5. , 3.2, 1.2, 0.2],

       [5.5, 3.5, 1.3, 0.2],

       [4.9, 3.6, 1.4, 0.1],

       [4.4, 3. , 1.3, 0.2],

       [5.1, 3.4, 1.5, 0.2],

       [5. , 3.5, 1.3, 0.3],

       [4.5, 2.3, 1.3, 0.3],

       [4.4, 3.2, 1.3, 0.2],

       [5. , 3.5, 1.6, 0.6],

       [5.1, 3.8, 1.9, 0.4],

       [4.8, 3. , 1.4, 0.3],

       [5.1, 3.8, 1.6, 0.2],

       [4.6, 3.2, 1.4, 0.2],

       [5.3, 3.7, 1.5, 0.2],

       [5. , 3.3, 1.4, 0.2],

       [7. , 3.2, 4.7, 1.4],

       [6.4, 3.2, 4.5, 1.5],

       [6.9, 3.1, 4.9, 1.5],

       [5.5, 2.3, 4. , 1.3],

       [6.5, 2.8, 4.6, 1.5],

       [5.7, 2.8, 4.5, 1.3],

       [6.3, 3.3, 4.7, 1.6],

       [4.9, 2.4, 3.3, 1. ],

       [6.6, 2.9, 4.6, 1.3],

       [5.2, 2.7, 3.9, 1.4],

       [5. , 2. , 3.5, 1. ],

       [5.9, 3. , 4.2, 1.5],

       [6. , 2.2, 4. , 1. ],

       [6.1, 2.9, 4.7, 1.4],

       [5.6, 2.9, 3.6, 1.3],

       [6.7, 3.1, 4.4, 1.4],

       [5.6, 3. , 4.5, 1.5],

       [5.8, 2.7, 4.1, 1. ],

       [6.2, 2.2, 4.5, 1.5],

       [5.6, 2.5, 3.9, 1.1],

       [5.9, 3.2, 4.8, 1.8],

       [6.1, 2.8, 4. , 1.3],

       [6.3, 2.5, 4.9, 1.5],

       [6.1, 2.8, 4.7, 1.2],

       [6.4, 2.9, 4.3, 1.3],

       [6.6, 3. , 4.4, 1.4],

       [6.8, 2.8, 4.8, 1.4],

       [6.7, 3. , 5. , 1.7],

       [6. , 2.9, 4.5, 1.5],

       [5.7, 2.6, 3.5, 1. ],

       [5.5, 2.4, 3.8, 1.1],

       [5.5, 2.4, 3.7, 1. ],

       [5.8, 2.7, 3.9, 1.2],

       [6. , 2.7, 5.1, 1.6],

       [5.4, 3. , 4.5, 1.5],

       [6. , 3.4, 4.5, 1.6],

       [6.7, 3.1, 4.7, 1.5],

       [6.3, 2.3, 4.4, 1.3],

       [5.6, 3. , 4.1, 1.3],

       [5.5, 2.5, 4. , 1.3],

       [5.5, 2.6, 4.4, 1.2],

       [6.1, 3. , 4.6, 1.4],

       [5.8, 2.6, 4. , 1.2],

       [5. , 2.3, 3.3, 1. ],

       [5.6, 2.7, 4.2, 1.3],

       [5.7, 3. , 4.2, 1.2],

       [5.7, 2.9, 4.2, 1.3],

       [6.2, 2.9, 4.3, 1.3],

       [5.1, 2.5, 3. , 1.1],

       [5.7, 2.8, 4.1, 1.3],

       [6.3, 3.3, 6. , 2.5],

       [5.8, 2.7, 5.1, 1.9],

       [7.1, 3. , 5.9, 2.1],

       [6.3, 2.9, 5.6, 1.8],

       [6.5, 3. , 5.8, 2.2],

       [7.6, 3. , 6.6, 2.1],

       [4.9, 2.5, 4.5, 1.7],

       [7.3, 2.9, 6.3, 1.8],

       [6.7, 2.5, 5.8, 1.8],

       [7.2, 3.6, 6.1, 2.5],

       [6.5, 3.2, 5.1, 2. ],

       [6.4, 2.7, 5.3, 1.9],

       [6.8, 3. , 5.5, 2.1],

       [5.7, 2.5, 5. , 2. ],

       [5.8, 2.8, 5.1, 2.4],

       [6.4, 3.2, 5.3, 2.3],

       [6.5, 3. , 5.5, 1.8],

       [7.7, 3.8, 6.7, 2.2],

       [7.7, 2.6, 6.9, 2.3],

       [6. , 2.2, 5. , 1.5],

       [6.9, 3.2, 5.7, 2.3],

       [5.6, 2.8, 4.9, 2. ],

       [7.7, 2.8, 6.7, 2. ],

       [6.3, 2.7, 4.9, 1.8],

       [6.7, 3.3, 5.7, 2.1],

       [7.2, 3.2, 6. , 1.8],

       [6.2, 2.8, 4.8, 1.8],

       [6.1, 3. , 4.9, 1.8],

       [6.4, 2.8, 5.6, 2.1],

       [7.2, 3. , 5.8, 1.6],

       [7.4, 2.8, 6.1, 1.9],

       [7.9, 3.8, 6.4, 2. ],

       [6.4, 2.8, 5.6, 2.2],

       [6.3, 2.8, 5.1, 1.5],

       [6.1, 2.6, 5.6, 1.4],

       [7.7, 3. , 6.1, 2.3],

       [6.3, 3.4, 5.6, 2.4],

       [6.4, 3.1, 5.5, 1.8],

       [6. , 3. , 4.8, 1.8],

       [6.9, 3.1, 5.4, 2.1],

       [6.7, 3.1, 5.6, 2.4],

       [6.9, 3.1, 5.1, 2.3],

       [5.8, 2.7, 5.1, 1.9],

       [6.8, 3.2, 5.9, 2.3],

       [6.7, 3.3, 5.7, 2.5],

       [6.7, 3. , 5.2, 2.3],

       [6.3, 2.5, 5. , 1.9],

       [6.5, 3. , 5.2, 2. ],

       [6.2, 3.4, 5.4, 2.3],

       [5.9, 3. , 5.1, 1.8]])

In [ ]:

#查看数据形状

iris.data.shape

# 4个特征，150个样本

Out[ ]:

(150, 4)

In [ ]:

# 查看标签数组

iris.target

Out[ ]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [ ]:

# 查看数据的描述

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset

--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)

    :Number of Attributes: 4 numeric, predictive attributes and the class

    :Attribute Information:

        - sepal length in cm

        - sepal width in cm

        - petal length in cm

        - petal width in cm

        - class:

                - Iris-Setosa

                - Iris-Versicolour

                - Iris-Virginica

    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================

                    Min  Max   Mean    SD   Class Correlation

    ============== ==== ==== ======= ===== ====================

    sepal length:   4.3  7.9   5.84   0.83    0.7826

    sepal width:    2.0  4.4   3.05   0.43   -0.4194

    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)

    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None

    :Class Distribution: 33.3% for each of 3 classes.

    :Creator: R.A. Fisher

    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)

    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken

from Fisher's paper. Note that it's the same as in R, but not as in the UCI

Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the

pattern recognition literature.  Fisher's paper is a classic in the field and

is referenced frequently to this day.  (See Duda & Hart, for example.)  The

data set contains 3 classes of 50 instances each, where each class refers to a

type of iris plant.  One class is linearly separable from the other 2; the

latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"

     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to

     Mathematical Statistics" (John Wiley, NY, 1950).

   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.

     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.

   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System

     Structure and Classification Rule for Recognition in Partially Exposed

     Environments".  IEEE Transactions on Pattern Analysis and Machine

     Intelligence, Vol. PAMI-2, No. 1, 67-71.

   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions

     on Information Theory, May 1972, 431-433.

   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II

     conceptual clustering system finds 3 classes in the data.

   - Many, many more ...

In [ ]:

# 查看特征名

iris.feature_names

Out[ ]:

['sepal length (cm)',

 'sepal width (cm)',

 'petal length (cm)',

 'petal width (cm)']

In [ ]:

# 查看目标名

iris.target_names

Out[ ]:

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

回归数据集

糖尿病数据集

In [ ]:

from sklearn.datasets import load_diabetes

#实例化

diabetes = load_diabetes()

print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset

----------------

Ten baseline variables, age, sex, body mass index, average blood

pressure, and six blood serum measurements were obtained for each of n =

442 diabetes patients, as well as the response of interest, a

quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:

      - age     age in years

      - sex

      - bmi     body mass index

      - bp      average blood pressure

      - s1      tc, total serum cholesterol

      - s2      ldl, low-density lipoproteins

      - s3      hdl, high-density lipoproteins

      - s4      tch, total cholesterol / HDL

      - s5      ltg, possibly log of serum triglycerides level

      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:

https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.

(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

In [ ]:

diabetes.data

Out[ ]:

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,

         0.01990749, -0.01764613],

       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,

        -0.06833155, -0.09220405],

       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,

         0.00286131, -0.02593034],

       ...,

       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,

        -0.04688253,  0.01549073],

       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,

         0.04452873, -0.02593034],

       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,

        -0.00422151,  0.00306441]])

In [ ]:

diabetes.target

Out[ ]:

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,

        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,

        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,

        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,

       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,

       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,

       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,

       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,

        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,

        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,

       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,

       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,

       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,

        60., 174., 259., 178., 128.,  96., 126., 288.,  88., 292.,  71.,

       197., 186.,  25.,  84.,  96., 195.,  53., 217., 172., 131., 214.,

        59.,  70., 220., 268., 152.,  47.,  74., 295., 101., 151., 127.,

       237., 225.,  81., 151., 107.,  64., 138., 185., 265., 101., 137.,

       143., 141.,  79., 292., 178.,  91., 116.,  86., 122.,  72., 129.,

       142.,  90., 158.,  39., 196., 222., 277.,  99., 196., 202., 155.,

        77., 191.,  70.,  73.,  49.,  65., 263., 248., 296., 214., 185.,

        78.,  93., 252., 150.,  77., 208.,  77., 108., 160.,  53., 220.,

       154., 259.,  90., 246., 124.,  67.,  72., 257., 262., 275., 177.,

        71.,  47., 187., 125.,  78.,  51., 258., 215., 303., 243.,  91.,

       150., 310., 153., 346.,  63.,  89.,  50.,  39., 103., 308., 116.,

       145.,  74.,  45., 115., 264.,  87., 202., 127., 182., 241.,  66.,

        94., 283.,  64., 102., 200., 265.,  94., 230., 181., 156., 233.,

        60., 219.,  80.,  68., 332., 248.,  84., 200.,  55.,  85.,  89.,

        31., 129.,  83., 275.,  65., 198., 236., 253., 124.,  44., 172.,

       114., 142., 109., 180., 144., 163., 147.,  97., 220., 190., 109.,

       191., 122., 230., 242., 248., 249., 192., 131., 237.,  78., 135.,

       244., 199., 270., 164.,  72.,  96., 306.,  91., 214.,  95., 216.,

       263., 178., 113., 200., 139., 139.,  88., 148.,  88., 243.,  71.,

        77., 109., 272.,  60.,  54., 221.,  90., 311., 281., 182., 321.,

        58., 262., 206., 233., 242., 123., 167.,  63., 197.,  71., 168.,

       140., 217., 121., 235., 245.,  40.,  52., 104., 132.,  88.,  69.,

       219.,  72., 201., 110.,  51., 277.,  63., 118.,  69., 273., 258.,

        43., 198., 242., 232., 175.,  93., 168., 275., 293., 281.,  72.,

       140., 189., 181., 209., 136., 261., 113., 131., 174., 257.,  55.,

        84.,  42., 146., 212., 233.,  91., 111., 152., 120.,  67., 310.,

        94., 183.,  66., 173.,  72.,  49.,  64.,  48., 178., 104., 132.,

       220.,  57.])

In [ ]:

print("糖尿病数据集的特征名：{0}\n糖尿病数据集的目标值文件名：{1}".format(diabetes.feature_names, diabetes.target_filename))

糖尿病数据集的特征名：['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

糖尿病数据集的目标值文件名：diabetes_target.csv.gz

数据集分割（训练集与测试集）

In [ ]:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.25)

Out[ ]:

(111,)

机器学习基础04DAY的更多相关文章

Coursera 机器学习课程机器学习基础：案例研究证书
完成了课程1 机器学习基础:案例研究贴个证书,继续努力完成后续的课程:
Coursera台大机器学习基础课程1
Coursera台大机器学习基础课程学习笔记 -- 1 最近在跟台大的这个课程,觉得不错,想把学习笔记发出来跟大家分享下,有错误希望大家指正. 一机器学习是什么? 感觉和 Tom M. Mitche ...
机器学习 —— 基础整理（六）线性判别函数：感知器、松弛算法、Ho-Kashyap算法
这篇总结继续复习分类问题.本文简单整理了以下内容: (一)线性判别函数与广义线性判别函数 (二)感知器 (三)松弛算法 (四)Ho-Kashyap算法闲话:本篇是本系列［机器学习基础整理］在time ...
算法工程师<机器学习基础>
<机器学习基础> 逻辑回归,SVM,决策树 1.逻辑回归和SVM的区别是什么?各适用于解决什么问题? https://www.zhihu.com/question/24904422 2.L ...
数据分析之Matplotlib和机器学习基础
一.Matplotlib基础知识 Matplotlib 是一个 Python 的 2D绘图库,它以各种硬拷贝格式和跨平台的交互式环境生成出版质量级别的图形. 通过 Matplotlib,开发者可以仅需 ...
【dlbook】机器学习基础
[机器学习基础] 模型的 vc dimension 如何衡量? 如何根据网络结构衡量模型容量?有效容量和模型容量之间的关系? 统计学习理论中边界不用于深度学习之中,原因? 1.边界通常比较松, 2.深 ...
Python机器学习基础教程-第2章-监督学习之决策树集成
前言本系列教程基本就是摘抄<Python机器学习基础教程>中的例子内容. 为了便于跟踪和学习,本系列教程在Github上提供了jupyter notebook 版本: Github仓库: ...
Python机器学习基础教程-第2章-监督学习之决策树
前言本系列教程基本就是摘抄<Python机器学习基础教程>中的例子内容. 为了便于跟踪和学习,本系列教程在Github上提供了jupyter notebook 版本: Github仓库: ...
Python机器学习基础教程-第2章-监督学习之线性模型
前言本系列教程基本就是摘抄<Python机器学习基础教程>中的例子内容. 为了便于跟踪和学习,本系列教程在Github上提供了jupyter notebook 版本: Github仓库: ...
Python机器学习基础教程-第2章-监督学习之K近邻
前言本系列教程基本就是摘抄<Python机器学习基础教程>中的例子内容. 为了便于跟踪和学习,本系列教程在Github上提供了jupyter notebook 版本: Github仓库: ...

随机推荐

python学习笔记-简介
python简介 python是一种简单易学,功能强大的编程语言,他有高效的高层数据结构,简单而有效的实现面向对象编程.python是一种解释性语言,在多数平台的多个领域都是理想的脚本语言,特别适用于 ...
WPF Binding表达式
前言: WPF BindingBinding表达式的使用,可以很方便的绑定参数和更新界面数据. 1.界面添加控件,并设置对应属性的Binding表达式,例如: <Window x:Class=& ...
安装python及环境搭建
操作系统是windows7 64位打算使用visual studio code进行代码编写 1.先安装visual studio code去visual studio 官网下载VS code htt ...
SQL之rand,round,floor,ceiling,cast小数处理函数
rand():取随机数,select rand() from T 结果:0.635811742495648 round():保留N位小数,四舍五入 select round(1.0446,N) flo ...
openSUSE 15.4 安装 Deepin Wine QQ
1. 准备: deepin-wine5 deepin-wine-qq deepin-wine-helper 这三个包我是在openSUSE网站上搜索到的,https://software.opensu ...
Ribbon负载均衡的实现流程简要分析
SpringCloud中使用Netflix方案做分布式时,只需要在RestTemplate的bean定义上加一个注解@LoadBalanced,无需做其它任何操作就可以开启负载均衡,怎么做到的呢? 不 ...
k8s配置拉取镜像密钥
一.部署步骤 1.创建阿里云镜像仓库 2.创建Secret绑定镜像仓库账号 3.创建Deployment绑定Secret 二.创建阿里云镜像仓库 1.进入阿里云容器镜像服务,创建个人版实例 2.设置登 ...
rgb变为灰度图像
close all;clc; x = imread('C:\timg.jpg'); %读取rgb图片信息I = rgb2gray(x);%将rgb图像转化为灰度图像 set(0,'defaultFig ...
linux sync命令
Linux sync命令用于数据同步,sync命令是在关闭Linux系统时使用的. Linux 系统中欲写入硬盘的资料有的时候为了效率起见,会写到 filesystem buffer 中,这个 buf ...
NET Core 部署IIS 碰到得问题解决（内托管模式超时、不允许得请求谓词、直接请求无响应、拒绝服务405）
web.config 配置说明典型的web.confg 配置. 注意其中hostingModel模式和requestTimeout 进程内托管需要注意使用单独的应用程序池: 请求超时默认5分钟,出错 ...

机器学习基础04DAY

scikit-learn数据集

sklearn.datasets

获取小数据集

获取大数据集

获取本地生成数据

数据的分类与划分

数据集返回的类型

分类数据集

鸢尾花数据集

回归数据集

糖尿病数据集

数据集分割（训练集与测试集）

机器学习基础04DAY的更多相关文章

随机推荐

热门专题