scikit-learn数据集

我们将介绍sklearn中的数据集类，模块包括用于加载数据集的实用程序，包括加载和获取流行参考数据集的方法。它还具有一些人工数据生成器。

sklearn.datasets

（1）datasets.load_*()

获取小规模数据集，数据包含在datasets里

（2）datasets.fetch_*()

获取大规模数据集，需要从网络上下载，函数的第一个参数是data_home，表示数据集下载的目录，默认是 ~/scikit_learn_data/，要修改默认目录，可以修改环境变量SCIKIT_LEARN_DATA

（3）datasets.make_*()

本地生成数据集

load*和 fetch* 函数返回的数据类型是 datasets.base.Bunch，本质上是一个 dict，它的键值对可用通过对象的属性方式访问。主要包含以下属性：

data：特征数据数组，是 n_samples * n_features 的二维 numpy.ndarray 数组
target：标签数组，是 n_samples 的一维 numpy.ndarray 数组
DESCR：数据描述
feature_names：特征名
target_names：标签名

数据集目录可以通过datasets.get_data_home()获取，clear_data_home(data_home=None)删除所有下载数据

datasets.get_data_home(data_home=None)

返回scikit学习数据目录的路径。这个文件夹被一些大的数据集装载器使用，以避免下载数据。默认情况下，数据目录设置为用户主文件夹中名为“scikit_learn_data”的文件夹。或者，可以通过“SCIKIT_LEARN_DATA”环境变量或通过给出显式的文件夹路径以编程方式设置它。'〜'符号扩展到用户主文件夹。如果文件夹不存在，则会自动创建。

sklearn.datasets.clear_data_home(data_home=None)

删除存储目录中的数据

获取小数据集

用于分类

sklearn.datasets.load_iris

class sklearn.datasets.load_iris(return_X_y=False)

  """

  加载并返回虹膜数据集

  :param return_X_y: 如果为True，则返回而不是Bunch对象，默认为False

  :return: Bunch对象，如果return_X_y为True，那么返回tuple，（data,target）

  """

In [12]: from sklearn.datasets import load_iris

    ...: data = load_iris()

    ...:

In [13]: data.target

Out[13]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [14]: data.feature_names

Out[14]:

['sepal length (cm)',

 'sepal width (cm)',

 'petal length (cm)',

 'petal width (cm)']

In [15]: data.target_names

Out[15]:

array(['setosa', 'versicolor', 'virginica'],

      dtype='|S10')

In [17]: data.target[[1,10, 100]]

Out[17]: array([0, 0, 2])

名称	数量
类别	3
特征	4
样本数量	150
每个类别数量	50

sklearn.datasets.load_digits

class sklearn.datasets.load_digits(n_class=10, return_X_y=False)

    """

    加载并返回数字数据集

    :param n_class: 整数，介于0和10之间，可选（默认= 10，要返回的类的数量

    :param return_X_y: 如果为True，则返回而不是Bunch对象，默认为False

    :return: Bunch对象，如果return_X_y为True，那么返回tuple，（data,target）

    """

In [20]: from sklearn.datasets import load_digits

In [21]: digits = load_digits()

In [22]: print(digits.data.shape)

(1797, 64)

In [23]: digits.target

Out[23]: array([0, 1, 2, ..., 8, 9, 8])

In [24]: digits.target_names

Out[24]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [25]: digits.images

Out[25]:

array([[[  0.,   0.,   5., ...,   1.,   0.,   0.],

        [  0.,   0.,  13., ...,  15.,   5.,   0.],

        [  0.,   3.,  15., ...,  11.,   8.,   0.],

        ...,

        [  0.,   4.,  11., ...,  12.,   7.,   0.],

        [  0.,   2.,  14., ...,  12.,   0.,   0.],

        [  0.,   0.,   6., ...,   0.,   0.,   0.]],

        [[  0.,   0.,  10., ...,   1.,   0.,   0.],

        [  0.,   2.,  16., ...,   1.,   0.,   0.],

        [  0.,   0.,  15., ...,  15.,   0.,   0.],

        ...,

        [  0.,   4.,  16., ...,  16.,   6.,   0.],

        [  0.,   8.,  16., ...,  16.,   8.,   0.],

        [  0.,   1.,   8., ...,  12.,   1.,   0.]]])

名称	数量
类别	10
特征	64
样本数量	1797

用于回归

sklearn.datasets.load_boston

class  sklearn.datasets.load_boston(return_X_y=False)

  """

  加载并返回波士顿房价数据集

  :param return_X_y: 如果为True，则返回而不是Bunch对象，默认为False

  :return: Bunch对象，如果return_X_y为True，那么返回tuple，（data,target）

  """

In [34]: from sklearn.datasets import load_boston

In [35]: boston = load_boston()

In [36]: boston.data.shape

Out[36]: (506, 13)

In [37]: boston.feature_names

Out[37]:

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',

       'TAX', 'PTRATIO', 'B', 'LSTAT'],

      dtype='|S7')

In [38]:

名称	数量
目标类别	5-50
特征	13
样本数量	506

sklearn.datasets.load_diabetes

class sklearn.datasets.load_diabetes(return_X_y=False)

  """

  加载和返回糖尿病数据集

  :param return_X_y: 如果为True，则返回而不是Bunch对象，默认为False

  :return: Bunch对象，如果return_X_y为True，那么返回tuple，（data,target）

  """

In [13]:  from sklearn.datasets import load_diabetes

In [14]: diabetes = load_diabetes()

In [15]: diabetes.data

Out[15]:

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,

         0.01990842, -0.01764613],

       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,

        -0.06832974, -0.09220405],

       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,

         0.00286377, -0.02593034],

       ...,

       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,

        -0.04687948,  0.01549073],

       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,

         0.04452837, -0.02593034],

       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,

        -0.00421986,  0.00306441]])

名称	数量
目标范围	25-346
特征	10
样本数量	442

获取大数据集

sklearn.datasets.fetch_20newsgroups

class sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)

  """

  加载20个新闻组数据集中的文件名和数据

  :param subset: 'train'或者'test','all'，可选，选择要加载的数据集：训练集的“训练”，测试集的“测试”，两者的“全部”，具有洗牌顺序

  :param data_home: 可选，默认值：无，指定数据集的下载和缓存文件夹。如果没有，所有scikit学习数据都存储在'〜/ scikit_learn_data'子文件夹中

  :param categories: 无或字符串或Unicode的集合，如果没有（默认），加载所有类别。如果不是无，要加载的类别名称列表（忽略其他类别）

  :param shuffle: 是否对数据进行洗牌

  :param random_state: numpy随机数生成器或种子整数

  :param download_if_missing: 可选，默认为True，如果False，如果数据不在本地可用而不是尝试从源站点下载数据，则引发IOError

  :param remove: 元组

  """

In [29]: from sklearn.datasets import fetch_20newsgroups

In [30]: data_test = fetch_20newsgroups(subset='test',shuffle=True, random_sta

    ...: te=42)

In [31]: data_train = fetch_20newsgroups(subset='train',shuffle=True, random_s

    ...: tate=42)

sklearn.datasets.fetch_20newsgroups_vectorized

class sklearn.datasets.fetch_20newsgroups_vectorized(subset='train', remove=(), data_home=None)

  """

  加载20个新闻组数据集并将其转换为tf-idf向量，这是一个方便的功能; 使用sklearn.feature_extraction.text.Vectorizer的默认设置完成tf-idf 转换。对于更高级的使用（停止词过滤，n-gram提取等），将fetch_20newsgroup与自定义Vectorizer或CountVectorizer组合在一起

  :param subset: 'train'或者'test','all'，可选，选择要加载的数据集：训练集的“训练”，测试集的“测试”，两者的“全部”，具有洗牌顺序

  :param data_home: 可选，默认值：无，指定数据集的下载和缓存文件夹。如果没有，所有scikit学习数据都存储在'〜/ scikit_learn_data'子文件夹中

  :param remove: 元组

  """

In [57]: from sklearn.datasets import fetch_20newsgroups_vectorized

In [58]: bunch = fetch_20newsgroups_vectorized(subset='all')

In [59]: from sklearn.utils import shuffle

In [60]: X, y = shuffle(bunch.data, bunch.target)

    ...: offset = int(X.shape[0] * 0.8)

    ...: X_train, y_train = X[:offset], y[:offset]

    ...: X_test, y_test = X[offset:], y[offset:]

    ...:

获取本地生成数据

生成本地分类数据：

sklearn.datasets.make_classification

class make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

"""

生成用于分类的数据集

:param n_samples:int，optional（default = 100)，样本数量

:param n_features:int，可选（默认= 20），特征总数

:param n_classes:int，可选（default = 2),类（或标签）的分类问题的数量

:param random_state:int，RandomState实例或无，可选（默认=无）

  如果int，random_state是随机数生成器使用的种子; 如果RandomState的实例，random_state是随机数生成器; 如果没有，随机数生成器所使用的RandomState实例np.random

:return :X,特征数据集；y,目标分类值

"""

from sklearn.datasets.samples_generator import make_classification

X,y= datasets.make_classification(n_samples=100000, n_features=20,n_informative=2, n_redundant=10,random_state=42)

生成本地回归数据：

sklearn.datasets.make_regression

class make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)

  """

  生成用于回归的数据集

  :param n_samples:int，optional（default = 100)，样本数量

  :param  n_features:int,optional（default = 100)，特征数量

  :param  coef:boolean，optional（default = False），如果为True，则返回底层线性模型的系数

  :param random_state:int，RandomState实例或无，可选（默认=无）

    如果int，random_state是随机数生成器使用的种子; 如果RandomState的实例，random_state是随机数生成器; 如果没有，随机数生成器所使用的RandomState实例np.random

  :return :X,特征数据集；y,目标值

  """

from sklearn.datasets.samples_generator import make_regression

X, y = make_regression(n_samples=200, n_features=5000, random_state=42)

数据的分类与划分

数据集返回的类型

load和fetch返回的数据类型datasets.base.Bunch（字典格式）
data: 特征数据数组，是[n_samples * n_features]的二维numpy.ndarray数组
target:标签数组，是n_samples的一维numpy.ndarray数组
DESCR:数据描述
feature_names: 特征名,新闻数据，手写数字、回归数据集没有
target_names:标签名,回归数据集没有

分类数据集

鸢尾花数据集

In [ ]:

from sklearn.datasets import load_iris

# 实例化数据集

iris = load_iris()

#查看特征值

iris.data

Out[ ]:

array([[5.1, 3.5, 1.4, 0.2],

       [4.9, 3. , 1.4, 0.2],

       [4.7, 3.2, 1.3, 0.2],

       [4.6, 3.1, 1.5, 0.2],

       [5. , 3.6, 1.4, 0.2],

       [5.4, 3.9, 1.7, 0.4],

       [4.6, 3.4, 1.4, 0.3],

       [5. , 3.4, 1.5, 0.2],

       [4.4, 2.9, 1.4, 0.2],

       [4.9, 3.1, 1.5, 0.1],

       [5.4, 3.7, 1.5, 0.2],

       [4.8, 3.4, 1.6, 0.2],

       [4.8, 3. , 1.4, 0.1],

       [4.3, 3. , 1.1, 0.1],

       [5.8, 4. , 1.2, 0.2],

       [5.7, 4.4, 1.5, 0.4],

       [5.4, 3.9, 1.3, 0.4],

       [5.1, 3.5, 1.4, 0.3],

       [5.7, 3.8, 1.7, 0.3],

       [5.1, 3.8, 1.5, 0.3],

       [5.4, 3.4, 1.7, 0.2],

       [5.1, 3.7, 1.5, 0.4],

       [4.6, 3.6, 1. , 0.2],

       [5.1, 3.3, 1.7, 0.5],

       [4.8, 3.4, 1.9, 0.2],

       [5. , 3. , 1.6, 0.2],

       [5. , 3.4, 1.6, 0.4],

       [5.2, 3.5, 1.5, 0.2],

       [5.2, 3.4, 1.4, 0.2],

       [4.7, 3.2, 1.6, 0.2],

       [4.8, 3.1, 1.6, 0.2],

       [5.4, 3.4, 1.5, 0.4],

       [5.2, 4.1, 1.5, 0.1],

       [5.5, 4.2, 1.4, 0.2],

       [4.9, 3.1, 1.5, 0.2],

       [5. , 3.2, 1.2, 0.2],

       [5.5, 3.5, 1.3, 0.2],

       [4.9, 3.6, 1.4, 0.1],

       [4.4, 3. , 1.3, 0.2],

       [5.1, 3.4, 1.5, 0.2],

       [5. , 3.5, 1.3, 0.3],

       [4.5, 2.3, 1.3, 0.3],

       [4.4, 3.2, 1.3, 0.2],

       [5. , 3.5, 1.6, 0.6],

       [5.1, 3.8, 1.9, 0.4],

       [4.8, 3. , 1.4, 0.3],

       [5.1, 3.8, 1.6, 0.2],

       [4.6, 3.2, 1.4, 0.2],

       [5.3, 3.7, 1.5, 0.2],

       [5. , 3.3, 1.4, 0.2],

       [7. , 3.2, 4.7, 1.4],

       [6.4, 3.2, 4.5, 1.5],

       [6.9, 3.1, 4.9, 1.5],

       [5.5, 2.3, 4. , 1.3],

       [6.5, 2.8, 4.6, 1.5],

       [5.7, 2.8, 4.5, 1.3],

       [6.3, 3.3, 4.7, 1.6],

       [4.9, 2.4, 3.3, 1. ],

       [6.6, 2.9, 4.6, 1.3],

       [5.2, 2.7, 3.9, 1.4],

       [5. , 2. , 3.5, 1. ],

       [5.9, 3. , 4.2, 1.5],

       [6. , 2.2, 4. , 1. ],

       [6.1, 2.9, 4.7, 1.4],

       [5.6, 2.9, 3.6, 1.3],

       [6.7, 3.1, 4.4, 1.4],

       [5.6, 3. , 4.5, 1.5],

       [5.8, 2.7, 4.1, 1. ],

       [6.2, 2.2, 4.5, 1.5],

       [5.6, 2.5, 3.9, 1.1],

       [5.9, 3.2, 4.8, 1.8],

       [6.1, 2.8, 4. , 1.3],

       [6.3, 2.5, 4.9, 1.5],

       [6.1, 2.8, 4.7, 1.2],

       [6.4, 2.9, 4.3, 1.3],

       [6.6, 3. , 4.4, 1.4],

       [6.8, 2.8, 4.8, 1.4],

       [6.7, 3. , 5. , 1.7],

       [6. , 2.9, 4.5, 1.5],

       [5.7, 2.6, 3.5, 1. ],

       [5.5, 2.4, 3.8, 1.1],

       [5.5, 2.4, 3.7, 1. ],

       [5.8, 2.7, 3.9, 1.2],

       [6. , 2.7, 5.1, 1.6],

       [5.4, 3. , 4.5, 1.5],

       [6. , 3.4, 4.5, 1.6],

       [6.7, 3.1, 4.7, 1.5],

       [6.3, 2.3, 4.4, 1.3],

       [5.6, 3. , 4.1, 1.3],

       [5.5, 2.5, 4. , 1.3],

       [5.5, 2.6, 4.4, 1.2],

       [6.1, 3. , 4.6, 1.4],

       [5.8, 2.6, 4. , 1.2],

       [5. , 2.3, 3.3, 1. ],

       [5.6, 2.7, 4.2, 1.3],

       [5.7, 3. , 4.2, 1.2],

       [5.7, 2.9, 4.2, 1.3],

       [6.2, 2.9, 4.3, 1.3],

       [5.1, 2.5, 3. , 1.1],

       [5.7, 2.8, 4.1, 1.3],

       [6.3, 3.3, 6. , 2.5],

       [5.8, 2.7, 5.1, 1.9],

       [7.1, 3. , 5.9, 2.1],

       [6.3, 2.9, 5.6, 1.8],

       [6.5, 3. , 5.8, 2.2],

       [7.6, 3. , 6.6, 2.1],

       [4.9, 2.5, 4.5, 1.7],

       [7.3, 2.9, 6.3, 1.8],

       [6.7, 2.5, 5.8, 1.8],

       [7.2, 3.6, 6.1, 2.5],

       [6.5, 3.2, 5.1, 2. ],

       [6.4, 2.7, 5.3, 1.9],

       [6.8, 3. , 5.5, 2.1],

       [5.7, 2.5, 5. , 2. ],

       [5.8, 2.8, 5.1, 2.4],

       [6.4, 3.2, 5.3, 2.3],

       [6.5, 3. , 5.5, 1.8],

       [7.7, 3.8, 6.7, 2.2],

       [7.7, 2.6, 6.9, 2.3],

       [6. , 2.2, 5. , 1.5],

       [6.9, 3.2, 5.7, 2.3],

       [5.6, 2.8, 4.9, 2. ],

       [7.7, 2.8, 6.7, 2. ],

       [6.3, 2.7, 4.9, 1.8],

       [6.7, 3.3, 5.7, 2.1],

       [7.2, 3.2, 6. , 1.8],

       [6.2, 2.8, 4.8, 1.8],

       [6.1, 3. , 4.9, 1.8],

       [6.4, 2.8, 5.6, 2.1],

       [7.2, 3. , 5.8, 1.6],

       [7.4, 2.8, 6.1, 1.9],

       [7.9, 3.8, 6.4, 2. ],

       [6.4, 2.8, 5.6, 2.2],

       [6.3, 2.8, 5.1, 1.5],

       [6.1, 2.6, 5.6, 1.4],

       [7.7, 3. , 6.1, 2.3],

       [6.3, 3.4, 5.6, 2.4],

       [6.4, 3.1, 5.5, 1.8],

       [6. , 3. , 4.8, 1.8],

       [6.9, 3.1, 5.4, 2.1],

       [6.7, 3.1, 5.6, 2.4],

       [6.9, 3.1, 5.1, 2.3],

       [5.8, 2.7, 5.1, 1.9],

       [6.8, 3.2, 5.9, 2.3],

       [6.7, 3.3, 5.7, 2.5],

       [6.7, 3. , 5.2, 2.3],

       [6.3, 2.5, 5. , 1.9],

       [6.5, 3. , 5.2, 2. ],

       [6.2, 3.4, 5.4, 2.3],

       [5.9, 3. , 5.1, 1.8]])

In [ ]:

#查看数据形状

iris.data.shape

# 4个特征，150个样本

Out[ ]:

(150, 4)

In [ ]:

# 查看标签数组

iris.target

Out[ ]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [ ]:

# 查看数据的描述

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset

--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)

    :Number of Attributes: 4 numeric, predictive attributes and the class

    :Attribute Information:

        - sepal length in cm

        - sepal width in cm

        - petal length in cm

        - petal width in cm

        - class:

                - Iris-Setosa

                - Iris-Versicolour

                - Iris-Virginica

    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================

                    Min  Max   Mean    SD   Class Correlation

    ============== ==== ==== ======= ===== ====================

    sepal length:   4.3  7.9   5.84   0.83    0.7826

    sepal width:    2.0  4.4   3.05   0.43   -0.4194

    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)

    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None

    :Class Distribution: 33.3% for each of 3 classes.

    :Creator: R.A. Fisher

    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)

    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken

from Fisher's paper. Note that it's the same as in R, but not as in the UCI

Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the

pattern recognition literature.  Fisher's paper is a classic in the field and

is referenced frequently to this day.  (See Duda & Hart, for example.)  The

data set contains 3 classes of 50 instances each, where each class refers to a

type of iris plant.  One class is linearly separable from the other 2; the

latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"

     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to

     Mathematical Statistics" (John Wiley, NY, 1950).

   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.

     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.

   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System

     Structure and Classification Rule for Recognition in Partially Exposed

     Environments".  IEEE Transactions on Pattern Analysis and Machine

     Intelligence, Vol. PAMI-2, No. 1, 67-71.

   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions

     on Information Theory, May 1972, 431-433.

   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II

     conceptual clustering system finds 3 classes in the data.

   - Many, many more ...

In [ ]:

# 查看特征名

iris.feature_names

Out[ ]:

['sepal length (cm)',

 'sepal width (cm)',

 'petal length (cm)',

 'petal width (cm)']

In [ ]:

# 查看目标名

iris.target_names

Out[ ]:

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

回归数据集

糖尿病数据集

In [ ]:

from sklearn.datasets import load_diabetes

#实例化

diabetes = load_diabetes()

print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset

----------------

Ten baseline variables, age, sex, body mass index, average blood

pressure, and six blood serum measurements were obtained for each of n =

442 diabetes patients, as well as the response of interest, a

quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:

      - age     age in years

      - sex

      - bmi     body mass index

      - bp      average blood pressure

      - s1      tc, total serum cholesterol

      - s2      ldl, low-density lipoproteins

      - s3      hdl, high-density lipoproteins

      - s4      tch, total cholesterol / HDL

      - s5      ltg, possibly log of serum triglycerides level

      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:

https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.

(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

In [ ]:

diabetes.data

Out[ ]:

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,

         0.01990749, -0.01764613],

       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,

        -0.06833155, -0.09220405],

       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,

         0.00286131, -0.02593034],

       ...,

       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,

        -0.04688253,  0.01549073],

       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,

         0.04452873, -0.02593034],

       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,

        -0.00422151,  0.00306441]])

In [ ]:

diabetes.target

Out[ ]:

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,

        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,

        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,

        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,

       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,

       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,

       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,

       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,

        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,

        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,

       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,

       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,

       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,

        60., 174., 259., 178., 128.,  96., 126., 288.,  88., 292.,  71.,

       197., 186.,  25.,  84.,  96., 195.,  53., 217., 172., 131., 214.,

        59.,  70., 220., 268., 152.,  47.,  74., 295., 101., 151., 127.,

       237., 225.,  81., 151., 107.,  64., 138., 185., 265., 101., 137.,

       143., 141.,  79., 292., 178.,  91., 116.,  86., 122.,  72., 129.,

       142.,  90., 158.,  39., 196., 222., 277.,  99., 196., 202., 155.,

        77., 191.,  70.,  73.,  49.,  65., 263., 248., 296., 214., 185.,

        78.,  93., 252., 150.,  77., 208.,  77., 108., 160.,  53., 220.,

       154., 259.,  90., 246., 124.,  67.,  72., 257., 262., 275., 177.,

        71.,  47., 187., 125.,  78.,  51., 258., 215., 303., 243.,  91.,

       150., 310., 153., 346.,  63.,  89.,  50.,  39., 103., 308., 116.,

       145.,  74.,  45., 115., 264.,  87., 202., 127., 182., 241.,  66.,

        94., 283.,  64., 102., 200., 265.,  94., 230., 181., 156., 233.,

        60., 219.,  80.,  68., 332., 248.,  84., 200.,  55.,  85.,  89.,

        31., 129.,  83., 275.,  65., 198., 236., 253., 124.,  44., 172.,

       114., 142., 109., 180., 144., 163., 147.,  97., 220., 190., 109.,

       191., 122., 230., 242., 248., 249., 192., 131., 237.,  78., 135.,

       244., 199., 270., 164.,  72.,  96., 306.,  91., 214.,  95., 216.,

       263., 178., 113., 200., 139., 139.,  88., 148.,  88., 243.,  71.,

        77., 109., 272.,  60.,  54., 221.,  90., 311., 281., 182., 321.,

        58., 262., 206., 233., 242., 123., 167.,  63., 197.,  71., 168.,

       140., 217., 121., 235., 245.,  40.,  52., 104., 132.,  88.,  69.,

       219.,  72., 201., 110.,  51., 277.,  63., 118.,  69., 273., 258.,

        43., 198., 242., 232., 175.,  93., 168., 275., 293., 281.,  72.,

       140., 189., 181., 209., 136., 261., 113., 131., 174., 257.,  55.,

        84.,  42., 146., 212., 233.,  91., 111., 152., 120.,  67., 310.,

        94., 183.,  66., 173.,  72.,  49.,  64.,  48., 178., 104., 132.,

       220.,  57.])

In [ ]:

print("糖尿病数据集的特征名：{0}\n糖尿病数据集的目标值文件名：{1}".format(diabetes.feature_names, diabetes.target_filename))

糖尿病数据集的特征名：['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

糖尿病数据集的目标值文件名：diabetes_target.csv.gz

数据集分割（训练集与测试集）

In [ ]:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.25)

Out[ ]:

(111,)

机器学习基础04DAY的更多相关文章

Coursera 机器学习课程机器学习基础：案例研究证书
完成了课程1 机器学习基础:案例研究贴个证书,继续努力完成后续的课程:
Coursera台大机器学习基础课程1
Coursera台大机器学习基础课程学习笔记 -- 1 最近在跟台大的这个课程,觉得不错,想把学习笔记发出来跟大家分享下,有错误希望大家指正. 一机器学习是什么? 感觉和 Tom M. Mitche ...
机器学习 —— 基础整理（六）线性判别函数：感知器、松弛算法、Ho-Kashyap算法
这篇总结继续复习分类问题.本文简单整理了以下内容: (一)线性判别函数与广义线性判别函数 (二)感知器 (三)松弛算法 (四)Ho-Kashyap算法闲话:本篇是本系列［机器学习基础整理］在time ...
算法工程师<机器学习基础>
<机器学习基础> 逻辑回归,SVM,决策树 1.逻辑回归和SVM的区别是什么?各适用于解决什么问题? https://www.zhihu.com/question/24904422 2.L ...
数据分析之Matplotlib和机器学习基础
一.Matplotlib基础知识 Matplotlib 是一个 Python 的 2D绘图库,它以各种硬拷贝格式和跨平台的交互式环境生成出版质量级别的图形. 通过 Matplotlib,开发者可以仅需 ...
【dlbook】机器学习基础
[机器学习基础] 模型的 vc dimension 如何衡量? 如何根据网络结构衡量模型容量?有效容量和模型容量之间的关系? 统计学习理论中边界不用于深度学习之中,原因? 1.边界通常比较松, 2.深 ...
Python机器学习基础教程-第2章-监督学习之决策树集成
前言本系列教程基本就是摘抄<Python机器学习基础教程>中的例子内容. 为了便于跟踪和学习,本系列教程在Github上提供了jupyter notebook 版本: Github仓库: ...
Python机器学习基础教程-第2章-监督学习之决策树
前言本系列教程基本就是摘抄<Python机器学习基础教程>中的例子内容. 为了便于跟踪和学习,本系列教程在Github上提供了jupyter notebook 版本: Github仓库: ...
Python机器学习基础教程-第2章-监督学习之线性模型
前言本系列教程基本就是摘抄<Python机器学习基础教程>中的例子内容. 为了便于跟踪和学习,本系列教程在Github上提供了jupyter notebook 版本: Github仓库: ...
Python机器学习基础教程-第2章-监督学习之K近邻
前言本系列教程基本就是摘抄<Python机器学习基础教程>中的例子内容. 为了便于跟踪和学习,本系列教程在Github上提供了jupyter notebook 版本: Github仓库: ...

随机推荐

开发Unity3D移动端输入插件 UGUI Touch Input Component
UGUI Touch Input Component 为了在移动设备上操控角色,本人便开发了UGUI Touch Input Component输入类插件. 特点本插件中总共包含三种组件:the v ...
【python】绘图，画虚线
linestyle='--' plot画线时候加linestyle='--'. 参考:python 画图-标注点,画虚线_GXLiu-CSDN博客_python画虚线
gets,fgets,getchar,fgetc
以上四个函数都是读取外部输入的函数.可以使stdin,也可以是文件.以下都是在C语言中的应用关于gets和fgets都能够读取一行,一行结束的标志是"回车".都有弊端gets(s ...
【已解决】Jenkins构建成功但发送邮件失败，报错“Not sending mail to unregistered user xxx@xxx.com because your SCM claimed this was associated with a user ID ‘xxx which your security realm does not recognize; ”
问题描述:构建成后,但发送邮件失败,具体报错截图如下: 原因:用户在jenkins中名称与发送邮件汇总设置不一样且没有勾选"Allow sending to unregistered use ...
AI基本知识
一.什么是flops 对flops有疑惑,首先得先捋清这个概念: FLOPS:注意全大写,是floating point operations per second的缩写,意指每秒浮点运算次数,理解为 ...
爱心代码_HTML
直接上效果 <!doctype html> <html> <head> <meta charset="utf-8"> <tit ...
使用idea2021.1.3新建一个Web项目教程
使用idea2021.1.3新建一个Web项目教程文章目录一.新建项目二.在WEB-INF下创建classes,lib文件夹三.配置WEB容器(tomcat Server) 一.新建项目点击 ...
井字棋判断输赢C
#include <stdio.h> int main(){ char a[3][3]; for (int i = 0; i < 3; ++i) { for (int j = 0; ...
StringIO 和 BytesIO
StringIO 要把 str 字符串写入内存中,我们需要创建一个 StringIO 对象,然后像文件一样对读取内容.其中 StringIO 中多了一个 getvalue() 方法,目的是用于获取写入 ...
Delphi 关于RichEdit URL 颜色相关总结
一.代码改变字体大小和颜色 1 procedure TForm1.Button1Click(Sender: TObject); 2 var 3 sNickName, sstr: string; 4 b ...

机器学习基础04DAY

scikit-learn数据集

sklearn.datasets

获取小数据集

获取大数据集

获取本地生成数据

数据的分类与划分

数据集返回的类型

分类数据集

鸢尾花数据集

回归数据集

糖尿病数据集

数据集分割（训练集与测试集）

机器学习基础04DAY的更多相关文章

随机推荐

热门专题