样本示意,为kdd99数据源:

0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,domain_u,SF,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0.00,0.00,0.00,0.00,0.50,1.00,0.00,10,3,0.30,0.30,0.30,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,4,0.00,0.00,0.00,0.00,1.00,0.00,0.00,71,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,230,260,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,19,0.00,0.00,0.00,0.00,1.00,0.00,0.11,3,255,1.00,0.00,0.33,0.07,0.33,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,252,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
1,tcp,smtp,SF,3170,329,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,54,39,0.72,0.11,0.02,0.00,0.02,0.00,0.09,0.13,normal.
0,tcp,http,SF,297,13787,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,177,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,291,3542,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,12,12,0.00,0.00,0.00,0.00,1.00,0.00,0.00,187,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,295,753,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,21,22,0.00,0.00,0.00,0.00,1.00,0.00,0.09,196,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,268,9235,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,5,5,0.00,0.00,0.00,0.00,1.00,0.00,0.00,58,255,1.00,0.00,0.02,0.05,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,227,8841,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,13,13,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,222,19564,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,22,23,0.00,0.00,0.00,0.00,1.00,0.00,0.09,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,740,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,77,33,0.34,0.08,0.34,0.06,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,35195,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,10,0.00,0.00,0.00,0.00,1.00,0.00,0.00,92,44,0.43,0.07,0.43,0.05,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,8325,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,20,0.00,0.00,0.00,0.00,1.00,0.00,0.00,103,54,0.49,0.06,0.49,0.04,0.00,0.00,0.00,0.00,normal.

代码:

# -*- coding:utf-8 -*-

import re
import matplotlib.pyplot as plt
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn import cross_validation
import os
from sklearn.datasets import load_iris
from sklearn import tree
import pydotplus
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
from sklearn_pandas import DataFrameMapper def label(x):
if x == "normal.":
return 0
else:
return 1 if __name__ == '__main__':
data = pd.read_csv('../data/kddcup99/corrected', sep=",", header=None)
print data.columns
print data.iloc[0,0], data.iloc[0,1]
print len(data)
col_cnt = len(data.columns) normal = data.loc[data.loc[:, col_cnt-1] == "normal.", :]
print "normal len:", len(normal)
guess = data.loc[data.loc[:, col_cnt-1] == "guess_passwd.", :]
print "normal len:", len(guess) data = pd.concat([normal, guess])
print len(data) le = preprocessing.LabelEncoder()
for i in range(col_cnt-1):
if isinstance(data.iloc[0,i], str):
print "tranform string column only:", i
data.loc[:,i] = le.fit_transform(data.loc[:,i])
data.loc[:,col_cnt-1] = data.loc[:,col_cnt-1].apply(label)
print data.iloc[0,0], data.iloc[0,1]
x = data.iloc[:, range(col_cnt-1)]
#x = data.iloc[:, [0,4,5,6,7,8,22,23,24,25,26,27,28,29,30]]
y = data.iloc[:, col_cnt-1]
  
''' also OK
    data = data.as_matrix()
    x = data[:, range(col_cnt-1)]
    y = data[:, col_cnt-1]
'''
print "x=>"
print x.iloc[0:3, :]
print "y=>"
print y[-3:]
#v=load_kdd99("../data/kddcup99/corrected")
#x,y=get_guess_passwdandNormal(v)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
print clf print cross_validation.cross_val_score(clf, x, y, n_jobs=-1, cv=10) clf = clf.fit(x, y)
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("../photo/6/iris-dt.pdf")

结果:

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41],
dtype='int64')
0 udp
311029
normal len: 60593
normal len: 4367
64960
tranform string column only: 1
tranform string column only: 2
tranform string column only: 3
0 2
x=>
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 \
0 0 2 15 7 105 146 0 0 0 0 ... 255 254 1.0 0.01 0.0
1 0 2 15 7 105 146 0 0 0 0 ... 255 254 1.0 0.01 0.0
2 0 2 15 7 105 146 0 0 0 0 ... 255 254 1.0 0.01 0.0 36 37 38 39 40
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 [3 rows x 41 columns]
y=>
142098 1
142099 1
142101 1
Name: 41, dtype: int64
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
fg[ 0.9561336 0.99892258 0.99938433 0.99984606 0.99984606 0.99969212
1. 0.99984604 0.99969207 1. ]

pandas dataframe 做机器学习训练数据=》直接使用iloc或者as_matrix即可的更多相关文章

  1. python pandas.DataFrame选取、修改数据最好用.loc,.iloc,.ix

    先手工生出一个数据框吧 import numpy as np import pandas as pd df = pd.DataFrame(np.arange(0,60,2).reshape(10,3) ...

  2. pandas.DataFrame.quantile

    pandas.DataFrame.quantile 用于返回数据中的 处于1/5    1/2(中位数)等数据

  3. 机器学习之数据预处理,Pandas读取excel数据

    Python读写excel的工具库很多,比如最耳熟能详的xlrd.xlwt,xlutils,openpyxl等.其中xlrd和xlwt库通常配合使用,一个用于读,一个用于写excel.xlutils结 ...

  4. 如何通过Elasticsearch Scroll快速取出数据,构造pandas dataframe — Python多进程实现

    首先,python 多线程不能充分利用多核CPU的计算资源(只能共用一个CPU),所以得用多进程.笔者从3.7亿数据的索引,取200多万的数据,从取数据到构造pandas dataframe总共大概用 ...

  5. Pandas DataFrame数据的增、删、改、查

    Pandas DataFrame数据的增.删.改.查 https://blog.csdn.net/zhangchuang601/article/details/79583551 #删除列 df_2 = ...

  6. Pandas DataFrame 数据选取和过滤

    This would allow chaining operations like: pd.read_csv('imdb.txt') .sort(columns='year') .filter(lam ...

  7. pandas.DataFrame——pd数据框的简单认识、存csv文件

    接着前天的豆瓣书单信息爬取,这一篇文章看一下利用pandas完成对数据的存储. 回想一下我们当时在最后得到了六个列表:img_urls, titles, ratings, authors, detai ...

  8. pandas中DataFrame和Series的数据去重

    在SQL语言中去重是一件相当简单的事情,面对一个表(也可以称之为DataFrame)我们对数据进行去重只需要GROUP BY 就好. select custId,applyNo from tmp.on ...

  9. 用PyQt5来即时显示pandas Dataframe的数据,附qdarkstyle黑夜主题样式(美美哒的黑夜主题)

    import sys from qdarkstyle import load_stylesheet_pyqt5 from PyQt5.QtWidgets import QApplication, QT ...

随机推荐

  1. APP热更新方案(转)

    本文转载自[http://creator.cnblogs.com/] 博客地址:Zealot Yin 为什么要做热更新 当一个App发布之后,突然发现了一个严重bug需要进行紧急修复,这时候公司各方就 ...

  2. codeforces Looksery Cup 2015 H Degenerate Matrix 二分 注意浮点数陷阱

    #include <cstdio> #include <cstring> #include <algorithm> #include <string> ...

  3. 关于Tomcat的启动

    1.Tomcat分为安装版和解压版. 2.在Tomcat的解压版的bin路径下启动startup.bat的时候,如果没有启动成功,请检查是否设置了JAVA_HOME 3.建议不要在环境变量里面设置CA ...

  4. String转换成int型

    private boolean judge(String str){ int year = 0; try{ year = Integer.valueOf(str).intValue(); }catch ...

  5. C#--DataGridView添加DateTimePicker时间控件

    using System; using System.Collections.Generic; using System.ComponentModel; using System.Data; usin ...

  6. 解码URLDecode和编码URLEnCode

    在前台往后台传递参数的时候,在前台进行编码,在后台接收参数的时候,用Decode进行解码: 如果url中包含特殊字符如:&.html标签 <tr><td>等导致url无 ...

  7. IE6 css fixed

    .fixed-top{position:fixed;bottom:auto;top:0px;} .fixed-bottom{position:fixed;bottom:0px;top:auto;} . ...

  8. input[type="file"]的图片预览

    在项目中遇到用input标签file类型的文件上传,想实在上传之前进行图片的预览功能:之前的做的一个解决方案是文件先上传上去然后返回地址再显示在页面上,这样就不太好,因为用户基本信息可能并没有保存,但 ...

  9. js闭包详解-转自好友trigkit4

    闭包(closure)是Javascript语言的一个难点,也是它的特色,很多高级应用都要依靠闭包实现. 闭包的特性 闭包有三个特性: 1.函数嵌套函数 2.函数内部可以引用外部的参数和变量 3.参数 ...

  10. Shiro结合Spring boot开发权限管理系统

    前一篇文章说了,我从开始工作就想有一个属于自己的博客系统,当然了,我想的是多用户的博客,大家都可以发文章记笔记,我最初的想法就是这样. 博客系统搭建需要使用的技术: 1.基于Spring boot 2 ...