pandas dataframe 做机器学习训练数据=》直接使用iloc或者as_matrix即可
样本示意,为kdd99数据源:
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,domain_u,SF,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0.00,0.00,0.00,0.00,0.50,1.00,0.00,10,3,0.30,0.30,0.30,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,4,0.00,0.00,0.00,0.00,1.00,0.00,0.00,71,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,230,260,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,19,0.00,0.00,0.00,0.00,1.00,0.00,0.11,3,255,1.00,0.00,0.33,0.07,0.33,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,252,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
1,tcp,smtp,SF,3170,329,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,54,39,0.72,0.11,0.02,0.00,0.02,0.00,0.09,0.13,normal.
0,tcp,http,SF,297,13787,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,177,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,291,3542,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,12,12,0.00,0.00,0.00,0.00,1.00,0.00,0.00,187,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,295,753,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,21,22,0.00,0.00,0.00,0.00,1.00,0.00,0.09,196,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,268,9235,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,5,5,0.00,0.00,0.00,0.00,1.00,0.00,0.00,58,255,1.00,0.00,0.02,0.05,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,227,8841,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,13,13,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,222,19564,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,22,23,0.00,0.00,0.00,0.00,1.00,0.00,0.09,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,740,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,77,33,0.34,0.08,0.34,0.06,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,35195,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,10,0.00,0.00,0.00,0.00,1.00,0.00,0.00,92,44,0.43,0.07,0.43,0.05,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,8325,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,20,0.00,0.00,0.00,0.00,1.00,0.00,0.00,103,54,0.49,0.06,0.49,0.04,0.00,0.00,0.00,0.00,normal.
代码:
# -*- coding:utf-8 -*- import re
import matplotlib.pyplot as plt
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn import cross_validation
import os
from sklearn.datasets import load_iris
from sklearn import tree
import pydotplus
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
from sklearn_pandas import DataFrameMapper def label(x):
if x == "normal.":
return 0
else:
return 1 if __name__ == '__main__':
data = pd.read_csv('../data/kddcup99/corrected', sep=",", header=None)
print data.columns
print data.iloc[0,0], data.iloc[0,1]
print len(data)
col_cnt = len(data.columns) normal = data.loc[data.loc[:, col_cnt-1] == "normal.", :]
print "normal len:", len(normal)
guess = data.loc[data.loc[:, col_cnt-1] == "guess_passwd.", :]
print "normal len:", len(guess) data = pd.concat([normal, guess])
print len(data) le = preprocessing.LabelEncoder()
for i in range(col_cnt-1):
if isinstance(data.iloc[0,i], str):
print "tranform string column only:", i
data.loc[:,i] = le.fit_transform(data.loc[:,i])
data.loc[:,col_cnt-1] = data.loc[:,col_cnt-1].apply(label)
print data.iloc[0,0], data.iloc[0,1]
x = data.iloc[:, range(col_cnt-1)]
#x = data.iloc[:, [0,4,5,6,7,8,22,23,24,25,26,27,28,29,30]]
y = data.iloc[:, col_cnt-1]
''' also OK
data = data.as_matrix()
x = data[:, range(col_cnt-1)]
y = data[:, col_cnt-1]
'''
print "x=>"
print x.iloc[0:3, :]
print "y=>"
print y[-3:]
#v=load_kdd99("../data/kddcup99/corrected")
#x,y=get_guess_passwdandNormal(v)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
print clf print cross_validation.cross_val_score(clf, x, y, n_jobs=-1, cv=10) clf = clf.fit(x, y)
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("../photo/6/iris-dt.pdf")
结果:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41],
dtype='int64')
0 udp
311029
normal len: 60593
normal len: 4367
64960
tranform string column only: 1
tranform string column only: 2
tranform string column only: 3
0 2
x=>
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 \
0 0 2 15 7 105 146 0 0 0 0 ... 255 254 1.0 0.01 0.0
1 0 2 15 7 105 146 0 0 0 0 ... 255 254 1.0 0.01 0.0
2 0 2 15 7 105 146 0 0 0 0 ... 255 254 1.0 0.01 0.0 36 37 38 39 40
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 [3 rows x 41 columns]
y=>
142098 1
142099 1
142101 1
Name: 41, dtype: int64
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
fg[ 0.9561336 0.99892258 0.99938433 0.99984606 0.99984606 0.99969212
1. 0.99984604 0.99969207 1. ]
pandas dataframe 做机器学习训练数据=》直接使用iloc或者as_matrix即可的更多相关文章
- python pandas.DataFrame选取、修改数据最好用.loc,.iloc,.ix
先手工生出一个数据框吧 import numpy as np import pandas as pd df = pd.DataFrame(np.arange(0,60,2).reshape(10,3) ...
- pandas.DataFrame.quantile
pandas.DataFrame.quantile 用于返回数据中的 处于1/5 1/2(中位数)等数据
- 机器学习之数据预处理,Pandas读取excel数据
Python读写excel的工具库很多,比如最耳熟能详的xlrd.xlwt,xlutils,openpyxl等.其中xlrd和xlwt库通常配合使用,一个用于读,一个用于写excel.xlutils结 ...
- 如何通过Elasticsearch Scroll快速取出数据,构造pandas dataframe — Python多进程实现
首先,python 多线程不能充分利用多核CPU的计算资源(只能共用一个CPU),所以得用多进程.笔者从3.7亿数据的索引,取200多万的数据,从取数据到构造pandas dataframe总共大概用 ...
- Pandas DataFrame数据的增、删、改、查
Pandas DataFrame数据的增.删.改.查 https://blog.csdn.net/zhangchuang601/article/details/79583551 #删除列 df_2 = ...
- Pandas DataFrame 数据选取和过滤
This would allow chaining operations like: pd.read_csv('imdb.txt') .sort(columns='year') .filter(lam ...
- pandas.DataFrame——pd数据框的简单认识、存csv文件
接着前天的豆瓣书单信息爬取,这一篇文章看一下利用pandas完成对数据的存储. 回想一下我们当时在最后得到了六个列表:img_urls, titles, ratings, authors, detai ...
- pandas中DataFrame和Series的数据去重
在SQL语言中去重是一件相当简单的事情,面对一个表(也可以称之为DataFrame)我们对数据进行去重只需要GROUP BY 就好. select custId,applyNo from tmp.on ...
- 用PyQt5来即时显示pandas Dataframe的数据,附qdarkstyle黑夜主题样式(美美哒的黑夜主题)
import sys from qdarkstyle import load_stylesheet_pyqt5 from PyQt5.QtWidgets import QApplication, QT ...
随机推荐
- cogs 2752. [济南集训 2017] 数列运算
2752. [济南集训 2017] 数列运算 ★★☆ 输入文件:sequenceQBXT.in 输出文件:sequenceQBXT.out 简单对比时间限制:1 s 内存限制:512 ...
- 极路由设置共享磁盘密码、跨网访问samba服务
极路由插上移动硬盘后会自动建立samba服务器,但我们没法去配置哪些盘符需要密码,这样只要在同一个wifi下的电脑都能去访问这些东西了,比较弱智.另外我还想再公司中去读写这个移动硬盘. 设置密码 首先 ...
- 接口測试-HAR
參考文章 雪球的 HttpApi 接口測试框架设计 HAR(HTTP Archive)规范 神器--Chrome开发人员工具(一) HAR是什么 一句话:关于HTTP所有的信息的一种文件保存格式 HA ...
- atitit。企业组织与软件project的策略 战略 趋势 原则 attilax 大总结
atitit. 企业组织与软件project的策略 战略 趋势 原则 attilax 大总结 1. 战略规划,适当的过度设计 1 2. 跨平台化 1 3. 可扩展性高于一切 1 4. 界面html5化 ...
- linux中sed的使用方法具体解释(对行数据的加入、删除等)
sed使用语法 [root@fwq test]# sed --help 使用方法: sed [选项]... {脚本(假设没有其它脚本)} [输入文件]... -n, --quiet, --silent ...
- bzoj4519: [Cqoi2016]不同的最小割(分治最小割)
4519: [Cqoi2016]不同的最小割 题目:传送门 题解: 同BZOJ 2229 基本一样的题目啊,就最后用set记录一下就ok 代码: #include<cstdio> #inc ...
- C# 程序集Assembly
原谅我到目前为止一直肤浅的认为程序集就是dll,这种想法是错误的. 今天就系统的学习记录一下“程序集”的概念.原文链接https://www.cnblogs.com/czx1/p/2014131370 ...
- jquery简介 each遍历 prop attr
一.JQ简介 jQuery是一个快速.简洁的JavaScript框架,它封装了JavaScript常用的功能代码,提供一种简便的JavaScript设计模式,优化HTML文档操作.事件处理.动画设计和 ...
- (转载)Mac下使用Android Studio 获取 SHA1和MD5
Mac下使用Android Studio 获取 SHA1和MD5 2015-08-10 15:38 1776人阅读 评论(1) 收藏 举报 分类: Android(14) 版权声明:本文为博主原创 ...
- .NET深入解析LINQ框架1
1.LINQ简述 2.LINQ优雅前奏的音符 2.1.隐式类型 (由编辑器自动根据表达式推断出对象的最终类型) 2.2.对象初始化器 (简化了对象的创建及初始化的过程) 2.3.Lambda表达式 ( ...