[Repost] Importing GoogleClusterData into MySQL

Original article:

https://www.cnblogs.com/instant7/p/4159022.html

---------------------------------------------------------------------------------------------

This post records how to import the job_events and task_events tables of google-cluster-data-2011-1-2 into MySQL.

1. Download the data

download_job_events:

import urllib2

url = 'https://commondatastorage.googleapis.com/clusterdata-2011-2/'

f = open('C:\\SHA256SUM')

l = f.readlines()

f.close()

for i in l:

    if i.count('job_events')>0:

        fileAddr = i.split()[1][1:]

        fileName = fileAddr.split('/')[1]

        print 'downloading', fileName

        data = urllib2.urlopen(url+fileAddr).read()

        print 'saving', fileName

        fileDown = open('C:\\job_events\\'+fileName, 'wb')

        fileDown.write(data)

        fileDown.close()

(PS: the code above targets Python 2.7. Since Python 3 is now the norm, here is a Python 3 version:

#encoding:UTF-8

from urllib import request

url = 'https://commondatastorage.googleapis.com/clusterdata-2011-2/'

f = open('C:\\SHA256SUM')

l = f.readlines()

f.close()

for i in l:

    if i.count('job_events')>0:

        fileAddr = i.split()[1][1:]

        fileName = fileAddr.split('/')[1]

        print('downloading', fileName)

        data = request.urlopen(url+fileAddr).read()

        print('saving', fileName)

        fileDown = open('C:\\job_events\\'+fileName, 'wb')

        fileDown.write(data)

        fileDown.close()

)

download_task_events:

import urllib2

url = 'https://commondatastorage.googleapis.com/clusterdata-2011-2/'

f = open('C:\\SHA256SUM')

l = f.readlines()

f.close()

for i in l:

    if i.count('task_events')>0:

        fileAddr = i.split()[1][1:]

        fileName = fileAddr.split('/')[1]

        print 'downloading', fileName

        data = urllib2.urlopen(url+fileAddr).read()

        print 'saving', fileName

        fileDown = open('C:\\task_events\\'+fileName, 'wb')

        fileDown.write(data)

        fileDown.close()

(PS: the code above targets Python 2.7. Since Python 3 is now the norm, here is a Python 3 version:

#encoding:UTF-8

from urllib import request

url = 'https://commondatastorage.googleapis.com/clusterdata-2011-2/'

f = open('C:\\SHA256SUM')

l = f.readlines()

f.close()

for i in l:

    if i.count('task_events')>0:

        fileAddr = i.split()[1][1:]

        fileName = fileAddr.split('/')[1]

        print('downloading', fileName)

        data = request.urlopen(url+fileAddr).read()

        print('saving', fileName)

        fileDown = open('C:\\task_events\\'+fileName, 'wb')

        fileDown.write(data)

        fileDown.close()

)

Note: the dataset used here is clusterdata-2011-2, which differs from the clusterdata-2011-1 used in the earlier post on re-plotting GoogleClusterData.
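Since the download scripts already read the SHA256SUM file, the same file can be used to check that each archive arrived intact. The following is a minimal Python 3 sketch, not part of the original post. It assumes each line of SHA256SUM has the form '<hex digest> *<directory>/<file>' (which is what the split()[1][1:] in the scripts above implies). The names parse_sha256sum, sha256_of_file and find_bad_downloads are hypothetical helpers introduced for illustration.

```python
import hashlib
import os


def parse_sha256sum(lines):
    """Map relative file paths to expected hex digests.

    Assumption: each line looks like '<hex digest> *<dir>/<file>'.
    """
    expected = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, path = parts
        expected[path.lstrip('*')] = digest.lower()
    return expected


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def find_bad_downloads(expected, root, table):
    """Return the paths of one table whose local copy is missing or corrupt."""
    bad = []
    for rel_path, digest in expected.items():
        if not rel_path.startswith(table + '/'):
            continue  # only check the table that was actually downloaded
        local = os.path.join(root, *rel_path.split('/'))
        if not os.path.exists(local) or sha256_of_file(local) != digest:
            bad.append(rel_path)
    return bad
```

With the directory layout used above, calling find_bad_downloads(parse_sha256sum(open('C:\\SHA256SUM')), 'C:\\', 'job_events') should return the files that need to be downloaded again. An empty list means every archive matched its checksum.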

2. Decompress

The compressed archives cannot be loaded into MySQL directly, so decompress them first.

unzip_job_events:

import gzip

import os

fileNames = os.listdir('C:\\job_events')

for l in fileNames:

    print 'now at: '+ l

    f = gzip.open('C:\\job_events\\'+l)

    fOut = open('C:\\job_events_unzip\\'+l[:-3], 'wb')

    content = f.read()

    fOut.write(content)

    f.close()

    fOut.close()

    #raw_input()

(

Python 3 version:

import gzip

import os

fileNames = os.listdir('C:\\job_events')

for l in fileNames:

    print( 'now at: '+ l )

    f = gzip.open('C:\\job_events\\'+l)

    fOut = open('C:\\job_events_unzip\\'+l[:-3], 'wb')

    content = f.read()

    fOut.write(content)

    f.close()

    fOut.close()

    #raw_input()

)

unzip_task_events:

import gzip

import os

fileNames = os.listdir('C:\\task_events')

for l in fileNames:

    print 'now at: '+ l

    f = gzip.open('C:\\task_events\\'+l)

    fOut = open('C:\\task_events_unzip\\'+l[:-3], 'wb')

    content = f.read()

    fOut.write(content)

    f.close()

    fOut.close()

(

Python 3 version:

import gzip

import os

fileNames = os.listdir('C:\\task_events')

for l in fileNames:

    print( 'now at: '+ l )

    f = gzip.open('C:\\task_events\\'+l)

    fOut = open('C:\\task_events_unzip\\'+l[:-3], 'wb')

    content = f.read()

    fOut.write(content)

    f.close()

    fOut.close()

    #raw_input()

)
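The scripts above read each archive fully into memory before writing it out. For the larger task_events files, a streaming copy may be gentler on memory. Below is a minimal Python 3 sketch of that idea, not the original author's code. gunzip_dir is a hypothetical helper name, and the directory arguments are assumed to match the folders used above.

```python
import gzip
import os
import shutil


def gunzip_dir(src_dir, dst_dir):
    """Decompress every .gz file in src_dir into dst_dir, streaming in chunks."""
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith('.gz'):
            continue  # ignore anything that is not a gzip archive
        out_path = os.path.join(dst_dir, name[:-3])  # drop the '.gz' suffix
        with gzip.open(os.path.join(src_dir, name), 'rb') as f_in, \
                open(out_path, 'wb') as f_out:
            # copyfileobj moves the data in fixed-size chunks
            shutil.copyfileobj(f_in, f_out)
        written.append(out_path)
    return written
```

For example, gunzip_dir('C:\\task_events', 'C:\\task_events_unzip') would cover the task_events step. Writing in binary mode keeps the '\n' line endings unchanged, which matters later for "lines terminated by '\n'".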

3. Create the tables

create_job_events:

create table job_events(

time bigint,

missing_info int,

job_id bigint,

event_type int,

user text,

scheduling_class int,

job_name text,

logical_job_name text)

engine = myisam;

create_task_events:

create table task_events(

time bigint,

missing_info int,

job_id bigint,

task_index bigint,

machine_id bigint,

event_type int,

user text,

scheduling_class int,

priority int,

cpu_request float,

memory_request float,

disk_space_request float,

difference_machine_restriction boolean

)engine = myisam;

Note: because the data volume is very large, be sure to choose MyISAM as the engine here.

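4. Import the data

Some fields in the data are empty, so MySQL must first be configured to accept empty values.

To do so, enter the following in the MySQL console: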

SET @@GLOBAL.sql_mode="NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION";

After that, the import can begin.
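As an alternative to relaxing sql_mode globally, LOAD DATA can read each field into a user variable and store empty strings as NULL with NULLIF. Note also that NO_AUTO_CREATE_USER was removed in MySQL 8.0, so newer servers may reject the statement above. The sketch below only builds the SQL text. build_load_statement is a hypothetical helper, the column list is assumed to match the table definitions in section 3, and the path is assumed to be trusted and free of quote characters. I have not run this against the full dataset.

```python
def build_load_statement(csv_path, table, columns):
    """Build a LOAD DATA INFILE statement that stores empty fields as NULL.

    Assumption: csv_path and the identifiers come from trusted code,
    because they are interpolated into the SQL text directly.
    """
    # Read every CSV field into a user variable such as @job_id ...
    user_vars = ', '.join('@' + c for c in columns)
    # ... then assign it to the column, turning '' into NULL.
    assignments = ', '.join(
        "`{0}` = NULLIF(@{0}, '')".format(c) for c in columns
    )
    return (
        "LOAD DATA INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
        "({vars}) SET {sets}"
    ).format(path=csv_path, table=table, vars=user_vars, sets=assignments)
```

The returned string can be passed to cursor.execute() in place of the plain "load data infile" statement used in the scripts of this section.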

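Warning: the code below has a serious problem when importing values such as 2.3e-10. The imported value becomes a negative number in MySQL, and one with a large absolute value!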
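The post does not explain the cause of this problem. One possible workaround, which I have not verified against MySQL, is to rewrite scientific-notation values in plain decimal form before loading. The sketch below does this for a single CSV file. normalize_field and normalize_csv are hypothetical helper names, and the regular expression is an assumption about what the affected values look like.

```python
import csv
import re
from decimal import Decimal

# Matches values such as 2.3e-10 or 1E+05, but not plain decimals or hashes.
SCI_NOTATION = re.compile(r'^[+-]?\d+(\.\d+)?[eE][+-]?\d+$')


def normalize_field(value):
    """Rewrite a scientific-notation number in fixed-point form."""
    if SCI_NOTATION.match(value):
        # Decimal keeps the exact digits; 'f' formatting drops the exponent.
        return format(Decimal(value), 'f')
    return value


def normalize_csv(src_path, dst_path):
    """Copy a CSV file, normalizing every field; return how many changed."""
    changed = 0
    with open(src_path, newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst, lineterminator='\n')
        for row in csv.reader(src):
            fixed = [normalize_field(v) for v in row]
            changed += sum(a != b for a, b in zip(row, fixed))
            writer.writerow(fixed)
    return changed
```

The return value of normalize_csv also gives a quick way to see how many fields in a file were written in scientific notation, which helps when spot-checking the imported rows.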

loadJobEvents2MySQL.py

import os

import MySQLdb

fileNames = os.listdir('C:\\job_events_unzip')

conn=MySQLdb.connect(host="localhost",user="root",passwd="",db="googleclusterdata",charset="utf8")

cursor = conn.cursor()

cursor.execute('truncate job_events')

for f in fileNames:

    print 'now at: '+ f

    order = "load data infile 'C:/job_events_unzip/%s' into table job_events fields terminated by ',' lines terminated by '\n'" %f

    print order

    cursor.execute(order)

    conn.commit()

loadTaskEvents2MySQL.py

import os

import MySQLdb

fileNames = os.listdir('C:\\task_events_unzip')

conn=MySQLdb.connect(host="localhost",user="root",passwd="",db="googleclusterdata",charset="utf8")

cursor = conn.cursor()

cursor.execute('truncate task_events')

for f in fileNames:

    print 'now at: '+ f

    order = "load data infile 'C:/task_events_unzip/%s' into table task_events fields terminated by ',' lines terminated by '\n'" %f

    print order

    cursor.execute(order)

    conn.commit()

Note: change the password and the database name (db) to match your own setup.
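The two loader scripts are Python 2 only and differ just in the table and directory names. A Python 3 sketch that covers both cases might look like the following. load_directory is a hypothetical helper, and conn is assumed to be any DB-API connection, for example one returned by MySQLdb.connect from the mysqlclient package. The SQL is the same plain "load data infile" statement as above, so the warning about values like 2.3e-10 still applies.

```python
import os


def load_directory(conn, csv_dir, table):
    """Truncate the table, then LOAD DATA INFILE every file in csv_dir."""
    cursor = conn.cursor()
    cursor.execute('TRUNCATE {}'.format(table))
    loaded = []
    for name in sorted(os.listdir(csv_dir)):
        # MySQL accepts forward slashes in Windows paths.
        path = os.path.join(csv_dir, name).replace('\\', '/')
        statement = (
            "LOAD DATA INFILE '{}' INTO TABLE {} "
            "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
        ).format(path, table)
        print('now at:', name)
        cursor.execute(statement)
        conn.commit()  # commit per file, as the original scripts do
        loaded.append(name)
    return loaded
```

With a connection opened as in the scripts above, load_directory(conn, 'C:\\job_events_unzip', 'job_events') and load_directory(conn, 'C:\\task_events_unzip', 'task_events') would replace the two separate files.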

---------------------------------------------------------------------------------------------
