
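I have recently been reading A Programmer's Guide to Data Mining (《写给程序员的数据挖掘指南》) to study recommendation algorithms. The test dataset used in the book is the Book-Crossing Dataset: real book ratings collected from users of the BookCrossing community (the cover-image URLs in the data point to Amazon). I recommend the book. It is well written, gets you started with recommendation algorithms quickly, and the material can even be applied to your own projects.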

The Book-Crossing Dataset is available in two formats, CSV and SQL dump. The problems are:

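If you open the CSV files in UltraEdit (UE), the text is garbled, and no encoding conversion fixes it, apparently because the data was persisted by a program and then exported again. You will also find HTML markup in the files, and much of the information, such as user names and book titles, is in German (the domain name gives that away).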

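The author does provide Python code to load the test dataset, but it cannot be used to import the data into a MySQL database. It simply splits each line on semicolons (admittedly, the recommendation algorithms do not need every field), yet the dataset contains characters such as "ऩ" or "\"", so lines split this way cannot be imported into MySQL.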
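
To illustrate why a plain split on semicolons breaks, here is a minimal sketch that compares it with extracting the quoted fields by regular expression. The sample line and the names `naive_fields` and `quoted_fields` are made up for illustration, and the `&amp;` entity is an assumed example of a field that itself contains a semicolon. The pattern also tolerates backslash-escaped quotes, which is an extension chosen for this sketch and goes beyond what the post's own code handles.

```python
import re

# Hypothetical sample line in the style of BX-Books.csv; the "&amp;" entity
# is an assumed example of a field value that itself contains a semicolon.
SAMPLE = '"0001";"Cats &amp; Dogs; A Story";"Some Author"'

# Matches one double-quoted field; a backslash-escaped character (such as \")
# is allowed inside the quotes.
QUOTED_FIELD = re.compile(r'"((?:[^"\\]|\\.)*)"')


def naive_fields(line):
    """Split on every semicolon, including those inside quoted fields."""
    return [part.strip('"') for part in line.split(';')]


def quoted_fields(line):
    """Return the contents of each double-quoted field."""
    return QUOTED_FIELD.findall(line)


print(len(naive_fields(SAMPLE)))  # more pieces than the line has fields
print(quoted_fields(SAMPLE))      # one entry per quoted field
```

The naive split cuts the title into several pieces, while the regular expression keeps each quoted field whole.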

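You may ask: if the author never imports the data into a database, why should you? The author's recommendation algorithms use an in-memory model, meaning all the data is loaded into memory at once. Before that can happen, though, the data still has to be persisted somewhere.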
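
As a rough sketch of that in-memory model, rows read back from the database could be folded into a nested dictionary keyed by user and then by ISBN. The function name `build_ratings` and the sample rows are hypothetical. In practice the rows would come from a query against the ratings table.

```python
from collections import defaultdict


def build_ratings(rows):
    """Fold (userid, isbn, rating) rows into {userid: {isbn: rating}}."""
    ratings = defaultdict(dict)
    for userid, isbn, rating in rows:
        ratings[userid][isbn] = int(rating)
    return dict(ratings)


# Made-up rows standing in for the result of a SELECT on the ratings table.
rows = [
    ("user-1", "isbn-a", 0),
    ("user-2", "isbn-a", 5),
    ("user-2", "isbn-b", 8),
]
ratings = build_ratings(rows)
print(ratings["user-2"])
```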

So the only option is to adapt the author's Python code.

Github Demo

Modified test dataset

Python

# -*- coding: utf-8 -*-

import codecs
import re
from collections import OrderedDict

import mysql.connector

try:
    import ConfigParser  # Python 2
except ImportError:
    import configparser as ConfigParser  # Python 3

 

class MysqlPythonFacotry(object):
    """
        Python Class for connecting with MySQL server.
    """

    __instance = None
    __host = None
    __user = None
    __password = None
    __database = None
    __session = None
    __connection = None

    def __init__(self, host='localhost', user='root', password='', database=''):
        self.__host = host
        self.__user = user
        self.__password = password
        self.__database = database
    ## End def __init__

    def open(self):
        try:
            cnx = mysql.connector.connect(host=self.__host,
                                          user=self.__user,
                                          password=self.__password,
                                          database=self.__database)
            self.__connection = cnx
            self.__session = cnx.cursor()
        except mysql.connector.Error as e:
            print('connect fails!{}'.format(e))
    ## End def open

    def close(self):
        # open() may have failed, so only close what actually exists
        if self.__session is not None:
            self.__session.close()
        if self.__connection is not None:
            self.__connection.close()
    ## End def close

    def select(self, table, where=None, *args, **kwargs):
        # args are the column names (all columns if none are given);
        # kwargs values fill the %s placeholders in `where`
        columns = ",".join("`" + key + "`" for key in args) if args else "*"
        query = "SELECT %s FROM %s" % (columns, table)
        if where:
            query += " WHERE %s" % where
        ## End if where

        self.__session.execute(query, tuple(kwargs.values()))
        return self.__session.fetchall()
    ## End def select

    def update(self, table, where=None, *args, **kwargs):
        # kwargs are the column=value pairs to set;
        # args fill the %s placeholders in `where`
        update_rows = 0
        try:
            query = "UPDATE %s SET " % table
            query += ",".join("`" + key + "` = %s" for key in kwargs)
            if where:
                query += " WHERE %s" % where
            values = tuple(kwargs.values()) + tuple(args)

            self.__session.execute(query, values)
            self.__connection.commit()

            # Obtain rows affected
            update_rows = self.__session.rowcount
        except mysql.connector.Error as e:
            print(e)

        return update_rows
    ## End function update

    def insert(self, table, *args, **kwargs):
        values = None
        query = "INSERT INTO %s " % table
        if kwargs:
            # named columns: INSERT INTO t (`a`,`b`) VALUES (%s,%s)
            values = tuple(kwargs.values())
            columns = ",".join("`%s`" % key for key in kwargs)
            placeholders = ",".join(["%s"] * len(values))
            query += "(%s) VALUES (%s)" % (columns, placeholders)
        elif args:
            # positional values: INSERT INTO t VALUES (%s,%s)
            values = args
            query += " VALUES(" + ",".join(["%s"] * len(values)) + ")"

        self.__session.execute(query, values)
        self.__connection.commit()
        cnt = self.__session.rowcount
        return cnt
    ## End def insert

    def delete(self, table, where=None, *args):
        # args fill the %s placeholders in `where`
        query = "DELETE FROM %s" % table
        if where:
            query += ' WHERE %s' % where

        values = tuple(args)

        self.__session.execute(query, values)
        self.__connection.commit()
        delete_rows = self.__session.rowcount
        return delete_rows
    ## End def delete

    def select_advanced(self, sql, *args):
        # args are (name, value) pairs; the values fill the %s placeholders in order
        od = OrderedDict(args)
        values = tuple(od.values())
        self.__session.execute(sql, values)
        return self.__session.fetchall()
    ## End def select_advanced
## End class

 

 

class ErrorMyProgram(Exception):
    """
        My Exception Error Class
    """
    def __init__(self, value):
        self.value = value
    ## End def __init__

    def __str__(self):
        return repr(self.value)
    ## End def __str__
## End class ErrorMyProgram

    

    

class LoadAppConf(object):
    """
        Load app.conf Config File Class
    """
    __configFileName = "app.conf"

    def __init__(self):
        config = ConfigParser.ConfigParser()
        config.read(self.__configFileName)

        # connection settings come from the [biz_db] section
        self.biz_db_host = config.get("biz_db", "host")
        self.biz_db_user = config.get("biz_db", "user")
        self.biz_db_password = config.get("biz_db", "password")
        self.biz_db_database = config.get("biz_db", "database")
    ## End def __init__
## End class LoadAppConf

        

class Biz_Base(object):
    """
        biz base class
    """
    def __init__(self, db):
        self.db = db
    ## End def __init__
## End class Biz_Base

        

 

class Biz_bx_book_ratings(Biz_Base):
    """
        bx_book_ratings table
    """

    __tableName = "bx_book_ratings"

    def __init__(self, db):
        Biz_Base.__init__(self, db)
    ## End def __init__

    def insert(self, userid, isbn, bookrating):
        cnt = self.db.insert(self.__tableName,
                             userid=userid,
                             isbn=isbn,
                             bookrating=bookrating)
        # True when at least one row was inserted
        return cnt > 0
    ## End def insert
## End class Biz_bx_book_ratings

 

 

class Biz_bx_books(Biz_Base):
    """
        bx_books table
    """

    __tableName = "bx_books"

    def __init__(self, db):
        Biz_Base.__init__(self, db)
    ## End def __init__

    def insert(self, isbn, booktitle, bookauthor, yearofpublication, publisher, imageurls, imageurlm, imageurll):
        cnt = self.db.insert(self.__tableName,
                             isbn=isbn,
                             booktitle=booktitle,
                             bookauthor=bookauthor,
                             yearofpublication=yearofpublication,
                             publisher=publisher,
                             imageurls=imageurls,
                             imageurlm=imageurlm,
                             imageurll=imageurll)
        # True when at least one row was inserted
        return cnt > 0
    ## End def insert
## End class Biz_bx_books

 

class Biz_bx_users(Biz_Base):
    """
        bx_users table
    """

    __tableName = "bx_users"

    def __init__(self, db):
        Biz_Base.__init__(self, db)
    ## End def __init__

    def insert(self, userid, location, age):
        cnt = self.db.insert(self.__tableName,
                             userid=userid,
                             location=location,
                             age=age)
        # True when at least one row was inserted
        return cnt > 0
    ## End def insert
## End class Biz_bx_users

 

def regx(l):
    """
        split line by regex: returns every double-quoted field,
        still wrapped in its quotes; unquoted values are skipped
    """
    p = re.compile(r'"[^"]*"')
    return p.findall(l)
## End def regx

 

class LoadDataset(object):
    """
        Loads the BX CSV files into the MySQL tables
    """

    __loadConf = None

    __users = None
    __books = None
    __book_ratings = None

    __bizDb = None

    def __init__(self):
        self.__loadConf = LoadAppConf()

        self.__bizDb = MysqlPythonFacotry(self.__loadConf.biz_db_host,
                                          self.__loadConf.biz_db_user,
                                          self.__loadConf.biz_db_password,
                                          self.__loadConf.biz_db_database)

        self.__users = Biz_bx_users(self.__bizDb)
        self.__books = Biz_bx_books(self.__bizDb)
        self.__book_ratings = Biz_bx_book_ratings(self.__bizDb)

        self.__bizDb.open()
    ## End def __init__

    def toDB(self, path=''):
        """
            loads the BX book dataset. Path is where the BX files are
            located
        """

        i = 0  # total lines read across all three files
        j = 0  # lines read in the current file
        try:
            #
            # First load book ratings into the bx_book_ratings table
            #
            with codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8') as f:
                for line in f:
                    i += 1
                    j += 1

                    print(j)
                    print(line)

                    # separate line into fields
                    fields = line.split(';')
                    user = fields[0].strip('"')
                    book = fields[1].strip('"')
                    rating = int(fields[2].strip().strip('"'))

                    self.__book_ratings.insert(user, book, rating)

            #
            # Now load books into the bx_books table
            # Books contains isbn, title, and author among other fields
            #
            j = 0
            with codecs.open(path + "BX-Books.csv", 'r', 'utf8') as f:
                for line in f:
                    i += 1
                    j += 1

                    print(j)
                    print(line)

                    # separate line into fields
                    fields = regx(line)
                    isbn = fields[0].strip('"')
                    title = fields[1].strip('"')
                    author = fields[2].strip().strip('"')
                    yearOfPublication = fields[3].strip().strip('"')
                    publisher = fields[4].strip().strip('"')
                    imageUrlS = fields[5].strip().strip('"')
                    imageUrlM = fields[6].strip().strip('"')
                    imageUrlL = fields[7].strip().strip('"')

                    self.__books.insert(isbn, title, author, yearOfPublication, publisher, imageUrlS, imageUrlM, imageUrlL)

            #
            # Now load user info into the bx_users table
            #
            j = 0
            with codecs.open(path + "BX-Users.csv", 'r', 'utf8') as f:
                for line in f:
                    i += 1
                    j += 1

                    print(j)
                    print(line)

                    # separate line into fields
                    fields = regx(line)
                    userid = fields[0].strip('"')
                    location = fields[1].strip('"')
                    # age may be missing (no third quoted field); store 0 then
                    if len(fields) > 2:
                        age = fields[2].strip().strip('"')
                    else:
                        age = 0

                    self.__users.insert(userid, location, age)
        except ErrorMyProgram as e:
            print(e.value)
        finally:
            self.__bizDb.close()

        print(i)
    ## End def toDB
## End class LoadDataset
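
The loader reads its connection settings from an app.conf file with a [biz_db] section. Below is a minimal sketch of what such a file might contain, parsed here from a string with Python 3's `configparser`. The section and key names are the ones `LoadAppConf` reads. Every value is a placeholder and not the author's actual setting.

```python
import configparser

# Placeholder settings; only the section and key names come from LoadAppConf.
APP_CONF = """
[biz_db]
host = localhost
user = root
password = secret
database = bookcrossing
"""

config = configparser.ConfigParser()
config.read_string(APP_CONF)

# Collect the four values the loader expects.
settings = {key: config.get("biz_db", key)
            for key in ("host", "user", "password", "database")}
print(settings)
```

With such a file saved next to the script and the three tables already created, the import would presumably be started with something like `LoadDataset().toDB('/path/to/BX-CSV/')`. The path is a placeholder and needs a trailing separator, because `toDB` simply concatenates it with the file names.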
