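Understanding the Problem

Customer segmentation is the problem of dividing customers into groups according to the characteristics they share. A precondition of this problem is that no list of customer categories is available, only customer profiles.

Dataset

The available data consists of the company's historical offers (marketing campaigns) and the customers' purchase records.

offers.csv: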

Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak

1,January,Malbec,72,56,France,FALSE

2,January,Pinot Noir,72,17,France,FALSE

3,February,Espumante,144,32,Oregon,TRUE

4,February,Champagne,72,48,France,TRUE

5,February,Cabernet Sauvignon,144,44,New Zealand,TRUE

6,March,Prosecco,144,86,Chile,FALSE

7,March,Prosecco,6,40,Australia,TRUE

8,March,Espumante,6,45,South Africa,FALSE

9,April,Chardonnay,144,57,Chile,FALSE

10,April,Prosecco,72,52,California,FALSE

11,May,Champagne,72,85,France,FALSE

12,May,Prosecco,72,83,Australia,FALSE

13,May,Merlot,6,43,Chile,FALSE

14,June,Merlot,72,64,Chile,FALSE

15,June,Cabernet Sauvignon,144,19,Italy,FALSE

16,June,Merlot,72,88,California,FALSE

17,July,Pinot Noir,12,47,Germany,FALSE

18,July,Espumante,6,50,Oregon,FALSE

19,July,Champagne,12,66,Germany,FALSE

20,August,Cabernet Sauvignon,72,82,Italy,FALSE

21,August,Champagne,12,50,California,FALSE

22,August,Champagne,72,63,France,FALSE

23,September,Chardonnay,144,39,South Africa,FALSE

24,September,Pinot Noir,6,34,Italy,FALSE

25,October,Cabernet Sauvignon,72,59,Oregon,TRUE

26,October,Pinot Noir,144,83,Australia,FALSE

27,October,Champagne,72,88,New Zealand,FALSE

28,November,Cabernet Sauvignon,12,56,France,TRUE

29,November,Pinot Grigio,6,87,France,FALSE

30,December,Malbec,6,54,France,FALSE

31,December,Champagne,72,89,France,FALSE

32,December,Cabernet Sauvignon,72,45,Germany,TRUE

transactions.csv:

Customer Last Name,Offer #

Smith,2

Smith,24

Johnson,17

Johnson,24

Johnson,26

Williams,18

Williams,22

Williams,31

Brown,7

Brown,29

Brown,30

Jones,8

Miller,6

Miller,10

Miller,14

Miller,15

Miller,22

Miller,23

Miller,31

Davis,12

Davis,22

Davis,25

Garcia,14

Garcia,15

Rodriguez,2

Rodriguez,26

Wilson,8

Wilson,30

Martinez,12

Martinez,25

Martinez,28

Anderson,24

Anderson,26

Taylor,7

Taylor,18

Taylor,29

Taylor,30

Thomas,1

Thomas,4

Thomas,9

Thomas,11

Thomas,14

Thomas,26

Hernandez,28

Hernandez,29

Moore,17

Moore,24

Martin,2

Martin,11

Martin,28

Jackson,1

Jackson,2

Jackson,11

Jackson,15

Jackson,22

Thompson,9

Thompson,16

Thompson,25

Thompson,30

White,14

White,22

White,25

White,30

Lopez,9

Lopez,11

Lopez,15

Lopez,16

Lopez,27

Lee,3

Lee,4

Lee,6

Lee,22

Lee,27

Gonzalez,9

Gonzalez,31

Harris,4

Harris,6

Harris,7

Harris,19

Harris,22

Harris,27

Clark,4

Clark,11

Clark,28

Clark,31

Lewis,7

Lewis,8

Lewis,30

Robinson,7

Robinson,29

Walker,18

Walker,29

Perez,18

Perez,30

Hall,11

Hall,22

Young,6

Young,9

Young,15

Young,22

Young,31

Young,32

Allen,9

Allen,27

Sanchez,4

Sanchez,5

Sanchez,14

Sanchez,15

Sanchez,20

Sanchez,22

Sanchez,26

Wright,4

Wright,6

Wright,21

Wright,27

King,7

King,13

King,18

King,29

Scott,6

Scott,14

Scott,23

Green,7

Baker,7

Baker,10

Baker,19

Baker,31

Adams,18

Adams,29

Adams,30

Nelson,3

Nelson,4

Nelson,8

Nelson,31

Hill,8

Hill,13

Hill,18

Hill,30

Ramirez,9

Campbell,2

Campbell,24

Campbell,26

Mitchell,1

Mitchell,2

Roberts,31

Carter,7

Carter,13

Carter,29

Carter,30

Phillips,17

Phillips,24

Evans,22

Evans,27

Turner,4

Turner,6

Turner,27

Turner,31

Torres,8

Parker,11

Parker,16

Parker,20

Parker,29

Parker,31

Collins,11

Collins,30

Edwards,8

Edwards,27

Stewart,8

Stewart,29

Stewart,30

Flores,17

Flores,24

Morris,17

Morris,24

Morris,26

Nguyen,19

Nguyen,31

Murphy,7

Murphy,12

Rivera,7

Rivera,18

Cook,24

Cook,26

Rogers,3

Rogers,7

Rogers,8

Rogers,19

Rogers,21

Rogers,22

Morgan,8

Morgan,29

Peterson,1

Peterson,2

Peterson,10

Peterson,23

Peterson,26

Peterson,27

Cooper,4

Cooper,16

Cooper,20

Cooper,32

Reed,5

Reed,14

Bailey,7

Bailey,30

Bell,2

Bell,17

Bell,24

Bell,26

Gomez,11

Gomez,20

Gomez,25

Gomez,32

Kelly,6

Kelly,20

Kelly,31

Kelly,32

Howard,11

Howard,12

Howard,22

Ward,4

Cox,2

Cox,17

Cox,24

Cox,26

Diaz,7

Diaz,8

Diaz,29

Diaz,30

Richardson,3

Richardson,6

Richardson,22

Wood,1

Wood,10

Wood,14

Wood,31

Watson,7

Watson,29

Brooks,3

Brooks,8

Brooks,11

Brooks,22

Bennett,8

Bennett,29

Gray,12

Gray,16

Gray,26

James,7

James,8

James,13

James,18

James,30

Reyes,9

Reyes,23

Cruz,29

Cruz,30

Hughes,7

Hughes,8

Hughes,13

Hughes,29

Hughes,30

Price,1

Price,22

Price,30

Price,31

Myers,18

Myers,30

Long,3

Long,7

Long,10

Foster,1

Foster,9

Foster,14

Foster,22

Foster,23

Sanders,1

Sanders,4

Sanders,5

Sanders,6

Sanders,9

Sanders,11

Sanders,20

Sanders,25

Sanders,26

Ross,18

Ross,21

Morales,6

Morales,7

Morales,8

Morales,19

Morales,22

Morales,31

Powell,5

Sullivan,8

Sullivan,13

Sullivan,18

Russell,26

Ortiz,8

Jenkins,24

Jenkins,26

Gutierrez,6

Gutierrez,8

Gutierrez,10

Gutierrez,18

Perry,8

Perry,18

Perry,29

Perry,30

Butler,1

Butler,4

Butler,22

Butler,28

Butler,30

Barnes,10

Barnes,21

Barnes,22

Barnes,31

Fisher,1

Fisher,2

Fisher,11

Fisher,22

Fisher,28

Fisher,30

Fisher,31

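Preprocessing

The two datasets must be joined to obtain a single view. Because the purchases made by different customers need to be compared, a pivot table is also required: rows represent customers, columns represent offers, and each cell indicates whether the customer purchased that offer.

// Load both CSV files.
var offers = Offer.ReadFromCsv(_offersCsv);
var transactions = Transaction.ReadFromCsv(_transactionsCsv);

// Join offers and transactions on the offer id to get one flat view.
var clusterData = (from of in offers
                   join tr in transactions on of.OfferId equals tr.OfferId
                   select new
                   {
                       of.OfferId,
                       of.Campaign,
                       of.Discount,
                       tr.LastName,
                       of.LastPeak,
                       of.Minimum,
                       of.Origin,
                       of.Varietal,
                       Count = 1,
                   }).ToArray();

// The number of offers is the number of columns in the pivot table.
var count = offers.Count();

// Pivot: one row per customer, one feature per offer.
var pivotDataArray =
    (from c in clusterData
     group c by c.LastName into gcs
     let lookup = gcs.ToLookup(y => y.OfferId, y => y.Count)
     select new PivotData()
     {
         LastName = gcs.Key,
         Features = ToFeatures(lookup, count)
     }).ToArray();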

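The ToFeatures method builds the required feature array, whose length is the number of offers.

private static float[] ToFeatures(ILookup<string, int> lookup, int count)
{
    // One slot per offer; slots for offers the customer never bought stay 0.
    var result = new float[count];
    foreach (var item in lookup)
    {
        // Offer ids start at 1, array indices at 0.
        var key = Convert.ToInt32(item.Key) - 1;
        result[key] = item.Sum();
    }
    return result;
}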
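As a small worked example of the pivot step, the sketch below repeats the ToFeatures logic on hard-coded data. In transactions.csv the customer Smith bought offers 2 and 24, so with 32 offers the feature vector should be 1 at indices 1 and 23 and 0 everywhere else. The class name PivotSketch, the helper RowFor, and the Main method are illustrative and not part of the sample project.

```csharp
using System;
using System.Linq;

public static class PivotSketch
{
    // Same logic as ToFeatures in the tutorial: offer id "n" maps to index n - 1.
    public static float[] ToFeatures(ILookup<string, int> lookup, int count)
    {
        var result = new float[count];
        foreach (var item in lookup)
        {
            var key = Convert.ToInt32(item.Key) - 1;
            result[key] = item.Sum();
        }
        return result;
    }

    // Hypothetical helper: builds one pivot row from the offer ids a customer bought.
    public static float[] RowFor(string[] offerIds, int offerCount)
    {
        var lookup = offerIds.ToLookup(id => id, id => 1);
        return ToFeatures(lookup, offerCount);
    }

    public static void Main()
    {
        // Smith's purchases taken from transactions.csv; 32 offers in offers.csv.
        var smith = RowFor(new[] { "2", "24" }, 32);

        // Print the zero-based positions that are set.
        var setPositions = Enumerable.Range(0, smith.Length).Where(i => smith[i] > 0);
        Console.WriteLine(string.Join(",", setPositions)); // prints "1,23"
    }
}
```

Every customer gets a vector of the same length, which is what allows the clustering trainer to compare customers directly.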

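Data View

Once the array has been built, the CreateStreamingDataView method is used to construct the data view. Because the Features property is an array, its size must be declared explicitly.

var mlContext = new MLContext();

// Start from the schema inferred from PivotData, then fix the vector size.
var schemaDef = SchemaDefinition.Create(typeof(PivotData));
schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, count);

var pivotDataView = mlContext.CreateStreamingDataView(pivotDataArray, schemaDef);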

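PCA

PCA (Principal Component Analysis) reduces a large number of dimensions to a range that is easier to analyze; here the data is reduced to two dimensions. The resulting PCAFeatures column is used later as the plot coordinates.

new PrincipalComponentAnalysisEstimator(mlContext, "Features", "PCAFeatures", rank: 2)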

OneHotEncoding

One-hot encoding is used here to convert LastName from a string into a numeric indicator vector.

new OneHotEncodingEstimator(mlContext, new[] { new OneHotEncodingEstimator.ColumnInfo("LastName", "LastNameKey", OneHotEncodingTransformer.OutputKind.Ind) })
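To make the idea of an indicator vector concrete, here is a minimal sketch of one-hot encoding in plain C#. It is not how the ML.NET transformer is implemented; in particular, assigning slots in order of first appearance is an assumption made for this illustration, and the class name OneHotSketch is hypothetical.

```csharp
using System;
using System.Linq;

public static class OneHotSketch
{
    // Assigns each distinct value a slot (in order of first appearance, an
    // assumption of this sketch) and returns one indicator vector per input.
    public static float[][] Encode(string[] values)
    {
        var vocabulary = values.Distinct().ToArray();
        return values
            .Select(v =>
            {
                var vector = new float[vocabulary.Length];
                vector[Array.IndexOf(vocabulary, v)] = 1f;
                return vector;
            })
            .ToArray();
    }

    public static void Main()
    {
        var encoded = Encode(new[] { "Smith", "Johnson", "Smith" });
        foreach (var vector in encoded)
            Console.WriteLine(string.Join(" ", vector));
        // prints:
        // 1 0
        // 0 1
        // 1 0
    }
}
```

Each vector contains exactly one 1, so equal names produce equal vectors and different names never overlap.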

Trainer

K-Means is a commonly used trainer for clustering problems. Here we assume the customers are to be divided into three clusters.

mlContext.Clustering.Trainers.KMeans("Features", clustersCount: 3)
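For readers unfamiliar with K-Means, the following is a minimal sketch of the basic algorithm (Lloyd's iteration) in plain C#. It is a simplification for illustration, not ML.NET's trainer: the initialization simply takes the first k points as centroids, whereas the trainer printed in the evaluation output is a KMeansPlusPlusTrainer. The class name KMeansSketch and its parameters are hypothetical.

```csharp
using System;
using System.Linq;

public static class KMeansSketch
{
    static double SquaredDistance(float[] a, float[] b) =>
        a.Zip(b, (x, y) => (double)(x - y) * (x - y)).Sum();

    // Index of the centroid closest to the point (lowest index wins ties).
    static int Nearest(float[] point, float[][] centroids) =>
        Enumerable.Range(0, centroids.Length)
                  .OrderBy(c => SquaredDistance(point, centroids[c]))
                  .First();

    // Returns the cluster index assigned to each point.
    public static int[] Cluster(float[][] points, int k, int maxIterations = 100)
    {
        // Simplistic initialization (assumption of this sketch): first k points.
        var centroids = points.Take(k).Select(p => (float[])p.Clone()).ToArray();
        var assignments = new int[points.Length];

        for (var iteration = 0; iteration < maxIterations; iteration++)
        {
            // Assignment step: attach every point to its nearest centroid.
            var updated = points.Select(p => Nearest(p, centroids)).ToArray();
            if (iteration > 0 && updated.SequenceEqual(assignments)) break;
            assignments = updated;

            // Update step: move each centroid to the mean of its members.
            for (var c = 0; c < k; c++)
            {
                var members = points.Where((p, i) => assignments[i] == c).ToArray();
                if (members.Length == 0) continue; // empty cluster keeps its centroid
                centroids[c] = Enumerable.Range(0, members[0].Length)
                                         .Select(d => members.Average(m => m[d]))
                                         .ToArray();
            }
        }
        return assignments;
    }

    public static void Main()
    {
        var points = new[]
        {
            new[] { 0f, 0f }, new[] { 0f, 1f },
            new[] { 10f, 10f }, new[] { 10f, 11f },
        };
        Console.WriteLine(string.Join(",", Cluster(points, k: 2))); // prints "0,0,1,1"
    }
}
```

The loop stops as soon as an iteration leaves every assignment unchanged, which is the usual convergence criterion for this algorithm.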

Training the Model

ITransformer trainedModel = trainingPipeline.Fit(pivotDataView);

Evaluating the Model

// Score the training data and compute the clustering metrics.
var predictions = trainedModel.Transform(pivotDataView);
var metrics = mlContext.Clustering.Evaluate(predictions, score: "Score", features: "Features");

Console.WriteLine($"*************************************************");
Console.WriteLine($"*       Metrics for {trainer} clustering model      ");
Console.WriteLine($"*------------------------------------------------");
Console.WriteLine($"*       AvgMinScore: {metrics.AvgMinScore}");
Console.WriteLine($"*       DBI is: {metrics.Dbi}");
Console.WriteLine($"*************************************************");

This produces the following evaluation results.

*************************************************
*       Metrics for Microsoft.ML.Trainers.KMeans.KMeansPlusPlusTrainer clustering model
*------------------------------------------------
*       AvgMinScore: 2.3154067927599
*       DBI is: 2.69100740819456
*************************************************
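DBI stands for the Davies-Bouldin index, for which lower values indicate more compact and better separated clusters. The sketch below computes the index from its textbook definition using Euclidean distances; whether ML.NET's evaluator uses exactly these conventions is an assumption, so the numbers may not match metrics.Dbi. The class name DaviesBouldinSketch is hypothetical.

```csharp
using System;
using System.Linq;

public static class DaviesBouldinSketch
{
    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

    // Textbook Davies-Bouldin index; requires at least two clusters.
    public static double Compute(double[][] points, int[] labels)
    {
        var clusterIds = labels.Distinct().OrderBy(id => id).ToArray();

        // Centroid of each cluster: per-dimension mean of its members.
        var centroids = clusterIds.Select(id =>
        {
            var members = points.Where((p, i) => labels[i] == id).ToArray();
            return Enumerable.Range(0, members[0].Length)
                             .Select(d => members.Average(m => m[d]))
                             .ToArray();
        }).ToArray();

        // Scatter of each cluster: mean distance of its members to the centroid.
        var scatter = clusterIds.Select((id, c) =>
            points.Where((p, i) => labels[i] == id)
                  .Average(p => Distance(p, centroids[c]))).ToArray();

        // For each cluster take its worst (largest) similarity ratio to any
        // other cluster, then average those worst cases.
        return Enumerable.Range(0, clusterIds.Length)
            .Average(i => Enumerable.Range(0, clusterIds.Length)
                .Where(j => j != i)
                .Max(j => (scatter[i] + scatter[j]) / Distance(centroids[i], centroids[j])));
    }

    public static void Main()
    {
        var points = new[]
        {
            new[] { 0.0, 0.0 }, new[] { 0.0, 2.0 },
            new[] { 10.0, 0.0 }, new[] { 10.0, 2.0 },
        };
        var labels = new[] { 0, 0, 1, 1 };
        // Scatter is 1 for both clusters and the centroids are 10 apart,
        // so the index is (1 + 1) / 10 = 0.2.
        Console.WriteLine(Compute(points, labels));
    }
}
```

In this toy data the clusters are tight and far apart, so the index is small; the value of about 2.69 reported above suggests that the three customer clusters overlap considerably in the 32-dimensional feature space.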

Using the Model

// Materialize the predictions as strongly typed objects.
var clusteringPredictions = predictions
    .AsEnumerable<ClusteringPrediction>(mlContext, false)
    .ToArray();

Plotting

For a more intuitive view, the OxyPlot library can be used to render the results as an image.

Add the library:

dotnet add package OxyPlot.Core

Generating the plot:

var plot = new PlotModel { Title = "Customer Segmentation", IsLegendVisible = true };

// One scatter series per cluster id.
var clusters = clusteringPredictions.Select(p => p.SelectedClusterId).Distinct().OrderBy(x => x);

foreach (var cluster in clusters)
{
    var scatter = new ScatterSeries { MarkerType = MarkerType.Circle, MarkerStrokeThickness = 2, Title = $"Cluster: {cluster}", RenderInLegend = true };

    // The two PCA components are used as the x and y coordinates.
    var series = clusteringPredictions
        .Where(p => p.SelectedClusterId == cluster)
        .Select(p => new ScatterPoint(p.Location[0], p.Location[1])).ToArray();

    scatter.Points.AddRange(series);
    plot.Series.Add(scatter);
}

// Give every series a distinct color.
plot.DefaultColors = OxyPalettes.HueDistinct(plot.Series.Count).Colors;

// Export the plot as an SVG file.
var exporter = new SvgExporter { Width = 600, Height = 400 };
using (var fs = new System.IO.FileStream(_plotSvg, System.IO.FileMode.Create))
{
    exporter.Export(plot, fs);
}

The final image is written to customerSegmentation.svg: a scatter plot with one colored series per cluster.

Complete Sample Code

Program class:

using CustomerSegmentation.DataStructures;
using Microsoft.ML;
using System;
using System.IO;
using System.Linq;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Transforms.Projections;
using Microsoft.ML.Transforms.Categorical;
using Microsoft.ML.Runtime.Data;
using OxyPlot;
using OxyPlot.Series;
using Microsoft.ML.Core.Data;

namespace CustomerSegmentation
{
    class Program
    {
        // Builds one pivot row: one slot per offer, set for each offer bought.
        private static float[] ToFeatures(ILookup<string, int> lookup, int count)
        {
            var result = new float[count];
            foreach (var item in lookup)
            {
                // Offer ids start at 1, array indices at 0.
                var key = Convert.ToInt32(item.Key) - 1;
                result[key] = item.Sum();
            }
            return result;
        }

        static readonly string _offersCsv = Path.Combine(Environment.CurrentDirectory, "assets", "offers.csv");
        static readonly string _transactionsCsv = Path.Combine(Environment.CurrentDirectory, "assets", "transactions.csv");
        static readonly string _plotSvg = Path.Combine(Environment.CurrentDirectory, "assets", "customerSegmentation.svg");

        static void Main(string[] args)
        {
            // Preprocessing: load, join and pivot the data.
            var offers = Offer.ReadFromCsv(_offersCsv);
            var transactions = Transaction.ReadFromCsv(_transactionsCsv);

            var clusterData = (from of in offers
                               join tr in transactions on of.OfferId equals tr.OfferId
                               select new
                               {
                                   of.OfferId,
                                   of.Campaign,
                                   of.Discount,
                                   tr.LastName,
                                   of.LastPeak,
                                   of.Minimum,
                                   of.Origin,
                                   of.Varietal,
                                   Count = 1,
                               }).ToArray();

            var count = offers.Count();

            var pivotDataArray =
                (from c in clusterData
                 group c by c.LastName into gcs
                 let lookup = gcs.ToLookup(y => y.OfferId, y => y.Count)
                 select new PivotData()
                 {
                     LastName = gcs.Key,
                     Features = ToFeatures(lookup, count)
                 }).ToArray();

            // Data view: the Features vector size must be declared explicitly.
            var mlContext = new MLContext();
            var schemaDef = SchemaDefinition.Create(typeof(PivotData));
            schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, count);
            var pivotDataView = mlContext.CreateStreamingDataView(pivotDataArray, schemaDef);

            // Pipeline: PCA (for plotting), one-hot encoding, K-Means trainer.
            var dataProcessPipeline = new PrincipalComponentAnalysisEstimator(mlContext, "Features", "PCAFeatures", rank: 2)
                                .Append(new OneHotEncodingEstimator(mlContext,
                                new[] { new OneHotEncodingEstimator.ColumnInfo("LastName", "LastNameKey", OneHotEncodingTransformer.OutputKind.Ind) }));

            var trainer = mlContext.Clustering.Trainers.KMeans("Features", clustersCount: 3);
            var trainingPipeline = dataProcessPipeline.Append(trainer);

            // Train and evaluate.
            ITransformer trainedModel = trainingPipeline.Fit(pivotDataView);
            var predictions = trainedModel.Transform(pivotDataView);
            var metrics = mlContext.Clustering.Evaluate(predictions, score: "Score", features: "Features");

            Console.WriteLine($"*************************************************");
            Console.WriteLine($"*       Metrics for {trainer} clustering model      ");
            Console.WriteLine($"*------------------------------------------------");
            Console.WriteLine($"*       AvgMinScore: {metrics.AvgMinScore}");
            Console.WriteLine($"*       DBI is: {metrics.Dbi}");
            Console.WriteLine($"*************************************************");

            // Use the model: read the predictions as typed objects.
            var clusteringPredictions = predictions
                .AsEnumerable<ClusteringPrediction>(mlContext, false)
                .ToArray();

            // Plot one scatter series per cluster and export it as SVG.
            var plot = new PlotModel { Title = "Customer Segmentation", IsLegendVisible = true };
            var clusters = clusteringPredictions.Select(p => p.SelectedClusterId).Distinct().OrderBy(x => x);

            foreach (var cluster in clusters)
            {
                var scatter = new ScatterSeries { MarkerType = MarkerType.Circle, MarkerStrokeThickness = 2, Title = $"Cluster: {cluster}", RenderInLegend = true };
                var series = clusteringPredictions
                    .Where(p => p.SelectedClusterId == cluster)
                    .Select(p => new ScatterPoint(p.Location[0], p.Location[1])).ToArray();
                scatter.Points.AddRange(series);
                plot.Series.Add(scatter);
            }

            plot.DefaultColors = OxyPalettes.HueDistinct(plot.Series.Count).Colors;

            var exporter = new SvgExporter { Width = 600, Height = 400 };
            using (var fs = new System.IO.FileStream(_plotSvg, System.IO.FileMode.Create))
            {
                exporter.Export(plot, fs);
            }

            Console.Read();
        }
    }
}

Offer class:

using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace CustomerSegmentation.DataStructures
{
    public class Offer
    {
        //Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak
        public string OfferId { get; set; }
        public string Campaign { get; set; }
        public string Varietal { get; set; }
        public float Minimum { get; set; }
        public float Discount { get; set; }
        public string Origin { get; set; }
        public string LastPeak { get; set; }

        public static IEnumerable<Offer> ReadFromCsv(string file)
        {
            return File.ReadAllLines(file)
             .Skip(1) // skip header
             .Select(x => x.Split(','))
             .Select(x => new Offer()
             {
                 OfferId = x[0],
                 Campaign = x[1],
                 Varietal = x[2],
                 Minimum = float.Parse(x[3]),
                 Discount = float.Parse(x[4]),
                 Origin = x[5],
                 LastPeak = x[6]
             });
        }
    }
}

Transaction class:

using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace CustomerSegmentation.DataStructures
{
    public class Transaction
    {
        //Customer Last Name,Offer #
        //Smith,2
        public string LastName { get; set; }
        public string OfferId { get; set; }

        public static IEnumerable<Transaction> ReadFromCsv(string file)
        {
            return File.ReadAllLines(file)
             .Skip(1) // skip header
             .Select(x => x.Split(','))
             .Select(x => new Transaction()
             {
                 LastName = x[0],
                 OfferId = x[1],
             });
        }
    }
}

PivotData class:

namespace CustomerSegmentation.DataStructures
{
    public class PivotData
    {
        // One slot per offer; the size is declared in the schema definition.
        public float[] Features;
        public string LastName;
    }
}

ClusteringPrediction class:

using Microsoft.ML.Runtime.Api;

namespace CustomerSegmentation.DataStructures
{
    public class ClusteringPrediction
    {
        // Id of the cluster the customer was assigned to.
        [ColumnName("PredictedLabel")]
        public uint SelectedClusterId;

        [ColumnName("Score")]
        public float[] Distance;

        // Two PCA components, used as plot coordinates.
        [ColumnName("PCAFeatures")]
        public float[] Location;

        [ColumnName("LastName")]
        public string LastName;
    }
}
