理解问题

客户细分需要解决的问题是按照客户之间的相似特征区分不同客户群体。这个问题的先决条件中没有可供使用的客户分类列表,只有客户的人物画像。

数据集

已有的数据是公司的历史商业活动记录以及客户的购买记录。

offer.csv:

Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak
1,January,Malbec,72,56,France,FALSE
2,January,Pinot Noir,72,17,France,FALSE
3,February,Espumante,144,32,Oregon,TRUE
4,February,Champagne,72,48,France,TRUE
5,February,Cabernet Sauvignon,144,44,New Zealand,TRUE
6,March,Prosecco,144,86,Chile,FALSE
7,March,Prosecco,6,40,Australia,TRUE
8,March,Espumante,6,45,South Africa,FALSE
9,April,Chardonnay,144,57,Chile,FALSE
10,April,Prosecco,72,52,California,FALSE
11,May,Champagne,72,85,France,FALSE
12,May,Prosecco,72,83,Australia,FALSE
13,May,Merlot,6,43,Chile,FALSE
14,June,Merlot,72,64,Chile,FALSE
15,June,Cabernet Sauvignon,144,19,Italy,FALSE
16,June,Merlot,72,88,California,FALSE
17,July,Pinot Noir,12,47,Germany,FALSE
18,July,Espumante,6,50,Oregon,FALSE
19,July,Champagne,12,66,Germany,FALSE
20,August,Cabernet Sauvignon,72,82,Italy,FALSE
21,August,Champagne,12,50,California,FALSE
22,August,Champagne,72,63,France,FALSE
23,September,Chardonnay,144,39,South Africa,FALSE
24,September,Pinot Noir,6,34,Italy,FALSE
25,October,Cabernet Sauvignon,72,59,Oregon,TRUE
26,October,Pinot Noir,144,83,Australia,FALSE
27,October,Champagne,72,88,New Zealand,FALSE
28,November,Cabernet Sauvignon,12,56,France,TRUE
29,November,Pinot Grigio,6,87,France,FALSE
30,December,Malbec,6,54,France,FALSE
31,December,Champagne,72,89,France,FALSE
32,December,Cabernet Sauvignon,72,45,Germany,TRUE

transaction.csv:

Customer Last Name,Offer #
Smith,2
Smith,24
Johnson,17
Johnson,24
Johnson,26
Williams,18
Williams,22
Williams,31
Brown,7
Brown,29
Brown,30
Jones,8
Miller,6
Miller,10
Miller,14
Miller,15
Miller,22
Miller,23
Miller,31
Davis,12
Davis,22
Davis,25
Garcia,14
Garcia,15
Rodriguez,2
Rodriguez,26
Wilson,8
Wilson,30
Martinez,12
Martinez,25
Martinez,28
Anderson,24
Anderson,26
Taylor,7
Taylor,18
Taylor,29
Taylor,30
Thomas,1
Thomas,4
Thomas,9
Thomas,11
Thomas,14
Thomas,26
Hernandez,28
Hernandez,29
Moore,17
Moore,24
Martin,2
Martin,11
Martin,28
Jackson,1
Jackson,2
Jackson,11
Jackson,15
Jackson,22
Thompson,9
Thompson,16
Thompson,25
Thompson,30
White,14
White,22
White,25
White,30
Lopez,9
Lopez,11
Lopez,15
Lopez,16
Lopez,27
Lee,3
Lee,4
Lee,6
Lee,22
Lee,27
Gonzalez,9
Gonzalez,31
Harris,4
Harris,6
Harris,7
Harris,19
Harris,22
Harris,27
Clark,4
Clark,11
Clark,28
Clark,31
Lewis,7
Lewis,8
Lewis,30
Robinson,7
Robinson,29
Walker,18
Walker,29
Perez,18
Perez,30
Hall,11
Hall,22
Young,6
Young,9
Young,15
Young,22
Young,31
Young,32
Allen,9
Allen,27
Sanchez,4
Sanchez,5
Sanchez,14
Sanchez,15
Sanchez,20
Sanchez,22
Sanchez,26
Wright,4
Wright,6
Wright,21
Wright,27
King,7
King,13
King,18
King,29
Scott,6
Scott,14
Scott,23
Green,7
Baker,7
Baker,10
Baker,19
Baker,31
Adams,18
Adams,29
Adams,30
Nelson,3
Nelson,4
Nelson,8
Nelson,31
Hill,8
Hill,13
Hill,18
Hill,30
Ramirez,9
Campbell,2
Campbell,24
Campbell,26
Mitchell,1
Mitchell,2
Roberts,31
Carter,7
Carter,13
Carter,29
Carter,30
Phillips,17
Phillips,24
Evans,22
Evans,27
Turner,4
Turner,6
Turner,27
Turner,31
Torres,8
Parker,11
Parker,16
Parker,20
Parker,29
Parker,31
Collins,11
Collins,30
Edwards,8
Edwards,27
Stewart,8
Stewart,29
Stewart,30
Flores,17
Flores,24
Morris,17
Morris,24
Morris,26
Nguyen,19
Nguyen,31
Murphy,7
Murphy,12
Rivera,7
Rivera,18
Cook,24
Cook,26
Rogers,3
Rogers,7
Rogers,8
Rogers,19
Rogers,21
Rogers,22
Morgan,8
Morgan,29
Peterson,1
Peterson,2
Peterson,10
Peterson,23
Peterson,26
Peterson,27
Cooper,4
Cooper,16
Cooper,20
Cooper,32
Reed,5
Reed,14
Bailey,7
Bailey,30
Bell,2
Bell,17
Bell,24
Bell,26
Gomez,11
Gomez,20
Gomez,25
Gomez,32
Kelly,6
Kelly,20
Kelly,31
Kelly,32
Howard,11
Howard,12
Howard,22
Ward,4
Cox,2
Cox,17
Cox,24
Cox,26
Diaz,7
Diaz,8
Diaz,29
Diaz,30
Richardson,3
Richardson,6
Richardson,22
Wood,1
Wood,10
Wood,14
Wood,31
Watson,7
Watson,29
Brooks,3
Brooks,8
Brooks,11
Brooks,22
Bennett,8
Bennett,29
Gray,12
Gray,16
Gray,26
James,7
James,8
James,13
James,18
James,30
Reyes,9
Reyes,23
Cruz,29
Cruz,30
Hughes,7
Hughes,8
Hughes,13
Hughes,29
Hughes,30
Price,1
Price,22
Price,30
Price,31
Myers,18
Myers,30
Long,3
Long,7
Long,10
Foster,1
Foster,9
Foster,14
Foster,22
Foster,23
Sanders,1
Sanders,4
Sanders,5
Sanders,6
Sanders,9
Sanders,11
Sanders,20
Sanders,25
Sanders,26
Ross,18
Ross,21
Morales,6
Morales,7
Morales,8
Morales,19
Morales,22
Morales,31
Powell,5
Sullivan,8
Sullivan,13
Sullivan,18
Russell,26
Ortiz,8
Jenkins,24
Jenkins,26
Gutierrez,6
Gutierrez,8
Gutierrez,10
Gutierrez,18
Perry,8
Perry,18
Perry,29
Perry,30
Butler,1
Butler,4
Butler,22
Butler,28
Butler,30
Barnes,10
Barnes,21
Barnes,22
Barnes,31
Fisher,1
Fisher,2
Fisher,11
Fisher,22
Fisher,28
Fisher,30
Fisher,31

预处理

需要对两个数据集做关联处理,这样才能得到单一的视图。同时由于需要比较客户所产生的交易,还需要建立一张透视表。行代表客户,列代表商业活动,单元格值则显示是否客户有购买行为。

var offers = Offer.ReadFromCsv(_offersCsv);
var transactions = Transaction.ReadFromCsv(_transactionsCsv); var clusterData = (from of in offers
join tr in transactions on of.OfferId equals tr.OfferId
select new
{
of.OfferId,
of.Campaign,
of.Discount,
tr.LastName,
of.LastPeak,
of.Minimum,
of.Origin,
of.Varietal,
Count = 1,
}).ToArray(); var count = offers.Count();
var pivotDataArray =
(from c in clusterData
group c by c.LastName into gcs
let lookup = gcs.ToLookup(y => y.OfferId, y => y.Count)
select new PivotData()
{
LastName = gcs.Key,
Features = ToFeatures(lookup, count)
}).ToArray();

ToFeatures方法依据商业活动的数量,生成所需的特征数组。

private static float[] ToFeatures(ILookup<string, int> lookup, int count)
{
var result = new float[count];
foreach (var item in lookup)
{
var key = Convert.ToInt32(item.Key) - 1;
result[key] = item.Sum();
} return result;
}

数据视图

取得用于生成视图的数组后,这里使用CreateStreamingDataView方法构建数据视图。而又因为Features属性是一个数组,所以必须声明其大小。

var mlContext = new MLContext();

var schemaDef = SchemaDefinition.Create(typeof(PivotData));
schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, count); var pivotDataView = mlContext.CreateStreamingDataView(pivotDataArray, schemaDef);

PCA

PCA(principal Component Analysis),主成分分析,是为了将过多的维度值减少至一个合适的范围以便于分析,这里是降到二维空间。

new PrincipalComponentAnalysisEstimator(mlContext, "Features", "PCAFeatures", rank: 2)

OneHotEncoding

One Hot Encoding在此处的作用是将LastName从字符串转换为数字矩阵。

new OneHotEncodingEstimator(mlContext, new[] { new OneHotEncodingEstimator.ColumnInfo("LastName", "LastNameKey", OneHotEncodingTransformer.OutputKind.Ind) })

训练器

K-Means是常用的应对聚类问题的训练器,这里假设要分为三类。

mlContext.Clustering.Trainers.KMeans("Features", clustersCount: 3)

训练模型

trainingPipeline.Fit(pivotDataView);

评估模型

var predictions = trainedModel.Transform(pivotDataView);
var metrics = mlContext.Clustering.Evaluate(predictions, score: "Score", features: "Features"); Console.WriteLine($"*************************************************");
Console.WriteLine($"* Metrics for {trainer} clustering model ");
Console.WriteLine($"*------------------------------------------------");
Console.WriteLine($"* AvgMinScore: {metrics.AvgMinScore}");
Console.WriteLine($"* DBI is: {metrics.Dbi}");
Console.WriteLine($"*************************************************");

可得到如下的评估结果。

*************************************************
* Metrics for Microsoft.ML.Trainers.KMeans.KMeansPlusPlusTrainer clustering model
*------------------------------------------------
* AvgMinScore: 2.3154067927599
* DBI is: 2.69100740819456
*************************************************

使用模型

var clusteringPredictions = predictions
.AsEnumerable<ClusteringPrediction>(mlContext, false)
.ToArray();

画图

为了更直观地观察,可以用OxyPlot类库生成结果图片。

添加类库:

dotnet add package OxyPlot.Core

Plot生成处理:

var plot = new PlotModel { Title = "Customer Segmentation", IsLegendVisible = true };

var clusters = clusteringPredictions.Select(p => p.SelectedClusterId).Distinct().OrderBy(x => x);

foreach (var cluster in clusters)
{
var scatter = new ScatterSeries { MarkerType = MarkerType.Circle, MarkerStrokeThickness = 2, Title = $"Cluster: {cluster}", RenderInLegend = true };
var series = clusteringPredictions
.Where(p => p.SelectedClusterId == cluster)
.Select(p => new ScatterPoint(p.Location[0], p.Location[1])).ToArray();
scatter.Points.AddRange(series);
plot.Series.Add(scatter);
} plot.DefaultColors = OxyPalettes.HueDistinct(plot.Series.Count).Colors; var exporter = new SvgExporter { Width = 600, Height = 400 };
using (var fs = new System.IO.FileStream(_plotSvg, System.IO.FileMode.Create))
{
exporter.Export(plot, fs);
}

最后的图片如下所示:

完整示例代码

Program类:

using CustomerSegmentation.DataStructures;
using Microsoft.ML;
using System;
using System.IO;
using System.Linq;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Transforms.Projections;
using Microsoft.ML.Transforms.Categorical;
using Microsoft.ML.Runtime.Data;
using OxyPlot;
using OxyPlot.Series;
using Microsoft.ML.Core.Data; namespace CustomerSegmentation
{
class Program
{
private static float[] ToFeatures(ILookup<string, int> lookup, int count)
{
var result = new float[count];
foreach (var item in lookup)
{
var key = Convert.ToInt32(item.Key) - 1;
result[key] = item.Sum();
} return result;
} static readonly string _offersCsv = Path.Combine(Environment.CurrentDirectory, "assets", "offers.csv");
static readonly string _transactionsCsv = Path.Combine(Environment.CurrentDirectory, "assets", "transactions.csv");
static readonly string _plotSvg = Path.Combine(Environment.CurrentDirectory, "assets", "customerSegmentation.svg"); static void Main(string[] args)
{
var offers = Offer.ReadFromCsv(_offersCsv);
var transactions = Transaction.ReadFromCsv(_transactionsCsv); var clusterData = (from of in offers
join tr in transactions on of.OfferId equals tr.OfferId
select new
{
of.OfferId,
of.Campaign,
of.Discount,
tr.LastName,
of.LastPeak,
of.Minimum,
of.Origin,
of.Varietal,
Count = 1,
}).ToArray(); var count = offers.Count();
var pivotDataArray =
(from c in clusterData
group c by c.LastName into gcs
let lookup = gcs.ToLookup(y => y.OfferId, y => y.Count)
select new PivotData()
{
LastName = gcs.Key,
Features = ToFeatures(lookup, count)
}).ToArray(); var mlContext = new MLContext(); var schemaDef = SchemaDefinition.Create(typeof(PivotData));
schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, count); var pivotDataView = mlContext.CreateStreamingDataView(pivotDataArray, schemaDef); var dataProcessPipeline = new PrincipalComponentAnalysisEstimator(mlContext, "Features", "PCAFeatures", rank: 2)
.Append(new OneHotEncodingEstimator(mlContext,
new[] { new OneHotEncodingEstimator.ColumnInfo("LastName", "LastNameKey", OneHotEncodingTransformer.OutputKind.Ind) })); var trainer = mlContext.Clustering.Trainers.KMeans("Features", clustersCount: 3);
var trainingPipeline = dataProcessPipeline.Append(trainer); ITransformer trainedModel = trainingPipeline.Fit(pivotDataView); var predictions = trainedModel.Transform(pivotDataView);
var metrics = mlContext.Clustering.Evaluate(predictions, score: "Score", features: "Features"); Console.WriteLine($"*************************************************");
Console.WriteLine($"* Metrics for {trainer} clustering model ");
Console.WriteLine($"*------------------------------------------------");
Console.WriteLine($"* AvgMinScore: {metrics.AvgMinScore}");
Console.WriteLine($"* DBI is: {metrics.Dbi}");
Console.WriteLine($"*************************************************"); var clusteringPredictions = predictions
.AsEnumerable<ClusteringPrediction>(mlContext, false)
.ToArray(); var plot = new PlotModel { Title = "Customer Segmentation", IsLegendVisible = true }; var clusters = clusteringPredictions.Select(p => p.SelectedClusterId).Distinct().OrderBy(x => x); foreach (var cluster in clusters)
{
var scatter = new ScatterSeries { MarkerType = MarkerType.Circle, MarkerStrokeThickness = 2, Title = $"Cluster: {cluster}", RenderInLegend = true };
var series = clusteringPredictions
.Where(p => p.SelectedClusterId == cluster)
.Select(p => new ScatterPoint(p.Location[0], p.Location[1])).ToArray();
scatter.Points.AddRange(series);
plot.Series.Add(scatter);
} plot.DefaultColors = OxyPalettes.HueDistinct(plot.Series.Count).Colors; var exporter = new SvgExporter { Width = 600, Height = 400 };
using (var fs = new System.IO.FileStream(_plotSvg, System.IO.FileMode.Create))
{
exporter.Export(plot, fs);
} Console.Read();
}
}
}

Offer类:

using System.Collections.Generic;
using System.IO;
using System.Linq; namespace CustomerSegmentation.DataStructures
{
public class Offer
{
//Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak
public string OfferId { get; set; }
public string Campaign { get; set; }
public string Varietal { get; set; }
public float Minimum { get; set; }
public float Discount { get; set; }
public string Origin { get; set; }
public string LastPeak { get; set; } public static IEnumerable<Offer> ReadFromCsv(string file)
{
return File.ReadAllLines(file)
.Skip(1) // skip header
.Select(x => x.Split(','))
.Select(x => new Offer()
{
OfferId = x[0],
Campaign = x[1],
Varietal = x[2],
Minimum = float.Parse(x[3]),
Discount = float.Parse(x[4]),
Origin = x[5],
LastPeak = x[6]
});
}
}
}

Transaction类:

using System.Collections.Generic;
using System.IO;
using System.Linq; namespace CustomerSegmentation.DataStructures
{
public class Transaction
{
//Customer Last Name,Offer #
//Smith,2
public string LastName { get; set; }
public string OfferId { get; set; } public static IEnumerable<Transaction> ReadFromCsv(string file)
{
return File.ReadAllLines(file)
.Skip(1) // skip header
.Select(x => x.Split(','))
.Select(x => new Transaction()
{
LastName = x[0],
OfferId = x[1],
});
}
}
}

PivotData类:

namespace CustomerSegmentation.DataStructures
{
public class PivotData
{
public float[] Features;
public string LastName;
}
}

ClusteringPrediction类:

using Microsoft.ML.Runtime.Api;
using System;
using System.Collections.Generic;
using System.Text; namespace CustomerSegmentation.DataStructures
{
public class ClusteringPrediction
{
[ColumnName("PredictedLabel")]
public uint SelectedClusterId;
[ColumnName("Score")]
public float[] Distance;
[ColumnName("PCAFeatures")]
public float[] Location;
[ColumnName("LastName")]
public string LastName;
}
}

ML.NET教程之客户细分(聚类问题)的更多相关文章

  1. ML.NET 示例:聚类之客户细分

    写在前面 准备近期将微软的machinelearning-samples翻译成中文,水平有限,如有错漏,请大家多多指正. 如果有朋友对此感兴趣,可以加入我:https://github.com/fei ...

  2. 数据挖掘应用案例:RFM模型分析与客户细分(转)

    正好刚帮某电信行业完成一个数据挖掘工作,其中的RFM模型还是有一定代表性,就再把数据挖掘RFM模型的建模思路细节与大家分享一下吧!手机充值业务是一项主要电信业务形式,客户的充值行为记录正好满足RFM模 ...

  3. RFM模型的应用 - 电商客户细分(转)

    RFM模型是网点衡量当前用户价值和客户潜在价值的重要工具和手段.RFM是Rencency(最近一次消费),Frequency(消费频率).Monetary(消费金额) 消费指的是客户在店铺消费最近一次 ...

  4. ML.NET教程之情感分析(二元分类问题)

    机器学习的工作流程分为以下几个步骤: 理解问题 准备数据 加载数据 提取特征 构建与训练 训练模型 评估模型 运行 使用模型 理解问题 本教程需要解决的问题是根据网站内评论的意见采取合适的行动. 可用 ...

  5. ML.NET技术研究系列-2聚类算法KMeans

    上一篇博文我们介绍了ML.NET 的入门: ML.NET技术研究系列1-入门篇 本文我们继续,研究分享一下聚类算法k-means. 一.k-means算法简介 k-means算法是一种聚类算法,所谓聚 ...

  6. ML.NET教程之出租车车费预测(回归问题)

    理解问题 出租车的车费不仅与距离有关,还涉及乘客数量,是否使用信用卡等因素(这是的出租车是指纽约市的).所以并不是一个简单的一元方程问题. 准备数据 建立一控制台应用程序工程,新建Data文件夹,在其 ...

  7. oracle基础教程oracle客户端详解

    oracle基础教程oracle客户端工具详解 参考网址:http://www.oraclejsq.com/article/010100114.html 该教程介绍了oracle自带客户端sqlplu ...

  8. ML.NET 示例:开篇

    写在前面 准备近期将微软的machinelearning-samples翻译成中文,水平有限,如有错漏,请大家多多指正. 如果有朋友对此感兴趣,可以加入我:https://github.com/fei ...

  9. ML.NET 示例:目录

    ML.NET 示例中文版:https://github.com/feiyun0112/machinelearning-samples.zh-cn 英文原版请访问:https://github.com/ ...

随机推荐

  1. 基于Centos搭建 Hadoop 伪分布式环境

    软硬件环境: CentOS 7.2 64 位, OpenJDK- 1.8,Hadoop- 2.7 关于本教程的说明 云实验室云主机自动使用 root 账户登录系统,因此本教程中所有的操作都是以 roo ...

  2. 详解C#特性和反射(一)

    使用特性(Attribute)可以将描述程序集的信息和描述程序集中任何类型和成员的信息添加到程序集的元数据和IL代码中,程序可以在运行时通过反射获取到这些信息: 一.通过直接或间接的继承自抽象类Sys ...

  3. 电子证书 DER & PEM & CRT & CER

    原文链接: http://blog.csdn.net/zqt520/article/details/26966603 证书与编码 本至上,X.509证书是一个数字文档,这个文档根据RFC 5280来编 ...

  4. eoe移动开发者大会—移动开发者的极客盛宴 2013年9月14日期待您的加入!!

    2013 eoe移动开发者大会北京站即将盛大开启!      大会介绍 由国内最大的移动开发者社区eoe主办,在行业盟友的倾力支持下,集合了来自微软.Google.亚马逊.ARM等跨国公司业务负责人的 ...

  5. [svc]find+xargs/sed&sed后向引用+awk多匹配符+过滤行绝招总结&&产生随机数

    30天内的文件打包 find ./test_log -type f -mtime -30|xargs tar -cvf test_log.tar.gz find,文件+超过7天+超过1M的+按日期为文 ...

  6. leetcode笔记:3Sum Closest

    一.题目描写叙述 二.解题技巧 该题与3Sum的要求类似.不同的是要求选出的组合的和与目标值target最接近而不一定相等.但实际上,与3Sum的算法流程思路类似,先是进行排序.然后顺序选择数组A中的 ...

  7. linux source code search

    https://elixir.bootlin.com/linux/latest/source/fs/eventpoll.c#L1120

  8. [Big Data - ELK] ELK(ElasticSearch, Logstash, Kibana)搭建实时日志分析平台

    ELK平台介绍 在搜索ELK资料的时候,发现这篇文章比较好,于是摘抄一小段: 以下内容来自: http://baidu.blog.51cto.com/71938/1676798 日志主要包括系统日志. ...

  9. mxnet:背景介绍

    学习的过程 使用mxnet作为教程的深度学习库,重点介绍高层抽象包gluon 双轨学习法,既教授大家从零实现,也教授大家使用gluon实现模型:前者为了理解深度学习的底层设计,后者将大家从繁琐的模型设 ...

  10. JVM:基础

    参考 温绍景-Java虚拟机基础