Analysis of Kaggle's Outbrain Click Prediction Competition

https://yq.aliyun.com/articles/293596

https://www.kaggle.com/c/outbrain-click-prediction

https://www.kaggle.com/anokas/outbrain-eda

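Personalized click-through rate (CTR) prediction for users

Basic scenario: a user (uuid) views a document (document_id) on which a set of ads (ad_id) is displayed, and the task is to predict which of those ads the user clicks.

document_id (document)　　uuid (user)　　ad_id (a set of ads)

Raw data: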

page_views.csv: the log of users visiting documents

uuid
document_id
timestamp (ms since 1970-01-01, minus 1465876799998; add 1465876799998 to recover the Unix epoch time in ms)
platform (desktop = 1, mobile = 2, tablet = 3)
geo_location (country>state>DMA)
traffic_source (internal = 1, search = 2, social = 3)
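The timestamp field is stored as an offset, so recovering a wall-clock time means adding the constant back. A minimal sketch, assuming the constant 1465876799998 from the field description is in milliseconds; the sample rows and the names `TIMESTAMP_OFFSET_MS` and `sample` are made up for illustration:

```python
import pandas as pd

# Offset taken from the field description above (assumed to be milliseconds).
TIMESTAMP_OFFSET_MS = 1465876799998

# Hypothetical sample rows; real values would come from page_views.csv.
sample = pd.DataFrame({"timestamp": [0, 61, 86_400_000]})

# Add the offset back, then interpret the result as Unix epoch milliseconds (UTC).
sample["datetime"] = pd.to_datetime(
    sample["timestamp"] + TIMESTAMP_OFFSET_MS, unit="ms", utc=True
)

print(sample)
```

The same conversion applies to the timestamp column in events.csv, since both files use the same field description.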

clicks_train.csv:

display_id
ad_id
clicked (1 if clicked, 0 otherwise)

events.csv: (information on the display_id context)

display_id
uuid
document_id
timestamp
platform
geo_location
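Because geo_location packs up to three levels into one string (country>state>DMA), it is usually easier to analyze after splitting. A minimal sketch on invented sample values; the column names `country`, `state`, and `dma` are chosen here for illustration and are not part of the dataset:

```python
import pandas as pd

# Hypothetical sample values in the country>state>DMA format.
events_sample = pd.DataFrame({"geo_location": ["US>CA>807", "GB>H9", "CA", None]})

# Split on '>' into at most three parts; levels that are absent stay missing.
geo_parts = events_sample["geo_location"].str.split(">", n=2, expand=True)
# Guarantee exactly three columns even if no row has all three levels.
geo_parts = geo_parts.reindex(columns=range(3))
geo_parts.columns = ["country", "state", "dma"]

events_sample = events_sample.join(geo_parts)
print(events_sample)
```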

promoted_content.csv: details on the ads.

ad_id
document_id
campaign_id
advertiser_id

documents_meta.csv: details on the documents.

document_id
source_id (the part of the site on which the document is displayed, e.g. edition.cnn.com)
publisher_id
publish_time

documents_topics.csv, documents_entities.csv, and documents_categories.csv all provide information about the content in a document, as well as Outbrain's confidence in each respective relationship.
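The files described above are linked through shared keys: display_id connects clicks to events, and ad_id connects clicks to promoted_content. A minimal sketch of those joins on invented miniature tables; it assumes that document_id in events is the page where the ads were shown and that document_id in promoted_content is the document the ad promotes, and the renamed columns are hypothetical names:

```python
import pandas as pd

# Hypothetical miniature versions of three tables (all values are invented).
clicks = pd.DataFrame({
    "display_id": [1, 1, 2, 2],
    "ad_id": [10, 11, 10, 12],
    "clicked": [0, 1, 1, 0],
})
events = pd.DataFrame({
    "display_id": [1, 2],
    "uuid": ["u1", "u2"],
    "document_id": [100, 101],
    "platform": [1, 2],
})
promoted = pd.DataFrame({
    "ad_id": [10, 11, 12],
    "document_id": [900, 901, 902],
    "campaign_id": [5, 5, 6],
    "advertiser_id": [50, 50, 60],
})

# Both side tables carry a document_id column, so rename them before merging
# to keep the display page and the promoted document apart (assumed meaning).
joined = (
    clicks
    .merge(events.rename(columns={"document_id": "display_document_id"}),
           on="display_id", how="left")
    .merge(promoted.rename(columns={"document_id": "ad_document_id"}),
           on="ad_id", how="left")
)
print(joined)
```

Left joins keep one row per (display_id, ad_id) pair, which is the granularity of clicks_train.csv.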

Data analysis:

import numpy as np # used by np.sum and np.size below

import pandas as pd

import os

import gc # We're gonna be clearing memory a lot

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

df_train = pd.read_csv('./outbrain-click-prediction/clicks_train.csv')

df_test = pd.read_csv('./outbrain-click-prediction/clicks_test.csv')

# Distribution of the number of ads per display

size_train = df_train.groupby('display_id')['ad_id'].count().value_counts()

size_train = size_train / np.sum(size_train)

Bar chart:

plt.figure(figsize=(12,4))

p = sns.color_palette()

sns.barplot(x=size_train.index, y=size_train.values, alpha=0.8, color=p[0], label='train') # x and y passed by keyword, as recent seaborn versions require

plt.legend()

plt.xlabel('Number of Ads in display', fontsize=12)

plt.ylabel('Proportion of set', fontsize=12)

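Count how many times each ad appears:

# Either line works; the first returns a Series, the second a DataFrame with the count repeated in every remaining column

df_train.groupby('ad_id')['ad_id'].count()

df_train.groupby('ad_id').agg(np.size)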
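The per-ad counts can be checked on a small example. A minimal sketch with an invented toy frame, comparing the groupby count with two other standard pandas idioms (`groupby(...).size()` and `value_counts()`); the variable names are hypothetical:

```python
import pandas as pd

# Hypothetical toy version of clicks_train.csv (values are invented).
toy = pd.DataFrame({
    "display_id": [1, 1, 2, 2, 3],
    "ad_id": [10, 11, 10, 12, 10],
    "clicked": [1, 0, 0, 1, 0],
})

# Three equivalent ways to count rows per ad_id.
by_count = toy.groupby("ad_id")["ad_id"].count()
by_size = toy.groupby("ad_id").size()
by_value_counts = toy["ad_id"].value_counts().sort_index()

print(by_size)
```

`value_counts()` sorts by frequency by default, so `sort_index()` is applied here only to make the three results line up.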

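Measure the overlap of ads between the training and test sets (the fraction of distinct test ads that also appear in training):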

len(set(df_test.ad_id.unique()).intersection(df_train.ad_id.unique())) / len(df_test.ad_id.unique())
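The overlap expression can be wrapped in a small helper so it can be reused for other ID columns. A minimal sketch; `overlap_ratio` is a hypothetical helper name and the two Series hold invented IDs:

```python
import pandas as pd

def overlap_ratio(test_ids: pd.Series, train_ids: pd.Series) -> float:
    """Fraction of distinct IDs in test_ids that also occur in train_ids."""
    test_unique = set(test_ids.unique())
    if not test_unique:
        return 0.0  # avoid division by zero on an empty test set
    return len(test_unique & set(train_ids.unique())) / len(test_unique)

# Invented IDs: test has {11, 12, 13, 14}, train has {10, 11, 12}.
train_ads = pd.Series([10, 11, 12, 12])
test_ads = pd.Series([11, 12, 13, 13, 14])

ratio = overlap_ratio(test_ads, train_ads)
print(ratio)  # 0.5
```

A ratio well below 1 would mean many test ads have no click history in training, which matters for any feature built from per-ad statistics.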

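Statistics on events.csv:

# events was not loaded earlier; this path assumes the same directory as the other files

events = pd.read_csv('./outbrain-click-prediction/events.csv')

print(events.columns.to_list())

print(events.head())

print(events.platform.value_counts())

# Cast platform to str so that values pandas parsed with mixed types (e.g. 1 vs '1') are counted together

events.platform = events.platform.astype(str)

print(events.platform.value_counts())

print(events.groupby('uuid')['uuid'].count().sort_values()) # Count how many times each user appears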
