Mercari Price Suggestion in Kaggle
Mercari Price Suggestion
最近看到了一个竞赛,竞赛的内容是根据已知的商品的描述,品牌,品类,物品的状态等特征来预测商品的价格
最后的评估标准为 平均算术平方根误差Root Mean Squared Logarithmic Error.
\[\epsilon = \sqrt { \frac { 1 } { n } \sum _ { i = 1 } ^ { n } \left( \log \left( p _ { i } + 1 \right) - \log \left( a _ { i } + 1 \right) \right) ^ { 2 } }
\]最后提交的文件为test_id ,price 包含两列数据,一列为测试数据中id,另一列为预测的价格
训练集或者测试集中包括以下特征
- train_id test_id 物品的编号,一个商品对应一个编号
- name 名称
- item_condition_id 物品状态
- category_name 品类
- brand_name 品牌
- price 物品售出的价格,测试集中不包含此列,此列也为我们要预测的值
- shipping 1 if shipping fee is paid by seller and 0 by buyer,也就是1代表包邮,0代表不包邮
- item_description 物品的详细描述,描述中已经除去带有价格标签的值,已用[rm]代替
import pandas as pd
import numpy as np
df = pd.read_csv('input/train.tsv',sep='\t')
data information
df.head()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
train_id 1482535 non-null int64
name 1482535 non-null object
item_condition_id 1482535 non-null int64
category_name 1476208 non-null object
brand_name 849853 non-null object
price 1482535 non-null float64
shipping 1482535 non-null int64
item_description 1482531 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB
price distribution
df.price.describe()
count 1.482535e+06
mean 2.673752e+01
std 3.858607e+01
min 0.000000e+00
25% 1.000000e+01
50% 1.700000e+01
75% 2.900000e+01
max 2.009000e+03
Name: price, dtype: float64
import matplotlib.pyplot as plt
plt.subplot(1, 2, 1) # 要生成一行两列,这是第一个图plt.subplot('行','列','编号')
df.price.plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white', range = [0, 250])
plt.xlabel('price', fontsize=12)
plt.title('Price Distribution', fontsize=12)
plt.subplot(1, 2, 2)
np.log((df.price+1)).plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white')
plt.xlabel('log(price+1)', fontsize=12)
plt.title('log(Price+1) Distribution', fontsize=12)
Text(0.5, 1.0, 'log(Price+1) Distribution')

- 价格特征为左偏态,需要将其转化为正太分布的数据,价格的分布主要集中在10-20左右,而最大的价格在2009,需要将其做对数转化,转化后,其对数分布为较为规则的正态分布
包邮对于价格影响
df['shipping'].value_counts(normalize=True)
0 0.552726
1 0.447274
Name: shipping, dtype: float64
- 对于商家是否包邮,55%的商品不包邮,44.7%的商品包邮,需要看一下包邮是否对于价格影响
shipping_yes = df.loc[df['shipping'] == 1, 'price'] # 商家出运费
shipping_no = df.loc[df['shipping'] == 0, 'price'] # 买家出运费
fig,ax = plt.subplots(figsize=(8,5))
ax.hist(shipping_yes,color='r',alpha=0.5,bins=30,range=[0,100],label='shipping_yes')
ax.hist(shipping_no,color='green',alpha=0.5,bins=30,range=[0,100],label=
'shipping_no')
plt.xlabel('price',fontsize=12)
plt.ylabel('frequency',fontsize=12)
plt.title('price_distribution by shipping method')
plt.tick_params(labelsize=12)
plt.legend()
plt.show()

print("不包邮平均的定价%s dollars" %(round(shipping_no.mean(),2)))
print("包邮平均的定价%s dollars" %(round(shipping_yes.mean(),2)))
不包邮平均的定价30.11 dollars
包邮平均的定价22.57 dollars
fig,ax = plt.subplots(figsize=(8,5))
ax.hist(np.log(shipping_yes+1),color='r',alpha=0.5,bins=50,label='shipping_yes')
ax.hist(np.log(shipping_no+1),color='green',alpha=0.5,bins=50,label=
'shipping_no')
plt.xlabel('log(price+1)',fontsize=12)
plt.ylabel('frequency',fontsize=12)
plt.title('log(price+1)_distribution by shipping method')
plt.tick_params(labelsize=12)
plt.legend()
plt.show()

处理category 数据
"总共的数据有{}条记录".format(df.shape[0])
'总共的数据有1482535条记录'
- 数据集中的name,cageory,brand,item_condition_id 都需要转化为category类型的数据
df['category_name'].value_counts()
# 总共有1287类型
Women/Athletic Apparel/Pants, Tights, Leggings 60177
Women/Tops & Blouses/T-Shirts 46380
Beauty/Makeup/Face 34335
Beauty/Makeup/Lips 29910
Electronics/Video Games & Consoles/Games 26557
Beauty/Makeup/Eyes 25215
Electronics/Cell Phones & Accessories/Cases, Covers & Skins 24676
Women/Underwear/Bras 21274
Women/Tops & Blouses/Tank, Cami 20284
Women/Tops & Blouses/Blouse 20284
Women/Dresses/Above Knee, Mini 20082
Women/Jewelry/Necklaces 19758
Women/Athletic Apparel/Shorts 19528
Beauty/Makeup/Makeup Palettes 19103
Women/Shoes/Boots 18864
Beauty/Fragrance/Women 18628
Beauty/Skin Care/Face 15836
Women/Women's Handbags/Shoulder Bag 15328
Men/Tops/T-shirts 15108
Women/Dresses/Knee-Length 14770
Women/Athletic Apparel/Shirts & Tops 14738
Women/Shoes/Sandals 14662
Women/Jewelry/Bracelets 14497
Men/Shoes/Athletic 14257
Kids/Toys/Dolls & Accessories 13957
Women/Women's Accessories/Wallets 13616
Women/Jeans/Slim, Skinny 13392
Home/Home Décor/Home Décor Accents 13004
Women/Swimwear/Two-Piece 12758
Women/Shoes/Athletic 12662
...
Men/Suits/Four Button 1
Handmade/Bags and Purses/Other 1
Handmade/Dolls and Miniatures/Primitive 1
Handmade/Furniture/Fixture 1
Handmade/Housewares/Bathroom 1
Handmade/Woodworking/Sculptures 1
Men/Suits/One Button 1
Handmade/Geekery/Housewares 1
Kids/Safety/Crib Netting 1
Vintage & Collectibles/Furniture/Entertainment 1
Home/Furniture/Bathroom Furniture 1
Handmade/Glass/Vases 1
Handmade/Geekery/Videogame 1
Handmade/Woodworking/Sports 1
Handmade/Art/Aceo 1
Vintage & Collectibles/Paper Ephemera/Map 1
Handmade/Patterns/Painting 1
Handmade/Housewares/Cleaning 1
Home/Home Décor/Doorstops 1
Handmade/Accessories/Belt 1
Handmade/Patterns/Accessories 1
Vintage & Collectibles/Housewares/Towel 1
Other/Automotive/RV Parts & Accessories 1
Handmade/Paper Goods/Pad 1
Handmade/Accessories/Cozy 1
Kids/Diapering/Washcloths & Towels 1
Handmade/Pets/Blanket 1
Handmade/Needlecraft/Clothing 1
Handmade/Furniture/Shelf 1
Handmade/Quilts/Bed 1
Name: category_name, Length: 1287, dtype: int64
it_conditon_id vs price
- 常见的箱型图 注释

import seaborn as sns
sns.boxplot(x = 'item_condition_id', y = np.log(df['price']+1), data = df, palette = sns.color_palette('RdBu',5))
<matplotlib.axes._subplots.AxesSubplot at 0x127d5bdd8>

- 不同的物品状态对应的价格千差外别
竞赛杀器lightgbm
- settings
NUM_BRANDS = 4000
NUM_CATEGORIES = 1000
NAME_MIN_DF =10
MAX_FEATURES_ITEM_DESCRIPTION =50000
"There are %d items that do not have a category name" % df['category_name'].isnull().sum()
'There are 6327 items that do not have a category name'
"There are %d items that do not have a brand name" % df['brand_name'].isnull().sum()
'There are 632682 items that do not have a brand name'
"There are %d items that do not have a item_description " % df['item_description'].isnull().sum()
'There are 4 items that do not have a item_description '
def handling_missing_inplace(datasets):
datasets['category_name'].fillna('missing',inplace=True)
datasets['brand_name'].fillna('missing',inplace=True)
datasets['item_description'].replace('No description yet,''missing', inplace=True) # 需要仔细看数据才能看到
datasets['item_description'].fillna(value='missing', inplace=True)
def cutting(datasets):
pop_brand = datasets['brand_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_BRANDS]
datasets.loc[~datasets['brand_name'].isin(pop_brand),'brand_name'] ='missing'
pop_category = datasets['category_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_CATEGORIES]
datasets.loc[~datasets['category_name'].isin(pop_category),'category_name'] ='missing'
def to_category(datasets):
datasets['category_name'] = datasets['category_name'].astype('category')
datasets['brand_name'] = datasets['brand_name'].astype('category')
datasets['item_condition_id'] = datasets['item_condition_id'].astype('category')
- 查看价格的数量分布,发现竟然有价格为0的,所以需要去掉价格为0的数据
df['price'].value_counts().reset_index().sort_values(by='index').head()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| index | price | |
|---|---|---|
| 25 | 3.0 | 18703 |
| 28 | 4.0 | 16139 |
| 17 | 5.0 | 31502 |
| 261 | 5.5 | 33 |
| 16 | 6.0 | 32260 |
df=df[df['price']!=0].reset_index(drop=True)
df.head()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity |
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelBinarizer
import lightgbm as lgb
from scipy.sparse import csr_matrix, hstack # 解决稀疏矩阵
# referenc https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html
import gc
import time
from sklearn.linear_model import Ridge
def main():
start_time = time.time()
train = pd.read_table('input/train.tsv', engine='c')
# train=train[train['price']!=0]
test = pd.read_table('input/test_stg2.tsv', engine='c')
print('[{}] Finished to load data'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)
nrow_train = train.shape[0]
y = np.log1p(train["price"])
merge: pd.DataFrame = pd.concat([train, test])
submission: pd.DataFrame = test[['test_id']]
del train
del test
gc.collect()
handling_missing_inplace(merge)
print('[{}] Finished to handle missing'.format(time.time() - start_time))
cutting(merge)
print('[{}] Finished to cut'.format(time.time() - start_time))
to_category(merge)
print('[{}] Finished to convert categorical'.format(time.time() - start_time))
cv = CountVectorizer(min_df=NAME_MIN_DF)
X_name = cv.fit_transform(merge['name'])
print('[{}] Finished count vectorize `name`'.format(time.time() - start_time))
cv = CountVectorizer()
X_category = cv.fit_transform(merge['category_name'])
print('[{}] Finished count vectorize `category_name`'.format(time.time() - start_time))
tv = TfidfVectorizer(max_features=MAX_FEATURES_ITEM_DESCRIPTION,
ngram_range=(1, 3),
stop_words='english')
X_description = tv.fit_transform(merge['item_description'])
print('[{}] Finished TFIDF vectorize `item_description`'.format(time.time() - start_time))
lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(merge['brand_name'])
print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))
X_dummies = csr_matrix(pd.get_dummies(merge[['item_condition_id', 'shipping']],
sparse=True).values)
print('[{}] Finished to get dummies on `item_condition_id` and `shipping`'.format(time.time() - start_time))
sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()
print('[{}] Finished to create sparse merge'.format(time.time() - start_time))
X = sparse_merge[:nrow_train]
X_test = sparse_merge[nrow_train:]
#train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.1, random_state = 144)
d_train = lgb.Dataset(X, label=y)
#d_valid = lgb.Dataset(valid_X, label=valid_y, max_bin=8192)
#watchlist = [d_train, d_valid]
params = {
'learning_rate': 0.73,
'application': 'regression',
'max_depth': 3,
'num_leaves': 100,
'verbosity': -1,
'metric': 'RMSE',
}
model = lgb.train(params, train_set=d_train, num_boost_round=3000, verbose_eval=100)
preds = 0.56*model.predict(X_test)
model = Ridge(solver="sag", fit_intercept=True, random_state=42)
model.fit(X, y)
print('[{}] Finished to train ridge'.format(time.time() - start_time))
preds += 0.44*model.predict(X=X_test)
print('[{}] Finished to predict ridge'.format(time.time() - start_time))
submission['price'] = np.expm1(preds)
submission.loc[submission['price'] < 0.0, 'price'] = 0.0
submission.to_csv("sample_submission_stg2.csv", index=False)
if __name__ == '__main__':
main()
Mercari Price Suggestion in Kaggle的更多相关文章
- 使用Pandas: str.replace() 进行文本清洗
前段时间参加了Kaggle上的Mercari Price Suggestion Challenge比赛,收获良多,过些时候准备进行一些全面的总结,本篇文章先谈一个比赛中用到的小技巧. 这个比赛数据中有 ...
- stacking method house price in kaggle top10%
整合几部分代码的汇总 隐藏代码片段 导入python数据和可视化包 导入统计相关的工具 导入回归相关的算法 导入数据预处理相关的方法 导入模型调参相关的包 读取数据 特征工程 缺失值 类别特征处理-l ...
- kaggle house price
kaggle 竞赛入门 导入常用的数据分析以及模型的库 数据处理 Data fields 去除异常值 处理缺失值 分析 Utilities Exploratory Data Analysis Corr ...
- Feature Preprocessing on Kaggle
刚入手data science, 想着自己玩一玩kaggle,玩了新手Titanic和House Price的 项目, 觉得基本的baseline还是可以写出来,但是具体到一些细节,以至于到能拿到的出 ...
- how to update product listing price sale price and sale date using mobile App
Greetings from Amazon Seller Support, Thank you for writing back to us. I have reviewed our previous ...
- kaggle 欺诈信用卡预测——不平衡训练样本的处理方法 综合结论就是:随机森林+过采样(直接复制或者smote后,黑白比例1:3 or 1:1)效果比较好!记得在smote前一定要先做标准化!!!其实随机森林对特征是否标准化无感,但是svm和LR就非常非常关键了
先看数据: 特征如下: Time Number of seconds elapsed between each transaction (over two days) numeric V1 No de ...
- kaggle——分销商产品未来销售情况预测
分销商产品未来销售情况预测 介绍 前面的几个实验中,都是根据提供的数据特征来构建模型,也就是说,数据集中会含有许多的特征列.本次将会介绍如何去处理另一种常见的数据,即时间序列数据.具体来说就是如何根据 ...
- Kaggle竞赛入门:决策树算法的Python实现
本文翻译自kaggle learn,也就是kaggle官方最快入门kaggle竞赛的教程,强调python编程实践和数学思想(而没有涉及数学细节),笔者在不影响算法和程序理解的基础上删除了一些不必要的 ...
- Kaggle竞赛入门(二):如何验证机器学习模型
本文翻译自kaggle learn,也就是kaggle官方最快入门kaggle竞赛的教程,强调python编程实践和数学思想(而没有涉及数学细节),笔者在不影响算法和程序理解的基础上删除了一些不必要的 ...
随机推荐
- python 练习题:将列表中的大写字母转换成小写
将列表中的大写字母转换成小写如果list中既包含字符串,又包含整数,由于非字符串类型没有lower()方法,L1 = ['Hello', 'World', 18, 'Apple', None]请修改列 ...
- Message "'OFFSET' 附近有语法错误。\r\n在 FETCH 语句中选项 NEXT 的用法无效。" 解决办法 EntityFrameworkCore
由于新版的EntityFrameworkCore默认使用的是SqlServer2012或以上版本的Sql语法分页,来提高性能. 所以使用数据库的版本如果低于2012(如Sqlserver2008)需要 ...
- PIE 插件式开发小笔记__PIESDK学习体会
基于PIE.NET-SDK插件式二次开发文档笔记: PIE 插件式开发配置文件: 它里面一行如下: 理解上一行'Item'关系-> library:为插件类名(程序集名称+后缀 ...
- ElasticSearch之安装及基本操作API
ElasticSearch 是目前非常流行的搜索引擎,对海量数据搜索是非常友好,并且在高并发场景下,也能发挥出稳定,快速特点.也是大数据和索搜服务的开发人员所极力追捧的中间件.虽然 ElasticSe ...
- 关于spring中请求返回值的json序列化/反序列化问题
https://cloud.tencent.com/developer/article/1381083 https://www.jianshu.com/p/db07543ffe0a 先留个坑
- 中国工业的下一个十年在哪里?APS系统或将引领智能化转型
为什么众多的ERP软件公司没有推出相关产品,当然可以肯定的是并非客户没有此观念,如果一定要说,也只能说目前的需求还不是非常强烈,从ERP厂商非常急切的与APS公司合作,甚至有高价购买APS公司代码的情 ...
- Python之路(第四十五篇)线程Event事件、 条件Condition、定时器Timer、线程queue
一.事件Event Event(事件):事件处理的机制:全局定义了一个内置标志Flag,如果Flag值为 False,那么当程序执行 event.wait方法时就会阻塞,如果Flag值为True,那么 ...
- TF-IDF算法介绍及实现
目录 1.TF-IDF算法介绍 (1)TF是词频(Term Frequency) (2) IDF是逆向文件频率(Inverse Document Frequency) (3)TF-IDF实际上是:TF ...
- linux增加swap空间的方法小结
起因及背景 近期编译AOSP(android 10.0)是总是遇到内存溢出,查了半天,无果.猜测增加下swap空间大小是否能解决,随即尝试下,果然是如此. 当然,还有其他作法,比如直接增加主机的内存( ...
- 一探istio-proxy(envoy)容器里的秘密
今天,在测试环境跑通了istio. 惭愧,是用MicroK8s跑的,其它环境,不敢呀. 基本功能都OK了. 在运行了istio-injection=enabled之后,每个pod运行时,会多一个ist ...