Mercari Price Suggestion

  • 最近看到了一个竞赛,竞赛的内容是根据已知的商品的描述,品牌,品类,物品的状态等特征来预测商品的价格

  • 最后的评估标准为 平均算术平方根误差Root Mean Squared Logarithmic Error.

    \[\epsilon = \sqrt { \frac { 1 } { n } \sum _ { i = 1 } ^ { n } \left( \log \left( p _ { i } + 1 \right) - \log \left( a _ { i } + 1 \right) \right) ^ { 2 } }
    \]

  • 最后提交的文件为test_id ,price 包含两列数据,一列为测试数据中id,另一列为预测的价格

  • 训练集或者测试集中包括以下特征

    • train_id test_id 物品的编号,一个商品对应一个编号
    • name 名称
    • item_condition_id 物品状态
    • category_name 品类
    • brand_name 品牌
    • price 物品售出的价格,测试集中不包含此列,此列也为我们要预测的值
    • shipping 1 if shipping fee is paid by seller and 0 by buyer,也就是1代表包邮,0代表不包邮
    • item_description 物品的详细描述,描述中已经除去带有价格标签的值,已用[rm]代替
import pandas as pd
import numpy as np
df = pd.read_csv('input/train.tsv',sep='\t')

data information

df.head()

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
train_id name item_condition_id category_name brand_name price shipping item_description
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ...
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol...
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]...
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
train_id 1482535 non-null int64
name 1482535 non-null object
item_condition_id 1482535 non-null int64
category_name 1476208 non-null object
brand_name 849853 non-null object
price 1482535 non-null float64
shipping 1482535 non-null int64
item_description 1482531 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB

price distribution

df.price.describe()
count    1.482535e+06
mean 2.673752e+01
std 3.858607e+01
min 0.000000e+00
25% 1.000000e+01
50% 1.700000e+01
75% 2.900000e+01
max 2.009000e+03
Name: price, dtype: float64
import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)  #  要生成一行两列,这是第一个图plt.subplot('行','列','编号')
df.price.plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white', range = [0, 250])
plt.xlabel('price', fontsize=12)
plt.title('Price Distribution', fontsize=12)
plt.subplot(1, 2, 2)
np.log((df.price+1)).plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white')
plt.xlabel('log(price+1)', fontsize=12)
plt.title('log(Price+1) Distribution', fontsize=12)
Text(0.5, 1.0, 'log(Price+1) Distribution')

  • 价格特征为左偏态,需要将其转化为正太分布的数据,价格的分布主要集中在10-20左右,而最大的价格在2009,需要将其做对数转化,转化后,其对数分布为较为规则的正态分布

包邮对于价格影响

df['shipping'].value_counts(normalize=True)
0    0.552726
1 0.447274
Name: shipping, dtype: float64
  • 对于商家是否包邮,55%的商品不包邮,44.7%的商品包邮,需要看一下包邮是否对于价格影响
shipping_yes = df.loc[df['shipping'] == 1, 'price']  # 商家出运费
shipping_no = df.loc[df['shipping'] == 0, 'price'] # 买家出运费
fig,ax  = plt.subplots(figsize=(8,5))
ax.hist(shipping_yes,color='r',alpha=0.5,bins=30,range=[0,100],label='shipping_yes')
ax.hist(shipping_no,color='green',alpha=0.5,bins=30,range=[0,100],label=
'shipping_no')
plt.xlabel('price',fontsize=12)
plt.ylabel('frequency',fontsize=12)
plt.title('price_distribution by shipping method')
plt.tick_params(labelsize=12)
plt.legend()
plt.show()

print("不包邮平均的定价%s dollars" %(round(shipping_no.mean(),2)))
print("包邮平均的定价%s dollars" %(round(shipping_yes.mean(),2)))
不包邮平均的定价30.11 dollars
包邮平均的定价22.57 dollars
fig,ax  = plt.subplots(figsize=(8,5))
ax.hist(np.log(shipping_yes+1),color='r',alpha=0.5,bins=50,label='shipping_yes')
ax.hist(np.log(shipping_no+1),color='green',alpha=0.5,bins=50,label=
'shipping_no')
plt.xlabel('log(price+1)',fontsize=12)
plt.ylabel('frequency',fontsize=12)
plt.title('log(price+1)_distribution by shipping method')
plt.tick_params(labelsize=12)
plt.legend()
plt.show()

处理category 数据

"总共的数据有{}条记录".format(df.shape[0])

'总共的数据有1482535条记录'

  • 数据集中的name,cageory,brand,item_condition_id 都需要转化为category类型的数据
df['category_name'].value_counts()
# 总共有1287类型
Women/Athletic Apparel/Pants, Tights, Leggings                 60177
Women/Tops & Blouses/T-Shirts 46380
Beauty/Makeup/Face 34335
Beauty/Makeup/Lips 29910
Electronics/Video Games & Consoles/Games 26557
Beauty/Makeup/Eyes 25215
Electronics/Cell Phones & Accessories/Cases, Covers & Skins 24676
Women/Underwear/Bras 21274
Women/Tops & Blouses/Tank, Cami 20284
Women/Tops & Blouses/Blouse 20284
Women/Dresses/Above Knee, Mini 20082
Women/Jewelry/Necklaces 19758
Women/Athletic Apparel/Shorts 19528
Beauty/Makeup/Makeup Palettes 19103
Women/Shoes/Boots 18864
Beauty/Fragrance/Women 18628
Beauty/Skin Care/Face 15836
Women/Women's Handbags/Shoulder Bag 15328
Men/Tops/T-shirts 15108
Women/Dresses/Knee-Length 14770
Women/Athletic Apparel/Shirts & Tops 14738
Women/Shoes/Sandals 14662
Women/Jewelry/Bracelets 14497
Men/Shoes/Athletic 14257
Kids/Toys/Dolls & Accessories 13957
Women/Women's Accessories/Wallets 13616
Women/Jeans/Slim, Skinny 13392
Home/Home Décor/Home Décor Accents 13004
Women/Swimwear/Two-Piece 12758
Women/Shoes/Athletic 12662
...
Men/Suits/Four Button 1
Handmade/Bags and Purses/Other 1
Handmade/Dolls and Miniatures/Primitive 1
Handmade/Furniture/Fixture 1
Handmade/Housewares/Bathroom 1
Handmade/Woodworking/Sculptures 1
Men/Suits/One Button 1
Handmade/Geekery/Housewares 1
Kids/Safety/Crib Netting 1
Vintage & Collectibles/Furniture/Entertainment 1
Home/Furniture/Bathroom Furniture 1
Handmade/Glass/Vases 1
Handmade/Geekery/Videogame 1
Handmade/Woodworking/Sports 1
Handmade/Art/Aceo 1
Vintage & Collectibles/Paper Ephemera/Map 1
Handmade/Patterns/Painting 1
Handmade/Housewares/Cleaning 1
Home/Home Décor/Doorstops 1
Handmade/Accessories/Belt 1
Handmade/Patterns/Accessories 1
Vintage & Collectibles/Housewares/Towel 1
Other/Automotive/RV Parts & Accessories 1
Handmade/Paper Goods/Pad 1
Handmade/Accessories/Cozy 1
Kids/Diapering/Washcloths & Towels 1
Handmade/Pets/Blanket 1
Handmade/Needlecraft/Clothing 1
Handmade/Furniture/Shelf 1
Handmade/Quilts/Bed 1
Name: category_name, Length: 1287, dtype: int64

it_conditon_id vs price

  • 常见的箱型图 注释
import seaborn as sns

sns.boxplot(x = 'item_condition_id', y = np.log(df['price']+1), data = df, palette = sns.color_palette('RdBu',5))

<matplotlib.axes._subplots.AxesSubplot at 0x127d5bdd8>

  • 不同的物品状态对应的价格千差外别

竞赛杀器lightgbm

  • settings
NUM_BRANDS = 4000
NUM_CATEGORIES = 1000
NAME_MIN_DF =10
MAX_FEATURES_ITEM_DESCRIPTION =50000
"There are %d items that do not have a category name" % df['category_name'].isnull().sum()

'There are 6327 items that do not have a category name'

"There are %d items that do not have a brand name" % df['brand_name'].isnull().sum()

'There are 632682 items that do not have a brand name'

"There are %d items that do not have a item_description " % df['item_description'].isnull().sum()

'There are 4 items that do not have a item_description '

def handling_missing_inplace(datasets):
datasets['category_name'].fillna('missing',inplace=True)
datasets['brand_name'].fillna('missing',inplace=True)
datasets['item_description'].replace('No description yet,''missing', inplace=True) # 需要仔细看数据才能看到
datasets['item_description'].fillna(value='missing', inplace=True)
def cutting(datasets):
pop_brand = datasets['brand_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_BRANDS]
datasets.loc[~datasets['brand_name'].isin(pop_brand),'brand_name'] ='missing'
pop_category = datasets['category_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_CATEGORIES]
datasets.loc[~datasets['category_name'].isin(pop_category),'category_name'] ='missing'
def to_category(datasets):
datasets['category_name'] = datasets['category_name'].astype('category')
datasets['brand_name'] = datasets['brand_name'].astype('category')
datasets['item_condition_id'] = datasets['item_condition_id'].astype('category')
  • 查看价格的数量分布,发现竟然有价格为0的,所以需要去掉价格为0的数据
df['price'].value_counts().reset_index().sort_values(by='index').head()

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
index price
25 3.0 18703
28 4.0 16139
17 5.0 31502
261 5.5 33
16 6.0 32260
df=df[df['price']!=0].reset_index(drop=True)

df.head()

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
train_id name item_condition_id category_name brand_name price shipping item_description
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ...
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol...
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]...
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelBinarizer
import lightgbm as lgb
from scipy.sparse import csr_matrix, hstack # 解决稀疏矩阵
# referenc https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html
import gc
import time
from sklearn.linear_model import Ridge
def main():
start_time = time.time() train = pd.read_table('input/train.tsv', engine='c')
# train=train[train['price']!=0]
test = pd.read_table('input/test_stg2.tsv', engine='c')
print('[{}] Finished to load data'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape) nrow_train = train.shape[0]
y = np.log1p(train["price"])
merge: pd.DataFrame = pd.concat([train, test])
submission: pd.DataFrame = test[['test_id']] del train
del test
gc.collect() handling_missing_inplace(merge)
print('[{}] Finished to handle missing'.format(time.time() - start_time)) cutting(merge)
print('[{}] Finished to cut'.format(time.time() - start_time)) to_category(merge)
print('[{}] Finished to convert categorical'.format(time.time() - start_time)) cv = CountVectorizer(min_df=NAME_MIN_DF)
X_name = cv.fit_transform(merge['name'])
print('[{}] Finished count vectorize `name`'.format(time.time() - start_time)) cv = CountVectorizer()
X_category = cv.fit_transform(merge['category_name'])
print('[{}] Finished count vectorize `category_name`'.format(time.time() - start_time)) tv = TfidfVectorizer(max_features=MAX_FEATURES_ITEM_DESCRIPTION,
ngram_range=(1, 3),
stop_words='english')
X_description = tv.fit_transform(merge['item_description'])
print('[{}] Finished TFIDF vectorize `item_description`'.format(time.time() - start_time)) lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(merge['brand_name'])
print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time)) X_dummies = csr_matrix(pd.get_dummies(merge[['item_condition_id', 'shipping']],
sparse=True).values)
print('[{}] Finished to get dummies on `item_condition_id` and `shipping`'.format(time.time() - start_time)) sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()
print('[{}] Finished to create sparse merge'.format(time.time() - start_time)) X = sparse_merge[:nrow_train]
X_test = sparse_merge[nrow_train:] #train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.1, random_state = 144)
d_train = lgb.Dataset(X, label=y)
#d_valid = lgb.Dataset(valid_X, label=valid_y, max_bin=8192)
#watchlist = [d_train, d_valid] params = {
'learning_rate': 0.73,
'application': 'regression',
'max_depth': 3,
'num_leaves': 100,
'verbosity': -1,
'metric': 'RMSE',
} model = lgb.train(params, train_set=d_train, num_boost_round=3000, verbose_eval=100)
preds = 0.56*model.predict(X_test) model = Ridge(solver="sag", fit_intercept=True, random_state=42)
model.fit(X, y)
print('[{}] Finished to train ridge'.format(time.time() - start_time))
preds += 0.44*model.predict(X=X_test)
print('[{}] Finished to predict ridge'.format(time.time() - start_time)) submission['price'] = np.expm1(preds)
submission.loc[submission['price'] < 0.0, 'price'] = 0.0
submission.to_csv("sample_submission_stg2.csv", index=False)
if __name__ == '__main__':
main()

Mercari Price Suggestion in Kaggle的更多相关文章

  1. 使用Pandas: str.replace() 进行文本清洗

    前段时间参加了Kaggle上的Mercari Price Suggestion Challenge比赛,收获良多,过些时候准备进行一些全面的总结,本篇文章先谈一个比赛中用到的小技巧. 这个比赛数据中有 ...

  2. stacking method house price in kaggle top10%

    整合几部分代码的汇总 隐藏代码片段 导入python数据和可视化包 导入统计相关的工具 导入回归相关的算法 导入数据预处理相关的方法 导入模型调参相关的包 读取数据 特征工程 缺失值 类别特征处理-l ...

  3. kaggle house price

    kaggle 竞赛入门 导入常用的数据分析以及模型的库 数据处理 Data fields 去除异常值 处理缺失值 分析 Utilities Exploratory Data Analysis Corr ...

  4. Feature Preprocessing on Kaggle

    刚入手data science, 想着自己玩一玩kaggle,玩了新手Titanic和House Price的 项目, 觉得基本的baseline还是可以写出来,但是具体到一些细节,以至于到能拿到的出 ...

  5. how to update product listing price sale price and sale date using mobile App

    Greetings from Amazon Seller Support, Thank you for writing back to us. I have reviewed our previous ...

  6. kaggle 欺诈信用卡预测——不平衡训练样本的处理方法 综合结论就是:随机森林+过采样(直接复制或者smote后,黑白比例1:3 or 1:1)效果比较好!记得在smote前一定要先做标准化!!!其实随机森林对特征是否标准化无感,但是svm和LR就非常非常关键了

    先看数据: 特征如下: Time Number of seconds elapsed between each transaction (over two days) numeric V1 No de ...

  7. kaggle——分销商产品未来销售情况预测

    分销商产品未来销售情况预测 介绍 前面的几个实验中,都是根据提供的数据特征来构建模型,也就是说,数据集中会含有许多的特征列.本次将会介绍如何去处理另一种常见的数据,即时间序列数据.具体来说就是如何根据 ...

  8. Kaggle竞赛入门:决策树算法的Python实现

    本文翻译自kaggle learn,也就是kaggle官方最快入门kaggle竞赛的教程,强调python编程实践和数学思想(而没有涉及数学细节),笔者在不影响算法和程序理解的基础上删除了一些不必要的 ...

  9. Kaggle竞赛入门(二):如何验证机器学习模型

    本文翻译自kaggle learn,也就是kaggle官方最快入门kaggle竞赛的教程,强调python编程实践和数学思想(而没有涉及数学细节),笔者在不影响算法和程序理解的基础上删除了一些不必要的 ...

随机推荐

  1. python 练习题:将列表中的大写字母转换成小写

    将列表中的大写字母转换成小写如果list中既包含字符串,又包含整数,由于非字符串类型没有lower()方法,L1 = ['Hello', 'World', 18, 'Apple', None]请修改列 ...

  2. Message "'OFFSET' 附近有语法错误。\r\n在 FETCH 语句中选项 NEXT 的用法无效。" 解决办法 EntityFrameworkCore

    由于新版的EntityFrameworkCore默认使用的是SqlServer2012或以上版本的Sql语法分页,来提高性能. 所以使用数据库的版本如果低于2012(如Sqlserver2008)需要 ...

  3. PIE 插件式开发小笔记__PIESDK学习体会

    基于PIE.NET-SDK插件式二次开发文档笔记:  PIE 插件式开发配置文件: 它里面一行如下:      理解上一行'Item'关系->    library:为插件类名(程序集名称+后缀 ...

  4. ElasticSearch之安装及基本操作API

    ElasticSearch 是目前非常流行的搜索引擎,对海量数据搜索是非常友好,并且在高并发场景下,也能发挥出稳定,快速特点.也是大数据和索搜服务的开发人员所极力追捧的中间件.虽然 ElasticSe ...

  5. 关于spring中请求返回值的json序列化/反序列化问题

    https://cloud.tencent.com/developer/article/1381083 https://www.jianshu.com/p/db07543ffe0a 先留个坑

  6. 中国工业的下一个十年在哪里?APS系统或将引领智能化转型

    为什么众多的ERP软件公司没有推出相关产品,当然可以肯定的是并非客户没有此观念,如果一定要说,也只能说目前的需求还不是非常强烈,从ERP厂商非常急切的与APS公司合作,甚至有高价购买APS公司代码的情 ...

  7. Python之路(第四十五篇)线程Event事件、 条件Condition、定时器Timer、线程queue

    一.事件Event Event(事件):事件处理的机制:全局定义了一个内置标志Flag,如果Flag值为 False,那么当程序执行 event.wait方法时就会阻塞,如果Flag值为True,那么 ...

  8. TF-IDF算法介绍及实现

    目录 1.TF-IDF算法介绍 (1)TF是词频(Term Frequency) (2) IDF是逆向文件频率(Inverse Document Frequency) (3)TF-IDF实际上是:TF ...

  9. linux增加swap空间的方法小结

    起因及背景 近期编译AOSP(android 10.0)是总是遇到内存溢出,查了半天,无果.猜测增加下swap空间大小是否能解决,随即尝试下,果然是如此. 当然,还有其他作法,比如直接增加主机的内存( ...

  10. 一探istio-proxy(envoy)容器里的秘密

    今天,在测试环境跑通了istio. 惭愧,是用MicroK8s跑的,其它环境,不敢呀. 基本功能都OK了. 在运行了istio-injection=enabled之后,每个pod运行时,会多一个ist ...