Mercari Price Suggestion in Kaggle

Mercari Price Suggestion

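I recently came across a Kaggle competition in which the task is to predict an item's price from its description, brand, category, condition and other features.
Submissions are scored by Root Mean Squared Logarithmic Error (RMSLE), where p_i is the predicted price, a_i is the actual price and n is the number of items: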

\[\epsilon = \sqrt { \frac { 1 } { n } \sum _ { i = 1 } ^ { n } \left( \log \left( p _ { i } + 1 \right) - \log \left( a _ { i } + 1 \right) \right) ^ { 2 } }
\]
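As a rough sketch, the metric can be computed with NumPy along these lines. The function name `rmsle` and the toy values are assumptions made for illustration; they are not part of the competition kit.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two arrays of prices."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # log1p(x) = log(x + 1), matching the log(p + 1) and log(a + 1) terms
    diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy values for illustration only, not taken from the competition data
perfect = rmsle([10.0, 52.0, 35.0], [10.0, 52.0, 35.0])
off_by_one_log_unit = rmsle([np.e - 1.0], [0.0])
print(perfect, off_by_one_log_unit)
```

Because the error is measured on log(price + 1), training a model on `np.log1p(price)` with an ordinary RMSE objective optimizes the same quantity.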
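The submission file has two columns, test_id and price: the id from the test data and the predicted price.
The training and test sets contain the following fields:
- train_id / test_id: the listing id; each item has exactly one id
- name: the listing title
- item_condition_id: the condition of the item
- category_name: the category
- brand_name: the brand
- price: the price the item sold for; this is the prediction target and is absent from the test set
- shipping: 1 if the shipping fee is paid by the seller, 0 if it is paid by the buyer
- item_description: the full description; price-like values in the text have been removed and replaced with [rm]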

import pandas as pd

import numpy as np

df = pd.read_csv('input/train.tsv',sep='\t')

data information

df.head()


	train_id	name	item_condition_id	category_name	brand_name	price	shipping	item_description
0	0	MLB Cincinnati Reds T Shirt Size XL	3	Men/Tops/T-shirts	NaN	10.0	1	No description yet
1	1	Razer BlackWidow Chroma Keyboard	3	Electronics/Computers & Tablets/Components & P...	Razer	52.0	0	This keyboard is in great condition and works ...
2	2	AVA-VIV Blouse	1	Women/Tops & Blouses/Blouse	Target	10.0	1	Adorable top with a hint of lace and a key hol...
3	3	Leather Horse Statues	1	Home/Home Décor/Home Décor Accents	NaN	35.0	1	New with tags. Leather horses. Retail for [rm]...
4	4	24K GOLD plated rose	1	Women/Jewelry/Necklaces	NaN	44.0	0	Complete with certificate of authenticity

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 1482535 entries, 0 to 1482534

Data columns (total 8 columns):

train_id             1482535 non-null int64

name                 1482535 non-null object

item_condition_id    1482535 non-null int64

category_name        1476208 non-null object

brand_name           849853 non-null object

price                1482535 non-null float64

shipping             1482535 non-null int64

item_description     1482531 non-null object

dtypes: float64(1), int64(3), object(4)

memory usage: 90.5+ MB

price distribution

df.price.describe()

count    1.482535e+06

mean     2.673752e+01

std      3.858607e+01

min      0.000000e+00

25%      1.000000e+01

50%      1.700000e+01

75%      2.900000e+01

max      2.009000e+03

Name: price, dtype: float64

import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)  # one row, two columns; this is the first panel: plt.subplot(nrows, ncols, index)

df.price.plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white', range = [0, 250])

plt.xlabel('price', fontsize=12)

plt.title('Price Distribution', fontsize=12)

plt.subplot(1, 2, 2)

np.log((df.price+1)).plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white')

plt.xlabel('log(price+1)', fontsize=12)

plt.title('log(Price+1) Distribution', fontsize=12)

Text(0.5, 1.0, 'log(Price+1) Distribution')

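The price distribution is right-skewed (long tail toward high prices): most prices cluster around 10-20, while the maximum is 2009. Applying a log transform, log(price + 1), gives a distribution that is much closer to normal, so the model is trained on the transformed price.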
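To make the effect of the transform concrete, here is a small sketch that compares skewness before and after `log1p`. It uses synthetic log-normal "prices" rather than the Mercari data, and the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal); NOT the Mercari data
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.7, size=10_000))

raw_skew = prices.skew()                # strongly positive for a long right tail
log_skew = np.log1p(prices).skew()      # close to 0 after the transform
print(f"skew(price) = {raw_skew:.2f}, skew(log1p(price)) = {log_skew:.2f}")
```

A skewness near zero after the transform is what the second histogram shows visually.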

Effect of shipping on price

df['shipping'].value_counts(normalize=True)

0    0.552726

1    0.447274

Name: shipping, dtype: float64

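About 55% of listings have the buyer paying for shipping and 44.7% have the seller paying, so it is worth checking whether shipping affects the price.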

shipping_yes = df.loc[df['shipping'] == 1, 'price']  # seller pays shipping

shipping_no = df.loc[df['shipping'] == 0, 'price']  # buyer pays shipping

fig,ax  = plt.subplots(figsize=(8,5))

ax.hist(shipping_yes,color='r',alpha=0.5,bins=30,range=[0,100],label='shipping_yes')

ax.hist(shipping_no,color='green',alpha=0.5,bins=30,range=[0,100],label=

       'shipping_no')

plt.xlabel('price',fontsize=12)

plt.ylabel('frequency',fontsize=12)

plt.title('price_distribution by shipping method')

plt.tick_params(labelsize=12)

plt.legend()

plt.show()

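print("Mean price when the buyer pays shipping: %s dollars" % round(shipping_no.mean(), 2))

print("Mean price when the seller pays shipping: %s dollars" % round(shipping_yes.mean(), 2))

Mean price when the buyer pays shipping: 30.11 dollars

Mean price when the seller pays shipping: 22.57 dollars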

fig,ax  = plt.subplots(figsize=(8,5))

ax.hist(np.log(shipping_yes+1),color='r',alpha=0.5,bins=50,label='shipping_yes')

ax.hist(np.log(shipping_no+1),color='green',alpha=0.5,bins=50,label=

       'shipping_no')

plt.xlabel('log(price+1)',fontsize=12)

plt.ylabel('frequency',fontsize=12)

plt.title('log(price+1)_distribution by shipping method')

plt.tick_params(labelsize=12)

plt.legend()

plt.show()
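The comparison by shipping method can also be summarized in a single table with `groupby`. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up); on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "shipping": [0, 0, 0, 1, 1, 1],
    "price":    [30.0, 25.0, 40.0, 12.0, 20.0, 22.0],
})

# One row per shipping value: 0 = buyer pays, 1 = seller pays
summary = toy.groupby("shipping")["price"].agg(["mean", "median", "count"])
print(summary)
```

Reporting the median next to the mean is useful here because the price distribution has a long right tail.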

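Handling categorical data

"The dataset has {} records in total".format(df.shape[0])

'The dataset has 1482535 records in total'

The name, category_name, brand_name and item_condition_id columns all need to be encoded as categorical features.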

df['category_name'].value_counts()

# 1287 distinct categories in total

Women/Athletic Apparel/Pants, Tights, Leggings                 60177

Women/Tops & Blouses/T-Shirts                                  46380

Beauty/Makeup/Face                                             34335

Beauty/Makeup/Lips                                             29910

Electronics/Video Games & Consoles/Games                       26557

Beauty/Makeup/Eyes                                             25215

Electronics/Cell Phones & Accessories/Cases, Covers & Skins    24676

Women/Underwear/Bras                                           21274

Women/Tops & Blouses/Tank, Cami                                20284

Women/Tops & Blouses/Blouse                                    20284

Women/Dresses/Above Knee, Mini                                 20082

Women/Jewelry/Necklaces                                        19758

Women/Athletic Apparel/Shorts                                  19528

Beauty/Makeup/Makeup Palettes                                  19103

Women/Shoes/Boots                                              18864

Beauty/Fragrance/Women                                         18628

Beauty/Skin Care/Face                                          15836

Women/Women's Handbags/Shoulder Bag                            15328

Men/Tops/T-shirts                                              15108

Women/Dresses/Knee-Length                                      14770

Women/Athletic Apparel/Shirts & Tops                           14738

Women/Shoes/Sandals                                            14662

Women/Jewelry/Bracelets                                        14497

Men/Shoes/Athletic                                             14257

Kids/Toys/Dolls & Accessories                                  13957

Women/Women's Accessories/Wallets                              13616

Women/Jeans/Slim, Skinny                                       13392

Home/Home Décor/Home Décor Accents                             13004

Women/Swimwear/Two-Piece                                       12758

Women/Shoes/Athletic                                           12662

                                                               ...

Men/Suits/Four Button                                              1

Handmade/Bags and Purses/Other                                     1

Handmade/Dolls and Miniatures/Primitive                            1

Handmade/Furniture/Fixture                                         1

Handmade/Housewares/Bathroom                                       1

Handmade/Woodworking/Sculptures                                    1

Men/Suits/One Button                                               1

Handmade/Geekery/Housewares                                        1

Kids/Safety/Crib Netting                                           1

Vintage & Collectibles/Furniture/Entertainment                     1

Home/Furniture/Bathroom Furniture                                  1

Handmade/Glass/Vases                                               1

Handmade/Geekery/Videogame                                         1

Handmade/Woodworking/Sports                                        1

Handmade/Art/Aceo                                                  1

Vintage & Collectibles/Paper Ephemera/Map                          1

Handmade/Patterns/Painting                                         1

Handmade/Housewares/Cleaning                                       1

Home/Home Décor/Doorstops                                          1

Handmade/Accessories/Belt                                          1

Handmade/Patterns/Accessories                                      1

Vintage & Collectibles/Housewares/Towel                            1

Other/Automotive/RV Parts & Accessories                            1

Handmade/Paper Goods/Pad                                           1

Handmade/Accessories/Cozy                                          1

Kids/Diapering/Washcloths & Towels                                 1

Handmade/Pets/Blanket                                              1

Handmade/Needlecraft/Clothing                                      1

Handmade/Furniture/Shelf                                           1

Handmade/Quilts/Bed                                                1

Name: category_name, Length: 1287, dtype: int64

item_condition_id vs price

A standard box plot of log(price + 1) for each item condition:

import seaborn as sns

sns.boxplot(x = 'item_condition_id', y = np.log(df['price']+1), data = df, palette = sns.color_palette('RdBu',5))

<matplotlib.axes._subplots.AxesSubplot at 0x127d5bdd8>

Prices vary considerably across item conditions.

LightGBM, the competition workhorse

settings

NUM_BRANDS = 4000

NUM_CATEGORIES = 1000

NAME_MIN_DF =10

MAX_FEATURES_ITEM_DESCRIPTION =50000

"There are %d items that do not have a category name" % df['category_name'].isnull().sum()

'There are 6327 items that do not have a category name'

"There are %d items that do not have a brand name" % df['brand_name'].isnull().sum()

'There are 632682 items that do not have a brand name'

"There are %d items that do not have a item_description " % df['item_description'].isnull().sum()

'There are 4 items that do not have a item_description '

def handling_missing_inplace(datasets):

    datasets['category_name'] = datasets['category_name'].fillna('missing')

    datasets['brand_name'] = datasets['brand_name'].fillna('missing')

    # 'No description yet' is a placeholder for an empty description; easy to miss without inspecting the data

    datasets['item_description'] = datasets['item_description'].replace('No description yet', 'missing')

    datasets['item_description'] = datasets['item_description'].fillna('missing')

def cutting(datasets):

    pop_brand = datasets['brand_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_BRANDS]

    datasets.loc[~datasets['brand_name'].isin(pop_brand),'brand_name'] ='missing'

    pop_category = datasets['category_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_CATEGORIES]

    datasets.loc[~datasets['category_name'].isin(pop_category),'category_name'] ='missing'

def to_category(datasets):

    datasets['category_name'] = datasets['category_name'].astype('category')

    datasets['brand_name'] = datasets['brand_name'].astype('category')

    datasets['item_condition_id'] = datasets['item_condition_id'].astype('category')
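The idea behind the truncation step is to keep only the most frequent values and fold the long tail into a single bucket before converting to a categorical dtype. A minimal sketch on toy data, assuming a hypothetical helper `keep_top_n` that is not part of the original code:

```python
import pandas as pd

def keep_top_n(series, n, fill="missing"):
    """Keep the n most frequent non-`fill` values; map everything else to `fill`."""
    counts = series[series != fill].value_counts()   # sorted by frequency, descending
    top = counts.index[:n]
    return series.where(series.isin(top), fill)

# Toy brand column for illustration only
brands = pd.Series(["Nike", "Nike", "Nike", "Razer", "Razer", "Target", "missing"])
cut = keep_top_n(brands, n=2)
as_category = cut.astype("category")
print(cut.tolist())
print(list(as_category.cat.categories))
```

Capping the number of distinct values keeps the one-hot brand and category matrices from growing with every rare label, at the cost of treating all rare labels alike.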

Looking at the counts per price shows that some listings have a price of 0, so those rows need to be removed.

df['price'].value_counts().reset_index().sort_values(by='index').head()


	index	price
25	3.0	18703
28	4.0	16139
17	5.0	31502
261	5.5	33
16	6.0	32260

df=df[df['price']!=0].reset_index(drop=True)

df.head()


	train_id	name	item_condition_id	category_name	brand_name	price	shipping	item_description
0	0	MLB Cincinnati Reds T Shirt Size XL	3	Men/Tops/T-shirts	NaN	10.0	1	No description yet
1	1	Razer BlackWidow Chroma Keyboard	3	Electronics/Computers & Tablets/Components & P...	Razer	52.0	0	This keyboard is in great condition and works ...
2	2	AVA-VIV Blouse	1	Women/Tops & Blouses/Blouse	Target	10.0	1	Adorable top with a hint of lace and a key hol...
3	3	Leather Horse Statues	1	Home/Home Décor/Home Décor Accents	NaN	35.0	1	New with tags. Leather horses. Retail for [rm]...
4	4	24K GOLD plated rose	1	Women/Jewelry/Necklaces	NaN	44.0	0	Complete with certificate of authenticity

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.preprocessing import LabelBinarizer

import lightgbm as lgb

from scipy.sparse import csr_matrix, hstack  # sparse matrix utilities

# reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html

import gc

import time

from sklearn.linear_model import Ridge

def main():

    start_time = time.time()

    train = pd.read_table('input/train.tsv', engine='c')

    # train=train[train['price']!=0]

    test = pd.read_table('input/test_stg2.tsv', engine='c')

    print('[{}] Finished to load data'.format(time.time() - start_time))

    print('Train shape: ', train.shape)

    print('Test shape: ', test.shape)

    nrow_train = train.shape[0]

    y = np.log1p(train["price"])

    merge: pd.DataFrame = pd.concat([train, test])

    submission: pd.DataFrame = test[['test_id']].copy()  # copy, so adding 'price' later does not write to a slice of test

    del train

    del test

    gc.collect()

    handling_missing_inplace(merge)

    print('[{}] Finished to handle missing'.format(time.time() - start_time))

    cutting(merge)

    print('[{}] Finished to cut'.format(time.time() - start_time))

    to_category(merge)

    print('[{}] Finished to convert categorical'.format(time.time() - start_time))

    cv = CountVectorizer(min_df=NAME_MIN_DF)

    X_name = cv.fit_transform(merge['name'])

    print('[{}] Finished count vectorize `name`'.format(time.time() - start_time))

    cv = CountVectorizer()

    X_category = cv.fit_transform(merge['category_name'])

    print('[{}] Finished count vectorize `category_name`'.format(time.time() - start_time))

    tv = TfidfVectorizer(max_features=MAX_FEATURES_ITEM_DESCRIPTION,

                         ngram_range=(1, 3),

                         stop_words='english')

    X_description = tv.fit_transform(merge['item_description'])

    print('[{}] Finished TFIDF vectorize `item_description`'.format(time.time() - start_time))

    lb = LabelBinarizer(sparse_output=True)

    X_brand = lb.fit_transform(merge['brand_name'])

    print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))

    X_dummies = csr_matrix(pd.get_dummies(merge[['item_condition_id', 'shipping']],

                                          sparse=True).values)

    print('[{}] Finished to get dummies on `item_condition_id` and `shipping`'.format(time.time() - start_time))

    sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()

    print('[{}] Finished to create sparse merge'.format(time.time() - start_time))

    X = sparse_merge[:nrow_train]

    X_test = sparse_merge[nrow_train:]

    #train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.1, random_state = 144)

    d_train = lgb.Dataset(X, label=y)

    #d_valid = lgb.Dataset(valid_X, label=valid_y, max_bin=8192)

    #watchlist = [d_train, d_valid]

    params = {

        'learning_rate': 0.73,

        'application': 'regression',

        'max_depth': 3,

        'num_leaves': 100,

        'verbosity': -1,

        'metric': 'RMSE',

    }

    # verbose_eval is omitted: it has no effect without a validation set, and LightGBM 4.0+ no longer accepts it

    model = lgb.train(params, train_set=d_train, num_boost_round=3000)

    preds = 0.56*model.predict(X_test)

    model = Ridge(solver="sag", fit_intercept=True, random_state=42)

    model.fit(X, y)

    print('[{}] Finished to train ridge'.format(time.time() - start_time))

    preds += 0.44*model.predict(X=X_test)

    print('[{}] Finished to predict ridge'.format(time.time() - start_time))

    submission['price'] = np.expm1(preds)

    submission.loc[submission['price'] < 0.0, 'price'] = 0.0

    submission.to_csv("sample_submission_stg2.csv", index=False)

if __name__ == '__main__':

    main()
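The final prediction in `main()` is a weighted average of the two models' outputs on the log(price + 1) scale, converted back to prices with `expm1`. The sketch below isolates that step with NumPy; the helper name `blend_log_predictions` and the toy inputs are assumptions for illustration, while the 0.56 / 0.44 weights come from the script above.

```python
import numpy as np

def blend_log_predictions(log_preds_a, log_preds_b, weight_a=0.56):
    """Weighted average of two log1p-scale predictions, mapped back to prices."""
    log_preds_a = np.asarray(log_preds_a, dtype=float)
    log_preds_b = np.asarray(log_preds_b, dtype=float)
    blended = weight_a * log_preds_a + (1.0 - weight_a) * log_preds_b
    prices = np.expm1(blended)            # inverse of log1p
    return np.clip(prices, 0.0, None)     # a price cannot be negative

# Toy predictions for illustration only
a = np.log1p([10.0, 50.0])
b = np.log1p([10.0, 50.0])
prices = blend_log_predictions(a, b)
print(prices)
```

Averaging on the log scale is consistent with the RMSLE metric; the weights themselves would normally be chosen on a held-out validation split, which the script leaves commented out.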
