At Walmart.com in the U.S. and at Walmart’s 11 other websites around the world, we provide a seamless shopping experience where products are sold by:

  1. Walmart’s own merchants for Walmart.com & Walmart stores
  2. Suppliers for online & stores
  3. Sellers on Walmart’s marketplaces
 

A product sold on walmart.com: online and in stores by Walmart, and by three marketplace sellers

The process is referred to internally as “Item Setup”; visitors to the sites see product listings only after data processing across Products, Offers, Price, Inventory & Logistics. These entities comprise data from multiple sources in different formats & schemas, and each has different characteristics around data processing:

  1. Products require more data preparation than the other entities, around:
  • Normalization — standardization of attributes & values, which aids search and discovery (a minimal sketch follows this list)
  • Matching — a fairly complex problem of matching duplicate products given imperfect data
  • Classification — classification against categories & taxonomies
  • Content — scoring data quality on attributes like title, description, and specifications, then finding & filling the “gaps” through entity extraction techniques
  • Images — selecting the best resolution, deriving attributes, and detecting watermarks
  • Grouping — matching and grouping products based on variations, such as shoes varying by color & size
  • Merging — selecting the best sources and aggregating data from multiple sources
  • Reprocessing — the catalog needs to be reprocessed to pick up daily changes

  2. Offers are made by multiple sellers for the same products and need to be checked for correctness (a validation sketch follows this list) on:

  • Identifiers
  • Price variance
  • Shipping
  • Quantity
  • Condition
  • Start & End Dates

  3. Pricing & Inventory adjustments arrive many times a day and need to be processed with very low latency under strict time constraints (see the consumer sketch after this list)

  4. Logistics has strong requirements around data correctness to optimize cost & delivery
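To make the normalization step concrete, here is a minimal Java sketch of mapping raw attribute values to canonical forms. The attribute, synonym table, and class names are illustrative assumptions, not Walmart’s actual code; production normalization covers thousands of attributes and is often learned rather than hand-coded.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of attribute-value normalization (hypothetical rules). */
public class AttributeNormalizer {

    // Map raw vendor-supplied values to canonical forms; real rule sets are
    // far larger and frequently learned from data rather than hand-written.
    private static final Map<String, String> COLOR_SYNONYMS = new HashMap<>();
    static {
        COLOR_SYNONYMS.put("navy blue", "Navy");
        COLOR_SYNONYMS.put("nvy", "Navy");
        COLOR_SYNONYMS.put("midnight", "Navy");
    }

    public static String normalizeColor(String raw) {
        if (raw == null) return null;
        String key = raw.trim().toLowerCase();
        return COLOR_SYNONYMS.getOrDefault(key, capitalize(key));
    }

    private static String capitalize(String s) {
        return s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(normalizeColor(" NVY "));   // Navy
        System.out.println(normalizeColor("red"));     // Red
    }
}
```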
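The offer checks above can likewise be pictured as a set of validation rules. The sketch below uses hypothetical field names and thresholds to show the kind of correctness checks run on identifiers, price variance, shipping, quantity, condition, and dates.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical offer record and validation checks; fields and limits are illustrative. */
public class OfferValidator {

    record Offer(String gtin, double price, double referencePrice,
                 double shippingCost, int quantity, String condition,
                 LocalDate startDate, LocalDate endDate) {}

    static List<String> validate(Offer o) {
        List<String> errors = new ArrayList<>();
        if (o.gtin() == null || !o.gtin().matches("\\d{14}")) {
            errors.add("identifier: GTIN must be 14 digits");
        }
        // Flag offers priced far outside the reference price for this product.
        if (o.referencePrice() > 0
                && Math.abs(o.price() - o.referencePrice()) / o.referencePrice() > 0.50) {
            errors.add("price variance: more than 50% from reference price");
        }
        if (o.shippingCost() < 0) errors.add("shipping: cost cannot be negative");
        if (o.quantity() < 0) errors.add("quantity: cannot be negative");
        if (!List.of("NEW", "REFURBISHED", "USED").contains(o.condition())) {
            errors.add("condition: unrecognized value");
        }
        if (o.endDate() != null && o.endDate().isBefore(o.startDate())) {
            errors.add("dates: end date precedes start date");
        }
        return errors;
    }

    public static void main(String[] args) {
        Offer bad = new Offer("123", 200.0, 20.0, -1.0, 5, "NEW",
                LocalDate.now(), LocalDate.now().minusDays(1));
        validate(bad).forEach(System.out::println);
    }
}
```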
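For the latency-sensitive Pricing & Inventory flows, a Kafka consumer can be tuned to favor responsiveness over batching. This is a generic sketch with illustrative topic, group, and tuning values, not the actual production configuration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch of a latency-sensitive consumer for price/inventory updates. */
public class PriceInventoryConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "price-inventory-processors");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Favor latency over batching: return fetches as soon as any data is ready.
        props.put("fetch.min.bytes", "1");
        props.put("fetch.max.wait.ms", "10");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("price-inventory-updates"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> rec : records) {
                    // Apply the adjustment; must complete well within the SLA window.
                    System.out.printf("item=%s update=%s%n", rec.key(), rec.value());
                }
            }
        }
    }
}
```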

 

Modified original, used with permission from Neha Narkhede

Architecturally, this leads to many decentralized, autonomous services, systems & teams which handle the data “before & after” listing on the site. As part of a redesign around 2014, we started looking into building scalable data processing systems. I was personally influenced by the famous blog post “The Log: What every software engineer should know about real-time data’s unifying abstraction”, which suggested that Kafka could provide a good abstraction to connect hundreds of microservices and teams, and evolve into a company-wide multi-tenant data hub. We started modeling changes as event streams recorded in Kafka before processing (a producer sketch follows the list below). The data processing is performed using a variety of technologies:

  1. Stream processing using Apache Storm & Apache Spark
  2. Plain Java programs
  3. Reactive microservices
  4. Akka Streams
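As a concrete example of recording changes as events before processing, here is a minimal Kafka producer sketch. The topic name, key choice, and payload shape are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch: record an item change as an event in Kafka before processing. */
public class ItemChangeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by product ID keeps all changes for one product in order
            // on a single partition, which downstream processors rely on.
            String productId = "SKU-12345";
            String change = "{\"field\":\"title\",\"newValue\":\"Running Shoe, Navy\"}";
            producer.send(new ProducerRecord<>("item-changes", productId, change));
        }
    }
}
```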

The new data pipelines, rolled out in phases since 2015, have enabled business growth: we are onboarding sellers more quickly and setting up product listings faster. Kafka is also the backbone for our new Near Real Time (NRT) search index, where changes are reflected on the site in seconds.
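A simplified picture of that NRT path: a consumer reads item changes and upserts them into the search index. SearchIndexClient and the topic name below are hypothetical stand-ins for the real components.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch of a near-real-time indexer: consume item changes and upsert
    them into the search index. */
public class NrtIndexer {

    // Hypothetical interface standing in for whatever engine backs the index.
    interface SearchIndexClient {
        void upsert(String productId, String documentJson);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "nrt-indexer");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Stand-in implementation; a real one would call the search engine's API.
        SearchIndexClient index =
                (id, doc) -> System.out.printf("index upsert %s -> %s%n", id, doc);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("item-changes"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(100))) {
                    index.upsert(rec.key(), rec.value()); // visible on the site in seconds
                }
            }
        }
    }
}
```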

 

Message rate for a single day, split hourly

The usage of Kafka continues to grow, with new topics added every day. We have many small clusters with hundreds of topics, processing billions of updates per day, mostly driven by Pricing & Inventory adjustments. We built operational tools for tracking flows, SLA metrics, and message send/receive latencies for producers and consumers, with alerting on backlogs, latency, and throughput. The nice thing about capturing all updates in Kafka is that we can process the same data for catalog reprocessing, sharing data between environments, A/B testing, analytics & the data warehouse.
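One simple way to implement the backlog alerting mentioned above is to compare each partition’s committed offset with its log-end offset. The sketch below uses the standard consumer API; the group ID, topic, and threshold are illustrative.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

/** Sketch of backlog alerting: compare a group's committed offsets with
    the log-end offsets of the topic it consumes. */
public class LagMonitor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Use the monitored group's ID so committed() returns its offsets.
        props.put("group.id", "price-inventory-processors");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            Set<TopicPartition> partitions = consumer.partitionsFor("price-inventory-updates")
                    .stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toSet());
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata c = committed.get(tp);
                long lag = end.get(tp) - (c == null ? 0L : c.offset());
                if (lag > 10_000) { // hypothetical alerting threshold
                    System.out.printf("ALERT: %s backlog is %d messages%n", tp, lag);
                }
            }
        }
    }
}
```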

The shift to Kafka enabled fast processing but has also introduced new challenges, like managing many service topologies & their data dependencies, schema management for thousands of attributes, multi-DC data balancing, and shielding consumer sites from changes which may impact the business.

The core tenet which drove Kafka adoption, that “Item Setup” teams in different geographical locations can operate autonomously, has definitely enabled agile development. I have personally witnessed this over the last couple of years since its introduction. The next steps are to increase awareness of Kafka internally for new and (re)architected data processing applications, and to evaluate exciting new streaming technologies like Kafka Streams and Apache Flink. We will also engage with the Kafka open source community and the surrounding ecosystem to make contributions.
