At Walmart.com in the U.S. and at Walmart’s 11 other websites around the world, we provide a seamless shopping experience where products are sold by:

  1. Own Merchants for Walmart.com & Walmart Stores
  2. Suppliers for Online & Stores
  3. Sellers on Walmart’s marketplaces
 

Product sold on walmart.com: online & in stores by Walmart, and by 3 marketplace sellers

The process is referred to internally as “Item Setup”; visitors to the sites see product listings after data processing for Products, Offers, Price, Inventory & Logistics. These entities comprise data from multiple sources in different formats & schemas, and they have different characteristics around data processing:

  1. Products require the most data preparation, around:
  • Normalization — This is the standardization of attributes & values, which aids search and discovery
  • Matching — This is a somewhat complex problem: matching duplicates despite imperfect data
  • Classification — This involves classification against Categories & Taxonomies
  • Content — This involves scoring data quality on attributes like Title, Description, Specifications, etc., and finding & filling the “gaps” through entity extraction techniques
  • Images — This involves selecting the best resolution, deriving attributes, and detecting watermarks
  • Grouping — This involves matching and grouping products based on variations, like shoes varying on Colors & Sizes
  • Merging — This involves selecting the best sources and aggregating data from multiple sources
  • Reprocessing — The catalog needs to be reprocessed to pick up daily changes

2. Offers are made by multiple sellers for the same products & need to be checked for correctness (a hedged validation sketch follows this list) on:

  • Identifiers
  • Price variance
  • Shipping
  • Quantity
  • Condition
  • Start & End Dates

3. Pricing & Inventory are adjusted many times a day, and these updates need to be processed with very low latency & strict time constraints

4. Logistics has a strong requirement around data correctness to optimize cost & delivery
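As a hedged illustration of the offer checks listed in item 2, the sketch below validates those fields on a simple offer object. The field names, the 50% price-variance band, and the rules themselves are hypothetical stand-ins, not Walmart’s actual schema or thresholds.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Illustrative offer-validation sketch; field names and thresholds are
// hypothetical, not Walmart's actual schema or rules.
public class OfferValidator {

    public static class Offer {
        String productId;   // identifier the offer must resolve to
        double price;       // seller's asking price
        int quantity;       // available units
        String condition;   // e.g. "NEW", "REFURBISHED"
        Instant startDate;
        Instant endDate;
    }

    /** Returns human-readable violations; an empty list means the offer passed. */
    public static List<String> validate(Offer offer, double referencePrice) {
        List<String> errors = new ArrayList<>();
        if (offer.productId == null || offer.productId.isEmpty()) {
            errors.add("missing product identifier");
        }
        // Flag offers priced far from a reference price (hypothetical 50% band).
        if (referencePrice > 0
                && Math.abs(offer.price - referencePrice) / referencePrice > 0.5) {
            errors.add("price variance exceeds threshold");
        }
        if (offer.quantity < 0) {
            errors.add("negative quantity");
        }
        if (offer.condition == null) {
            errors.add("missing condition");
        }
        if (offer.startDate != null && offer.endDate != null
                && !offer.startDate.isBefore(offer.endDate)) {
            errors.add("start date must precede end date");
        }
        return errors;
    }
}
```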

 

Modified Original with permission from Neha Narkhede

Architecturally, this yields lots of decentralized autonomous services, systems & teams which handle the data “before & after” listing on the site. As part of a redesign around 2014, we started looking into building scalable data processing systems. I was personally influenced by the famous blog post “The Log: What every software engineer should know about real-time data’s unifying abstraction”, which suggested that Kafka could provide a good abstraction to connect hundreds of microservices and teams, and could evolve into a company-wide multi-tenant data hub. We started modeling changes as event streams recorded in Kafka before processing (a minimal producer sketch follows the list below). The data processing is performed using a variety of technologies like:

  1. Stream processing using Apache Storm & Apache Spark
  2. Plain Java programs
  3. Reactive microservices
  4. Akka Streams
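To make “changes as event streams recorded in Kafka before processing” concrete, here is a minimal producer sketch. The topic name, key, and JSON payload shape are illustrative assumptions, not our actual schema; keying by product id is one way to keep all changes to a product ordered within a partition.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch: record a product change as an event in Kafka before any
// downstream processing. Topic name and payload shape are hypothetical.
public class ProductChangeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String productId = "P12345";  // hypothetical product key
            String event = "{\"productId\":\"P12345\",\"field\":\"title\",\"newValue\":\"Running Shoe\"}";
            // Keying by product id keeps all changes to one product in order
            // within a single partition.
            producer.send(new ProducerRecord<>("product-changes", productId, event));
        }  // close() flushes any buffered records
    }
}
```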

The new data pipelines, which have been rolled out in phases since 2015, have enabled business growth: we are onboarding sellers quicker and setting up product listings faster. Kafka is also the backbone for our new Near Real Time (NRT) Search Index, where changes are reflected on the site in seconds.
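As a sketch of the consumer side of such an NRT index, the loop below polls change events and applies each one to the index. The topic, consumer group id, and the indexing call itself are stand-ins for the real system, which this post does not detail.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical sketch of the consumer side of a near-real-time index:
// poll change events and apply each to the search index within seconds.
public class NrtIndexUpdater {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "nrt-index-updater");  // hypothetical group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("product-changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Stand-in for the real call that updates the search index.
                    System.out.printf("index update: key=%s value=%s%n",
                            record.key(), record.value());
                }
            }
        }
    }
}
```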

 

Message Rate filtered for a Day, split Hourly

The usage of Kafka continues to grow, with new topics added every day; we have many small clusters with hundreds of topics, processing billions of updates per day, mostly driven by Pricing & Inventory adjustments. We built operational tools for tracking flows, SLA metrics, and message send/receive latencies for producers and consumers, with alerting on backlogs, latency, and throughput. The nice thing about capturing all the updates in Kafka is that we can use the same data for reprocessing the catalog, sharing data between environments, A/B testing, analytics & the data warehouse.
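As an illustration of the backlog alerting, one way to measure a consumer group’s lag is to compare each partition’s log end offset with the group’s committed offset. The sketch below does this with the plain consumer API; the topic and group names are hypothetical, and our actual tooling is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

// Hypothetical backlog check: lag per partition = log end offset minus the
// consumer group's committed offset. Topic and group names are illustrative.
public class BacklogMonitor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "nrt-index-updater");  // group whose lag we measure
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // This consumer never subscribes, so it does not join (or rebalance)
        // the group; it only reads metadata and committed offsets.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor("product-changes")) {
                partitions.add(new TopicPartition(info.topic(), info.partition()));
            }
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long lag = endOffsets.get(tp) - (committed == null ? 0 : committed.offset());
                System.out.printf("%s lag=%d%n", tp, lag);  // alert if lag exceeds a threshold
            }
        }
    }
}
```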

The shift to Kafka has enabled fast processing, but it has also introduced new challenges: managing many service topologies & their data dependencies, schema management for thousands of attributes, multi-DC data balancing, and shielding consumer sites from changes which may impact business.

The core tenet which drove Kafka adoption, that “Item Setup” teams in different geographical locations can operate autonomously, has definitely enabled agile development. I have personally witnessed this over the last couple of years since its introduction. The next steps are to increase awareness of Kafka internally, for new applications & for (re)architecting existing data processing applications, and to evaluate exciting new streaming technologies like Kafka Streams and Apache Flink. We will also engage with the Kafka open source community and the surrounding ecosystem to make contributions.
