At Walmart.com in the U.S. and at Walmart’s 11 other websites around the world, we provide a seamless shopping experience where products are sold by:

  1. Own Merchants for Walmart.com & Walmart Stores
  2. Suppliers for Online & Stores
  3. Sellers on Walmart’s marketplaces
 

A product sold on walmart.com: online and in stores by Walmart, and by 3 marketplace sellers

The process is referred to internally as “Item Setup”; visitors to the sites see product listings only after data processing for Products, Offers, Price, Inventory & Logistics. These entities are composed of data from multiple sources in different formats & schemas, and each has different characteristics around data processing:

  1. Products require the most data preparation, around:
  • Normalization — standardization of attributes & values, which aids search and discovery (a sketch follows this list)
  • Matching — a deceptively complex problem of matching duplicates with imperfect data
  • Classification — classification against categories & taxonomies
  • Content — scoring data quality on attributes like Title, Description, Specifications, etc., and finding & filling the “gaps” through entity extraction techniques
  • Images — selecting the best resolution, deriving attributes, detecting watermarks
  • Grouping — matching and grouping products based on variations, like shoes varying by color & size
  • Merging — selecting the best sources and aggregating data from multiple sources
  • Reprocessing — the catalog needs to be reprocessed to pick up daily changes
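
To make the normalization step concrete, here is a minimal sketch in Java. The attribute names, synonym table, and lower-casing rules are hypothetical illustrations, not Walmart’s actual normalization logic, which is driven by much larger per-category dictionaries.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Minimal sketch of attribute normalization (hypothetical rules). */
public class AttributeNormalizer {

    // Hypothetical synonym table mapping raw attribute names to canonical ones.
    private static final Map<String, String> CANONICAL_NAMES = Map.of(
            "colour", "color",
            "col", "color",
            "sz", "size",
            "shoe_size", "size");

    /** Returns a new map with canonical attribute names and trimmed, lower-cased values. */
    public static Map<String, String> normalize(Map<String, String> rawAttributes) {
        Map<String, String> normalized = new HashMap<>();
        for (Map.Entry<String, String> e : rawAttributes.entrySet()) {
            String key = e.getKey().trim().toLowerCase(Locale.ROOT);
            key = CANONICAL_NAMES.getOrDefault(key, key);
            String value = e.getValue().trim().toLowerCase(Locale.ROOT);
            normalized.put(key, value);
        }
        return normalized;
    }

    public static void main(String[] args) {
        // {"Colour" -> "Red", "SZ" -> " 10 "} becomes {"color" -> "red", "size" -> "10"}
        System.out.println(normalize(Map.of("Colour", "Red", "SZ", " 10 ")));
    }
}
```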

2. Offers are made by multiple sellers for the same products & need to be checked for correctness on the following (a validation sketch appears after the list):

  • Identifiers
  • Price variance
  • Shipping
  • Quantity
  • Condition
  • Start & End Dates
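
A minimal sketch of what such offer checks might look like in Java; the fields, the 50% price-variance band, and the reference-price source are all hypothetical, not our production rules.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical offer record and validation checks (illustrative only). */
public class OfferValidator {

    record Offer(String gtin, double price, int quantity,
                 String condition, LocalDate start, LocalDate end) {}

    /** Returns human-readable validation errors; an empty list means the offer passes. */
    static List<String> validate(Offer offer, double referencePrice) {
        List<String> errors = new ArrayList<>();
        if (offer.gtin() == null || !offer.gtin().matches("\\d{14}")) {
            errors.add("identifier must be a 14-digit GTIN");
        }
        // Hypothetical variance band: flag prices more than 50% away from the reference.
        if (Math.abs(offer.price() - referencePrice) > 0.5 * referencePrice) {
            errors.add("price variance exceeds 50% of reference price");
        }
        if (offer.quantity() < 0) {
            errors.add("quantity must be non-negative");
        }
        if (!List.of("NEW", "REFURBISHED", "USED").contains(offer.condition())) {
            errors.add("unknown condition: " + offer.condition());
        }
        if (offer.end() != null && offer.end().isBefore(offer.start())) {
            errors.add("end date precedes start date");
        }
        return errors;
    }

    public static void main(String[] args) {
        Offer offer = new Offer("00012345678905", 199.99, 3, "NEW",
                LocalDate.now(), LocalDate.now().plusDays(30));
        System.out.println(validate(offer, 120.00)); // flags the price variance
    }
}
```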

3. Pricing & Inventory are adjusted many times a day, and these changes need to be processed with very low latency & under strict time constraints
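
For this latency-sensitive path, a Kafka producer might be tuned roughly as in the sketch below; the topic name and the exact settings (linger.ms, acks) are illustrative assumptions, not our production configuration. Keying by item id keeps all updates for an item on one partition, which preserves per-item ordering.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

/** Sketch of a producer tuned for low-latency price updates (illustrative settings). */
public class PriceUpdateProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("linger.ms", "0"); // send immediately rather than batching
        props.put("acks", "1");      // leader ack only, trading durability for latency

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by item id so all updates for an item land on one partition, in order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("price-updates", "item-42", "{\"price\": 19.99}");
            producer.send(record, (metadata, e) -> {
                if (e != null) e.printStackTrace();
            });
        }
    }
}
```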

4. Logistics has a strong requirement around data correctness to optimize cost & delivery

 

Modified from the original, with permission from Neha Narkhede

Architecturally, this yields a large number of decentralized, autonomous services, systems & teams which handle the data “before & after” listing on the site. As part of a redesign around 2014, we started looking into building scalable data processing systems. I was personally influenced by the famous blog post “The Log: What every software engineer should know about real-time data’s unifying abstraction”, which suggested that Kafka could provide a good abstraction to connect hundreds of microservices and teams, and could evolve into a company-wide multi-tenant data hub. We started modeling changes as event streams recorded in Kafka before processing. The data processing is performed using a variety of technologies (a consumer sketch appears after the list):

  1. Stream processing using Apache Storm & Apache Spark
  2. Plain Java programs
  3. Reactive microservices
  4. Akka Streams
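
As a sketch of the plain-Java option above: a consumer loop reading item-change events from a topic. The topic name and group id are hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

/** Minimal plain-Java consumer of item-change events (hypothetical topic and group). */
public class ItemChangeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "item-setup-processor");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("item-changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record is one change event for one item, keyed by item id.
                    System.out.printf("item %s changed: %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```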

The new data pipelines, rolled out in phases since 2015, have enabled business growth: we are onboarding sellers quicker and setting up product listings faster. Kafka is also the backbone for our new Near Real Time (NRT) Search Index, where changes are reflected on the site in seconds.

 

Message rate, filtered for one day and split hourly

The usage of Kafka continues to grow, with new topics added every day; we have many small clusters with hundreds of topics, processing billions of updates per day, mostly driven by Pricing & Inventory adjustments. We built operational tools for tracking flows and SLA metrics, measuring message send/receive latencies for producers and consumers, and alerting on backlogs, latency, and throughput (a backlog-monitoring sketch follows). The nice thing about capturing all updates in Kafka is that we can use the same data for reprocessing the catalog, sharing data between environments, A/B testing, analytics & the data warehouse.
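
As one illustration of backlog alerting, the sketch below computes per-partition consumer lag with Kafka’s AdminClient, as the log end offset minus the committed offset. The group id is hypothetical, and our actual tooling is more involved.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

/** Sketch: compute consumer-group backlog as (log end offset - committed offset). */
public class BacklogMonitor {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group (group id is hypothetical).
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("item-setup-processor")
                    .partitionsToOffsetAndMetadata().get();

            // Latest (log end) offsets for the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest = admin
                    .listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                    .all().get();

            committed.forEach((tp, offset) -> {
                long lag = latest.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag); // alert when lag exceeds a threshold
            });
        }
    }
}
```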

The shift to Kafka enabled fast processing but also introduced new challenges: managing many service topologies & their data dependencies, schema management for thousands of attributes, multi-DC data balancing, and shielding consumer sites from changes which may impact business.

The core tenet which drove Kafka adoption, namely that “Item Setup” teams in different geographical locations can operate autonomously, has definitely enabled agile development. I have personally witnessed this over the last couple of years since its introduction. The next steps are to increase awareness of Kafka internally for new and (re)architected data processing applications, and to evaluate exciting new streaming technologies like Kafka Streams and Apache Flink (a Kafka Streams sketch follows). We will also engage with the Kafka open source community and the surrounding ecosystem to make contributions.
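
As a taste of the Kafka Streams API we are evaluating, here is a minimal topology that routes price-change events to a dedicated topic; the topic names and the naive JSON check are assumptions for illustration, not a production design.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

/** Minimal Kafka Streams topology: route price-change events to their own topic. */
public class PriceChangeRouter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-change-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> changes = builder.stream("item-changes");
        // Keep only events whose payload marks them as price changes (naive JSON check).
        changes.filter((itemId, json) -> json != null && json.contains("\"type\":\"PRICE\""))
               .to("price-changes");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```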
