Apache Kafka for Item Setup


We provide a seamless shopping experience where products are sold by:

Own Merchants for Walmart.com & Walmart Stores
Suppliers for Online & Stores
Sellers on Walmart’s marketplaces

A product sold on walmart.com: online and in stores by Walmart, and by 3 marketplace sellers

Logistics. These entities comprise data from multiple sources in different formats & schemas, and they have different data-processing characteristics:

Products require more data preparation around:

Normalization: standardizing attributes & values, which aids search and discovery
Matching: a moderately complex problem of matching duplicates with imperfect data
Classification: classifying products against Categories & Taxonomies
Content: scoring data quality on attributes like Title, Description, and Specifications, and finding & filling the “gaps” through entity extraction techniques
Images: selecting the best resolution, deriving attributes, and detecting watermarks
Grouping: matching and grouping products based on variations, like shoes varying in Colors & Sizes
Merging: selecting the best sources and aggregating data from multiple sources
Reprocessing: the Catalog needs to be reprocessed to pick up daily changes

Offers are made by multiple sellers for the same products and need to be checked for correctness on:

Identifiers
Price variance
Shipping
Quantity
Condition
Start & End Dates
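
As a rough illustration of the offer checks listed above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Offer` fields, the allowed conditions, the 50% price-variance threshold, and the idea of comparing against a reference price are hypothetical and do not describe the actual rules used in production.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Illustrative constants (assumptions, not real business rules).
GTIN_LENGTHS = (8, 12, 13, 14)          # standard GTIN lengths
ALLOWED_CONDITIONS = {"new", "refurbished", "used"}
MAX_PRICE_VARIANCE = 0.5                # allow up to 50% deviation from the reference price


@dataclass
class Offer:
    """Hypothetical offer record; field names are invented for this sketch."""
    seller_id: str
    gtin: str
    price: float
    reference_price: float              # e.g. a typical price for the same product
    shipping_cost: float
    quantity: int
    condition: str
    start: date
    end: date


def check_offer(offer: Offer) -> List[str]:
    """Return the names of the checks that failed (empty list means the offer passed)."""
    problems = []
    # Identifier: digits only, with a valid GTIN length (check digit not verified here).
    if not (offer.gtin.isdigit() and len(offer.gtin) in GTIN_LENGTHS):
        problems.append("identifier")
    # Price variance: relative deviation from the reference price.
    if offer.price <= 0 or offer.reference_price <= 0:
        problems.append("price variance")
    elif abs(offer.price - offer.reference_price) / offer.reference_price > MAX_PRICE_VARIANCE:
        problems.append("price variance")
    if offer.shipping_cost < 0:
        problems.append("shipping")
    if offer.quantity < 0:
        problems.append("quantity")
    if offer.condition not in ALLOWED_CONDITIONS:
        problems.append("condition")
    if offer.start > offer.end:
        problems.append("dates")
    return problems
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it easier to report back to a seller exactly which part of the offer needs fixing.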

adjustments many times a day, which need to be processed with very low latency & strict time constraints

data correctness to optimize cost & delivery

Modified Original with permission from Neha Narkhede

” where Kafka could provide a good abstraction to connect hundreds of microservices and teams, and evolve into a company-wide multi-tenant data hub. We started modeling changes as event streams recorded in Kafka before processing. The data processing is performed using a variety of technologies, such as:

Apache Spark
Plain Java programs
Reactive microservices
Akka Streams
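
To make the idea of "changes as event streams recorded before processing" concrete, here is a toy in-memory sketch in Python. It is not the Kafka client API: `EventLog` is a hypothetical stand-in for a single topic, and CRC32 is used only as a simple deterministic hash in place of Kafka's own partitioner. The point it illustrates is that events with the same key land in the same partition in order, so any consumer can rebuild state by reading the log from the beginning.

```python
import zlib
from typing import Any, Dict, List, Tuple


class EventLog:
    """Toy append-only log with keyed partitions (a stand-in for one topic)."""

    def __init__(self, num_partitions: int = 3) -> None:
        self.partitions: List[List[Tuple[str, Any]]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: Any) -> Tuple[int, int]:
        """Record an event and return (partition, offset) where it was stored."""
        # Same key -> same partition, which preserves per-key ordering.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int = 0) -> List[Tuple[str, Any]]:
        """Return all events in a partition starting at the given offset."""
        return self.partitions[partition][offset:]


def latest_by_key(log: EventLog) -> Dict[str, Any]:
    """Rebuild current state by replaying every partition from offset 0."""
    state: Dict[str, Any] = {}
    for partition in range(len(log.partitions)):
        for key, value in log.read(partition, 0):
            state[key] = value          # later events for a key overwrite earlier ones
    return state


# Example: two price changes for one item and one for another.
log = EventLog(num_partitions=3)
log.append("item-1", {"price": 10.0})
log.append("item-2", {"price": 5.0})
log.append("item-1", {"price": 12.0})
current = latest_by_key(log)            # item-1 ends up with its most recent price
```

Because the log is never modified in place, a second consumer written in a different technology could read the same events independently and derive a different view from them.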

The new data pipelines, rolled out in phases since 2015, have enabled business growth: we are onboarding sellers more quickly and setting up product listings faster. Kafka is also the backbone for our new Near Real Time (NRT) Search Index, where changes are reflected on the site in seconds.

Message Rate filtered for a Day, split Hourly

The usage of Kafka continues to grow, with new topics added every day. We have many small clusters with hundreds of topics, processing billions of updates per day, mostly driven by Pricing & Inventory adjustments. We built operational tools for tracking flows, SLA metrics, and message send/receive latencies for producers and consumers, and for alerting on backlogs, latency, and throughput. The nice thing about capturing all the updates in Kafka is that we can process the same data for reprocessing of the catalog, sharing data between environments, A/B testing, analytics, and the data warehouse.
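
Backlog alerting of the kind mentioned above usually comes down to consumer lag: for each partition, the latest offset written minus the offset the consumer group has committed. The Python sketch below shows only that arithmetic. The function names, the input dictionaries, and the threshold are assumptions for illustration; in a real deployment the offsets would come from the cluster rather than from hand-written dictionaries.

```python
from typing import Dict, List


def consumer_lag(end_offsets: Dict[int, int], committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag = log end offset minus committed offset.

    A partition with no committed offset is treated as starting from 0,
    which is an assumption made for this sketch.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def backlog_alerts(end_offsets: Dict[int, int],
                   committed_offsets: Dict[int, int],
                   max_lag: int) -> List[int]:
    """Return the partitions whose lag exceeds the allowed backlog, in sorted order."""
    lag = consumer_lag(end_offsets, committed_offsets)
    return sorted(partition for partition, n in lag.items() if n > max_lag)


# Example: partition 0 is caught up, partitions 1 and 2 are behind.
end = {0: 100, 1: 250, 2: 40}
committed = {0: 100, 1: 200}
lagging = backlog_alerts(end, committed, max_lag=30)
```

Tracking lag per partition rather than per topic helps spot a single slow or stuck consumer that an aggregate number would hide.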

The shift to Kafka enabled fast processing but also introduced new challenges, such as managing many service topologies & their data dependencies, schema management for thousands of attributes, multi-DC data balancing, and shielding consumer sites from changes that may impact the business.

Re)architecting existing data processing applications, and evaluate exciting new streaming technologies like Kafka Streams and Apache Flink. We will also engage with the Kafka open source community and the surrounding ecosystem to make contributions.

Original link: https://www.cnblogs.com/felixzh/p/6035581.html