Deploying a Multistage Multimodal Recommender System on Amazon Elastic Kubernetes Service

Deploying a multistage, multimodal recommender system is not trivial, particularly when it must scale, adapt in near real time, and run reliably in the cloud.

In this post, I walk through my experience designing and deploying such a system end-to-end, covering data preparation, model training, and serving the models in production.

We'll explore the full pipeline, including retrieval, filtering, scoring, and ranking, along with the infrastructure and key decisions that make it all work. This includes feature stores, Bloom filters, Kubeflow, near real-time preference adaptation, and a significant latency win from in-memory feature caching.

It's a long read, but if you're building or scaling recommender systems, you'll find practical patterns here that you can apply directly to your own projects.

The main sections of this post

Some details about the system
Why the present design was chosen
System components
Data source
Full training and deployment pipeline
Continuous fine-tuning pipeline
Processing requests through the 14 models in NVIDIA Triton Inference Server
Improving item feature lookup latency with in-memory caching
Autoscaling the Triton Inference Server on EKS
Validating contextual recommendations, Bloom filter filtering, and near real-time recommendation updates (with demo)
Limitations and Future Work
Conclusion
Resources

Some details about the system

The recommender system consists of four main stages: a Two-Tower model generates candidates, a Bloom filter temporarily hides items the user recently interacted with, a DLRM ranker scores the remaining items using user, item, and context features, and a final reranking stage orders and samples from these scores to produce the final recommendations. The models use tabular collaborative features together with precomputed CLIP image embeddings and Sentence-BERT text embeddings.

In the retrieval model, these pretrained embeddings are fed into the candidate tower together with learned item features, providing the candidate tower with both content-based semantic signals and collaborative signals. The dot product between the query-tower output and the candidate-tower output is then used as a learned relevance score in this shared embedding space.
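As a rough illustration of that scoring step, the sketch below computes one relevance score per (user, item) pair as a row-wise dot product. The array names, batch size, and embedding dimension are hypothetical, and random vectors stand in for the real tower outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tower outputs for a batch of 4 (user, item) pairs in an
# 8-dimensional shared embedding space.
query_out = rng.normal(size=(4, 8))      # query-tower output, one row per user
candidate_out = rng.normal(size=(4, 8))  # candidate-tower output, one row per item

# Relevance score for each pair: the dot product of the two embeddings.
pair_scores = np.sum(query_out * candidate_out, axis=1)  # shape (4,)
```

A higher score means the user and item embeddings point in more similar directions in the shared space.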

In the DLRM ranker, the pretrained image and text embeddings participate in the dot-product interaction layer. These pairwise interactions are then passed to the top MLP, allowing content-based signals from the pretrained embeddings to complement the collaborative and contextual signals used for click prediction.
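The interaction layer can be pictured with a small NumPy sketch. It assumes every feature vector (including the image and text embeddings) has already been projected to a common dimension, which the post does not spell out, and all names and sizes are hypothetical.

```python
import numpy as np

def pairwise_dot_interactions(vectors: np.ndarray) -> np.ndarray:
    """Map (batch, num_features, dim) to (batch, num_pairs) of pairwise dot products."""
    gram = vectors @ vectors.transpose(0, 2, 1)          # (batch, F, F)
    rows, cols = np.triu_indices(vectors.shape[1], k=1)  # each unordered pair once
    return gram[:, rows, cols]

rng = np.random.default_rng(0)
batch, dim = 2, 16

# Hypothetical per-feature vectors, all assumed to share the same dimension.
user_emb, item_emb, image_emb, text_emb, dense_out = (
    rng.normal(size=(batch, dim)) for _ in range(5)
)

stacked = np.stack([user_emb, item_emb, image_emb, text_emb, dense_out], axis=1)
interactions = pairwise_dot_interactions(stacked)  # 5 features -> 10 pairs

# In the standard DLRM layout, the top MLP receives the dense vector
# concatenated with the pairwise interactions.
top_mlp_input = np.concatenate([dense_out, interactions], axis=1)
```

With five feature vectors there are ten unordered pairs, so the image and text embeddings each interact with every other feature before the top MLP sees them.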

Why the present design was chosen

The target use case is an ecommerce platform that must recommend relevant products as soon as users land on the homepage. The platform serves both registered users and anonymous visitors, and user behavior can vary significantly with the request context, such as device type, time of day, or day of week. This means that the recommendation service must provide reasonable cold-start recommendations for new users and must adapt recommendations to the context of the current request.

The solution also needs to scale. As more merchants are onboarded, the product catalog could grow to millions of items. At that point, scoring the full catalog on every request is impractical. A multistage design solves this problem by using a lightweight retrieval stage to fetch candidates quickly and a heavier ranking stage to score those candidates.

Also, the recommendation models need to stay up to date with new interactions; however, rebuilding the full retrieval stack every day is not practical. For this reason, two Kubeflow pipelines are defined. The first pipeline sets up the preprocessing workflows, trains the models from scratch, builds the ANN index, and deploys the Triton server and models. The second pipeline manages daily fine-tuning, which primarily updates the query tower and the ranker; the models are updated with new interaction signals, but the item embeddings are not regenerated.

System components

All components of the system work together to serve relevant recommendations quickly and at reasonable scale.

Kubeflow Pipelines manages both the full training workflow and the daily fine-tuning workflow on the Kubernetes-based system.
The NVIDIA Merlin stack handles GPU-accelerated feature engineering, preprocessing, and training of the retrieval and ranking models. Triton Inference Server hosts the multistage serving graph as a single ensemble model.
FAISS serves as the approximate nearest neighbor index for candidate retrieval.
Feast manages the user and item features across training and serving. ElastiCache for Valkey (Redis) backs the online feature store, manages each user's Bloom filter to allow filtering of already-seen items from a user's recommendation list, and stores global and category-based item popularity information based on interaction counts. Amazon Athena (with S3 and Glue) backs the offline feature store.
Amazon Elastic Kubernetes Service (EKS) runs the containerized machine learning workflows and scales compute to meet changing workload demands.

Figure 2: Recommender system MLOps with Kubeflow on Amazon Elastic Kubernetes Service (image by author)

Data source

The training data comes from a modified version of the AWS Retail Demo Store interaction generator. The user pool was scaled to 300,000 while the product catalog was kept at 2,465 items, with the associated images and descriptions. The dataset contains 13 million interactions across 14 days, stored as daily partitioned Parquet files (day_00.parquet through day_13.parquet).

Full training and deployment pipeline

The first Kubeflow pipeline handles the initial data copy, data preprocessing, model training, FAISS indexing, and Triton Inference Server deployment.

*Figure 3: Kubeflow UI showing the components of the full training and deployment pipeline* (image by author)

Data copy

The pipeline starts by copying all the inputs needed by downstream tasks from an S3 bucket to a persistent volume mounted at a local path. These include the interaction data, feature tables, product images, and the pretrained CLIP and Sentence-BERT models.

Preprocessing

The preprocessing step merges interaction data with user and item feature tables, then defines and fits three NVTabular workflows: one for the user features [jump to CODE], one for the item features [jump to CODE], and one for the context features [jump to CODE]. It also compiles the subgraphs into a full workflow. Splitting the workflows made it easier to build separate Triton models for feature transformations that can be updated independently.

Another preprocessing step simulates cold-start conditions (see code snippet below) during training. In 5% of training rows, the user ID, gender, and top_category features are replaced with sentinel values, followed by a separate 5% random masking of device type. Transformation with the NVTabular workflows maps the sentinels to the out-of-vocabulary (OOV) index.

# Mask some user and context features in the training data with 5% probability
import gc
import os

import cudf
import cupy
from merlin.io import Dataset  # assumed import; the original snippet does not show it

# input_path, output_path, train_days, full_workflow, and valid_raw
# are defined earlier in the preprocessing step.

# Sentinel values that the NVTabular workflows map to the OOV index
ANONYMOUS_USER = -1
OOV_GENDER = -1
OOV_TOP_CATEGORY = -1
OOV_DEVICE = -1

masked_train_dir = os.path.join(input_path, "masked_train")
os.makedirs(masked_train_dir, exist_ok=True)

for i in range(train_days):
    day = cudf.read_parquet(os.path.join(input_path, f"train_day_{i:02d}.parquet"))
    n = len(day)

    # Simulate anonymous users: mask the user-side features in ~5% of rows
    user_mask = cupy.random.random(n) < 0.05
    day.loc[user_mask, "user_id"] = ANONYMOUS_USER
    day.loc[user_mask, "gender"] = OOV_GENDER
    day.loc[user_mask, "top_category"] = OOV_TOP_CATEGORY

    # Independently mask the device type in ~5% of rows
    device_mask = cupy.random.random(n) < 0.05
    day.loc[device_mask, "device_type"] = OOV_DEVICE

    day.to_parquet(os.path.join(masked_train_dir, f"train_day_{i:02d}.parquet"), index=False)
    del day
    gc.collect()

masked_train_paths = [os.path.join(masked_train_dir, f"train_day_{i:02d}.parquet") for i in range(train_days)]
masked_train_ds = Dataset(masked_train_paths)

full_workflow.transform(masked_train_ds).to_parquet(os.path.join(output_path, "train"))
full_workflow.transform(valid_raw).to_parquet(os.path.join(output_path, "valid"))

To obtain the multimodal item features, the product images are encoded using OpenAI CLIP and the product descriptions are encoded using Sentence-BERT. Both embeddings are reduced to 64-dimensional vectors by PCA and stored as lookup tables keyed by the NVTabular-transformed item IDs. The mean age computed by the user workflow is saved for later injection into the feast_user_lookup model config. Another step prepares the offline and online feature artifacts. This step adds timestamps to the user and item features, writes the resulting features to the offline store, and materializes them into the online store for serving. At the same time, global and category-specific popularity information is computed from the interaction data and written to the Valkey database (db=3).
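The PCA reduction mentioned above can be sketched with plain NumPy. This is only an illustration under assumed shapes (200 items, 512-dimensional raw vectors); the project may well use a library PCA implementation instead, and random vectors stand in for real CLIP outputs.

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, n_components: int = 64) -> np.ndarray:
    """Project row-vector embeddings onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
raw_image_embeddings = rng.normal(size=(200, 512))  # hypothetical raw vectors
reduced = pca_reduce(raw_image_embeddings)          # shape (200, 64)

# Lookup table keyed by (hypothetical) transformed item IDs 0..199.
image_lookup = {item_idx: reduced[item_idx] for item_idx in range(len(reduced))}
```

At serving time, a lookup model only needs the table, so the PCA projection itself never runs on the request path.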

*Figure 4: the Valkey database for item popularity* (image by author)

Training the retrieval model

The Two-Tower model [jump to CODE] is trained on user and item features only, with in-batch negatives and a contrastive loss. The query tower ingests the user-side features while the candidate tower consumes the item features together with the precomputed image and text embeddings. See Figures 5 and 6 for details about the NVTabular preprocessing and the input block processing steps for each tower.
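To make the idea of in-batch negatives concrete, here is a minimal NumPy sketch. It assumes a softmax cross-entropy over in-batch similarities, which is one common contrastive formulation; the post does not state the exact loss, so treat this as an illustration rather than the training code.

```python
import numpy as np

def in_batch_softmax_loss(query_emb: np.ndarray, item_emb: np.ndarray) -> float:
    """Row i of query_emb is the positive match for row i of item_emb.

    Every other item in the batch acts as a negative for that query.
    """
    logits = query_emb @ item_emb.T                        # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # positives sit on the diagonal

# Well-separated embeddings give a loss near zero ...
aligned = 10.0 * np.eye(4)
loss_aligned = in_batch_softmax_loss(aligned, aligned)

# ... while uninformative embeddings give log(batch_size).
loss_uninformative = in_batch_softmax_loss(np.zeros((4, 3)), np.zeros((4, 3)))
```

The appeal of this setup is that no explicit negative sampling is needed: each batch of positive pairs supplies its own negatives.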

Figure 5: an illustration of the feature transforms with NVTabular and the steps in the input block of the candidate tower. (image by author, and inspired by *prior work from Jeremy and Jordan*)

Training uses the first 9 days of interaction data; evaluation uses days 10 through 12. After training, the candidate encoder is run over the full item catalog to compute item embeddings. For this, a custom LookupEmbeddings operator (based on Merlin's BaseOperator) handles the multimodal embedding lookup when loading item features in batches with Merlin's data loader. These item embeddings are used to build the FAISS index for approximate nearest-neighbor retrieval. The query encoder is saved separately for online inference.
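For intuition about what the index returns, the sketch below performs an exact inner-product top-K search with NumPy. It is a brute-force stand-in, not FAISS: an ANN index approximates this result without scanning every item. The function name and the toy data are hypothetical.

```python
import numpy as np

def top_k_inner_product(query: np.ndarray, item_embeddings: np.ndarray, k: int):
    """Return the indices and scores of the k items with the largest inner product."""
    scores = item_embeddings @ query              # (num_items,)
    top = np.argpartition(-scores, k - 1)[:k]     # unordered top-k (requires k <= num_items)
    top = top[np.argsort(-scores[top])]           # order the k winners by score
    return top, scores[top]

# Toy catalog of four 2-dimensional item embeddings.
toy_items = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]])
toy_query = np.array([1.0, 0.0])

retrieved_ids, retrieved_scores = top_k_inner_product(toy_query, toy_items, k=2)
```

Because the item embeddings are computed once and indexed, only the query embedding has to be produced at request time.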

Figure 6: an illustration of the feature transforms with NVTabular and the steps in the input block of the query tower. (image by author, and inspired by prior work from Jeremy and Jordan)

Training the ranking model

The DLRM ranker [jump to CODE] is trained on the same interaction data but with an expanded feature set. The feature set includes item features, user features, and request-time context features (such as device type and cyclical time-of-day and day-of-week features). These context features represent situational factors that can shape a customer's choice. For instance, a user might engage more with certain items when browsing on their phone versus a desktop, or show different preferences depending on the time of day or day of the week. The training target is a binary click label.

*Figure 7: the DLRM architecture together with the feature transforms* (image by author)

Model preparation and deployment

Once both models are trained, the pipeline assembles the serving artifacts needed by Triton. These include the saved query tower, the DLRM ranker, the NVTabular transform models, the FAISS index, and the lookup tables for the multimodal item embeddings. The Triton model repository is structured ahead of time, so each deployment only needs to copy the model artifacts into their versioned directories and inject runtime values such as the average user age (for the cold-start default), the retrieval topK, the ranking topK, and the diversity mode into the model config files.

A Helm chart deploys Triton Inference Server on EKS, starts the server in explicit mode, and then loads all the models (see the start script).

# Triton start script
set -e
MODELS_DIR=${1:-"/model/triton_model_repository"}

echo "Starting Triton Inference Server"
echo "Models directory: $MODELS_DIR"

# Explicit mode: only the models listed with --load-model are loaded at startup
tritonserver \
    --model-repository="$MODELS_DIR" \
    --model-control-mode=explicit \
    --load-model=nvt_user_transform \
    --load-model=nvt_item_transform \
    --load-model=nvt_context_transform \
    --load-model=multimodal_embedding_lookup \
    --load-model=query_tower \
    --load-model=faiss_retrieval \
    --load-model=dlrm_ranking \
    --load-model=item_id_decoder \
    --load-model=feast_user_lookup \
    --load-model=feast_item_lookup \
    --load-model=filter_seen_items \
    --load-model=softmax_sampling \
    --load-model=context_preprocessor \
    --load-model=unroll_features \
    --load-model=ensemble_model

Continuous fine-tuning pipeline

This Kubeflow pipeline handles daily model updates. The pipeline relies on some of the artifacts generated by the full training pipeline; therefore, its components mount the same persistent volume containing the saved artifacts.

*Figure 8: Kubeflow Pipelines UI showing the incremental retraining pipeline DAG* (image by author)

Copy incremental data

At the start of this run, the pipeline copies the latest interaction data from Amazon S3 together with a smaller replay set of older interactions. The replay portion gives the fine-tuning job a broader behavioral context and helps prevent the models from overfitting to only the latest pattern.

Preprocess data

This step merges the historical user and item features with the new interaction data, then transforms the data using the fitted NVTabular workflows from the most recent full training job.

Fine-tune models

This step updates the query tower and the ranker. It initializes the Two-Tower model from the previous checkpoint but with the candidate encoder frozen, so only the query tower parameters are trainable. This allows the model to adapt to recent user behavior while preserving the item-side embeddings used by the existing ANN index. A summary of the Two-Tower model showing the frozen layers can be found here.

The pipeline also initializes the DLRM ranker from the previous checkpoint but trains all the parameters using a smaller learning rate and for fewer epochs.

Once training completes, it saves the fine-tuned query tower and the DLRM ranker to new version folders in the existing Triton model repository.

Promote fine-tuned models

This step calls Triton to load the new models. Triton serves in-flight requests on the existing model versions while loading the new models in the background. Then it hot-swaps to the latest model versions once they're ready.
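One way to issue such a load call is through the HTTP endpoint of Triton's model repository extension (`POST v2/repository/models/<name>/load`). The sketch below only builds the requests; the base URL is an assumption (Triton's default HTTP port on localhost), and the helper name is hypothetical. The pipeline itself may use a Triton client library instead.

```python
import urllib.request

def build_load_request(base_url: str, model_name: str) -> urllib.request.Request:
    """Build the HTTP request that asks Triton to (re)load one model."""
    url = f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/load"
    return urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={"Content-Type": "application/json"},
    )

# Assumed server address; the two fine-tuned models are reloaded.
load_requests = [
    build_load_request("http://localhost:8000", name)
    for name in ("query_tower", "dlrm_ranking")
]

# Sending is left commented out because it needs a running server:
# for req in load_requests:
#     urllib.request.urlopen(req)
```

In explicit model-control mode, a load call like this is what makes a newly written version folder visible to the server.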

*Figure 9: the query_tower and dlrm_ranker are both promoted to new versions after fine-tuning* (image by author)

Processing requests through the 14 models in NVIDIA Triton Inference Server

The model repository contains 14 models across two backends: Python backends for feature lookups, feature transforms, and filtering, and TensorFlow backends for the query tower and the DLRM ranker. An ensemble configuration wires all these models into a directed acyclic graph (DAG) that NVIDIA Triton Inference Server executes.

*Figure 10: an illustration of request processing in the Triton Inference Server* (image by author)

How context and user features are prepared

Each request arrives with a user ID and an optional device type and request timestamp. If any context is missing, the context_preprocessor imputes the defaults. For example, the current server time is imputed for a missing timestamp and an OOV sentinel is imputed for a missing device type. The context workflow transforms the context data into a categorified device index and four temporal features (hour sine/cosine, day-of-week sine/cosine).
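The default imputation and the cyclical encoding can be sketched in a few lines. The function name, the output keys, the hour-level granularity, and the Monday=0 weekday convention are all assumptions for illustration; only the `-1` sentinel comes from the training snippet earlier in the post.

```python
import math
from datetime import datetime, timezone
from typing import Optional

OOV_DEVICE = -1  # same sentinel used when masking device type during training

def prepare_context(device_type: Optional[int] = None,
                    timestamp: Optional[datetime] = None) -> dict:
    """Impute missing context, then derive the four cyclical time features."""
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)  # fall back to current server time
    if device_type is None:
        device_type = OOV_DEVICE                # fall back to the OOV sentinel

    hour_angle = 2 * math.pi * timestamp.hour / 24
    dow_angle = 2 * math.pi * timestamp.weekday() / 7
    return {
        "device_type": device_type,
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(dow_angle),
        "dow_cos": math.cos(dow_angle),
    }

# A request at 06:00 on Monday, 1 January 2024, with no device type supplied.
ctx = prepare_context(timestamp=datetime(2024, 1, 1, 6, 0))
```

The sine/cosine pair keeps 23:00 and 00:00 close together, which a raw hour integer would not.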

In the user path, feast_user_lookup fetches the user features from the online feature store (backed by ElastiCache for Valkey), then nvt_user_transform transforms the features using the user workflow before passing them to the query tower (query_tower). The query tower produces the user embedding, which faiss_retrieval uses to perform similarity search, returning the topK item IDs.

Handling user cold-start

When a user ID is not found in the online feature store, feast_user_lookup uses defaults, i.e., user_id = -1, age = the training mean, gender = -1, and top_category = -1. The nvt_user_transform maps the user_id, gender, and top_category sentinels to their OOV indices and the mean age to the normalized value and categorified age bucket. Then the query_tower generates the user embedding from the transformed features. Although faiss_retrieval returns the same popularity-biased candidates for unknown users, the DLRM ranker can still personalize the candidate ordering using the available context.
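The fallback logic amounts to a lookup with defaults. In the sketch below a plain dictionary stands in for the Feast online store, and the mean age value is a made-up placeholder for the number the pipeline injects into the model config.

```python
TRAINING_MEAN_AGE = 35.0  # hypothetical placeholder for the injected training mean

DEFAULT_USER = {
    "user_id": -1,
    "age": TRAINING_MEAN_AGE,
    "gender": -1,
    "top_category": -1,
}

def lookup_user_features(user_id: int, online_store: dict) -> dict:
    """Return stored features for a known user, or cold-start defaults otherwise."""
    features = online_store.get(user_id)
    if features is None:
        return dict(DEFAULT_USER)  # copy so callers cannot mutate the defaults
    return {"user_id": user_id, **features}

# A dict stands in for the online feature store in this sketch.
toy_store = {1009: {"age": 28.0, "gender": 1, "top_category": 4}}

known_user = lookup_user_features(1009, toy_store)
unknown_user = lookup_user_features(12345678, toy_store)
```

Because the same sentinels were seen during training through masking, the downstream transform and models receive inputs they already know how to handle.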

Seen-items filtering with a Bloom Filter

The candidate item IDs are checked against a Bloom filter in ElastiCache for Valkey. This step can eliminate a significant number of candidates; therefore, over-fetching at the retrieval stage is important because it ensures the ranker receives enough candidates to produce a meaningful recommendation list.
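For readers unfamiliar with the data structure, here is a small in-process Bloom filter and the candidate filtering step. It is a teaching sketch only: the deployed system keeps its filters in Valkey, and the bit-array size, hash count, and item IDs below are arbitrary.

```python
import hashlib

class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Items the user has already interacted with.
seen = BloomFilter()
for item_id in (101, 205):
    seen.add(item_id)

# Drop any retrieved candidate that the filter reports as seen.
candidates = [101, 150, 205, 300, 412]
unseen = [c for c in candidates if c not in seen]
```

A false positive only hides an item the user has not actually seen, which is a mild cost compared with the memory saved by not storing full interaction histories.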

The filtered item IDs enter the item feature pipeline, where feast_item_lookup retrieves the item features from the online feature store, nvt_item_transform transforms these features using the item workflow, and multimodal_embedding_lookup returns the pretrained CLIP (image) and Sentence-BERT (text) embeddings for the items.

*Figure 11: RedisInsight UI showing Bloom filter keys (items) saved in ElastiCache, each with a 6-day TTL.* (image by author)

Ranking and ordering

The unroll_features model tiles the user and context features to match the retrieval candidate size. Then the DLRM ranker (dlrm_ranking) scores the candidates. In softmax_sampling, if DIVERSITY_MODE is disabled, the model returns the topK candidates by descending score; if it is enabled, the model uses score-based weighted sampling without replacement to select a diverse topK while still favoring higher-scoring items. Finally, item_id_decoder maps the ordered candidate IDs (NVTabular indices) back to the original item IDs, and Triton returns the selected item IDs together with their corresponding scores.
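The two selection modes can be sketched as follows. This assumes the sampling weights are a softmax over the ranker scores, which the model name suggests but the post does not confirm; the function name and toy inputs are hypothetical.

```python
import numpy as np

def select_top_k(item_ids, scores, k, diversity_mode=False, rng=None):
    """Pick k items: strict top-k by score, or score-weighted sampling without replacement."""
    item_ids = np.asarray(item_ids)
    scores = np.asarray(scores, dtype=float)
    if not diversity_mode:
        order = np.argsort(-scores)[:k]
    else:
        if rng is None:
            rng = np.random.default_rng()
        weights = np.exp(scores - scores.max())   # softmax numerator, stabilized
        probs = weights / weights.sum()
        order = rng.choice(len(scores), size=k, replace=False, p=probs)
    return item_ids[order], scores[order]

toy_ids = [10, 20, 30, 40]
toy_scores = [0.1, 0.9, 0.5, 0.3]

top_ids, top_scores = select_top_k(toy_ids, toy_scores, k=2)
diverse_ids, _ = select_top_k(toy_ids, toy_scores, k=2, diversity_mode=True,
                              rng=np.random.default_rng(0))
```

With diversity mode on, repeated requests can surface different items while higher-scoring candidates remain the most likely picks.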

Improving item feature lookup latency with in-memory caching

Server profiling with Triton Performance Analyzer at a retrieval size of 300 revealed that feast_item_lookup consumed 195 ms, roughly 52% of total request latency at concurrency=1. Under load, the queue time ballooned from 36 ms (at concurrency=1) to 988 ms (at concurrency=4). This capped throughput at 2.9 inferences per second regardless of how many concurrent requests were issued.

*Figure 12a: Optimizing feature lookup latency with caching* (image by author)

The bottleneck was feast_item_lookup fetching features for 300 candidates from Feast's online store on every request. To alleviate this, Feast calls for item features were replaced with an in-process NumPy array cache. Essentially, at feast_item_lookup initialization, all item features are fetched once from Feast and stored as NumPy arrays indexed by item ID, so every request reads features from memory instead of making network calls to the online feature store. This optimization reduced the feast_item_lookup latency by about 99.7% and the end-to-end latency by 54% (at concurrency=1). Also, the throughput (at concurrency=4) improved by 310%. The one trade-off is that the cached features only refresh on Triton restart; however, for a catalog with fairly static item attributes, this is not problematic.
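A minimal sketch of such a cache is shown below. It assumes item IDs are small non-negative integers, so they can index arrays directly, and it uses two hypothetical features (`price`, `category`); the rows passed to the constructor stand in for the one-time fetch from Feast.

```python
import numpy as np

class ItemFeatureCache:
    """Hold all item features in NumPy arrays indexed by item ID."""

    def __init__(self, rows):
        # rows: iterable of (item_id, price, category), fetched once at startup.
        rows = list(rows)
        max_id = max(item_id for item_id, _, _ in rows)
        self.price = np.zeros(max_id + 1, dtype=np.float32)
        self.category = np.full(max_id + 1, -1, dtype=np.int64)  # -1 marks missing IDs
        for item_id, price, category in rows:
            self.price[item_id] = price
            self.category[item_id] = category

    def lookup(self, item_ids):
        # One vectorized read per feature; no network call on the request path.
        idx = np.asarray(item_ids, dtype=np.int64)
        return {"price": self.price[idx], "category": self.category[idx]}

cache = ItemFeatureCache([(0, 9.5, 2), (3, 20.0, 1)])
batch_features = cache.lookup([3, 0])
```

Fetching features for 300 candidates then becomes a single fancy-indexing operation per feature array.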

*Figure 12b: Latency results before and after in-memory feature caching* (image by author)

After this change, the three NVTabular transform models, nvt_user_transform (72 ms), nvt_item_transform (41 ms), and nvt_context_transform (39 ms), accounted for roughly 88% of the remaining latency. Further model optimizations are deferred to a future version of this project.

Autoscaling the Triton Inference Server on EKS

In this project, the Triton Inference Server is autoscaled via the Kubernetes Horizontal Pod Autoscaler (HPA) based on a custom metric: the average time (in milliseconds) that each request spent waiting in the queue over the last 30 seconds. When this latency exceeds the target, the HPA scales up the Triton deployment by increasing the desired pod replica count. If the new Triton pod cannot be scheduled because no GPU node has capacity for it, Karpenter provisions a new GPU node and adds it to the cluster. Once the node becomes available, the Kubernetes scheduler places the Triton pod on it. Once the new pod is ready, the load balancer can begin routing traffic to it.
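The replica count the HPA aims for follows the documented rule `desired = ceil(current * currentMetric / targetMetric)`, skipped when the ratio is within a small tolerance of 1.0. The sketch below reproduces that rule; the queue-time values and the replica bounds are made-up examples, not the project's settings.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 4,
                     tolerance: float = 0.1) -> int:
    """Replica count from the HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling action
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Hypothetical queue times in milliseconds against a 100 ms target.
scale_up = desired_replicas(1, current_metric=200, target_metric=100)
no_change = desired_replicas(2, current_metric=105, target_metric=100)
scale_down = desired_replicas(2, current_metric=30, target_metric=100)
capped = desired_replicas(2, current_metric=1000, target_metric=100)
```

When the desired count exceeds what the existing GPU nodes can host, the extra pod stays pending until Karpenter adds a node.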

*Figure 13: Autoscaling Triton Inference Server with K8s HPA and Karpenter* (image by author)

Validating contextual recommendations, Bloom filter filtering, and near real-time recommendation updates

To validate the system, diversity mode was turned off during deployment to isolate its effect from those of context types, Bloom filter filtering, and preference shift on recommendations.

Validating contextual recommendations

To validate contextual recommendations, I experimented with several request types, including requests with only a user ID and requests that combined the user ID with contextual features such as device type and timestamp. These tests showed that recommendations for unknown users vary with context. A cold-start user can receive different ranked item lists depending on the device type and request time. For existing users, the effect of context was less pronounced. The overall ranking remained largely stable across contexts, although the output scores varied.

A demo of context effects on recommendations for an existing user (user ID = 1009) and a new user (user ID = 12345678). Video by author.

Validating Bloom filter seen-items filtering

To validate seen-item exclusion by the Bloom filter, several items from the Recommended for You carousel were clicked. These items were excluded from subsequent recommendations by the Bloom filter. To avoid shifting the user's inferred preference and confounding the Bloom filter test, click items from different categories.

In the video demonstrating the Bloom filter filtering, we observe that clicked items such as Decadent Chocolate Dream Cake and Classic Explorer's Canvas Backpack are excluded from User 12345678's subsequent recommendations.

Video demonstration of the Bloom filter excluding previously interacted items (video by author).

Validating near real-time recommendation updates

To validate near real-time recommendation updates for existing users, the test begins by fetching recommendations for a user to establish the user's current preference. This is followed by clicking several items from the same category, for example, items belonging only to Accessories, Furniture, or Groceries, then waiting about 5 seconds for the updates to take effect. The repeated interactions with items in the same category can shift the user's inferred preference if that category differs from the user's current top_category. The top_category feature represents the dominant category among the items a user has interacted with during the past 24 hours and is recomputed after each interaction. On the next request, the model can rank items from that newly expressed interest category higher and surface them among the top recommendations.
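The top_category recomputation described above can be sketched as a windowed majority count. The function name, the interaction tuple layout, and the `-1` default for users with no recent activity are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def compute_top_category(interactions, now, window=timedelta(hours=24), default=-1):
    """Return the most frequent category among interactions inside the time window.

    interactions: iterable of (timestamp, category) pairs.
    """
    recent = [category for ts, category in interactions if now - window <= ts <= now]
    if not recent:
        return default
    return Counter(recent).most_common(1)[0][0]

now = datetime(2024, 1, 2, 12, 0)
history = [
    (now - timedelta(hours=30), "Accessories"),  # outside the 24-hour window
    (now - timedelta(hours=3), "Accessories"),
    (now - timedelta(hours=2), "Furniture"),
    (now - timedelta(hours=1), "Furniture"),
]

top_category = compute_top_category(history, now)
```

Two recent Furniture clicks outweigh the single recent Accessories click, so the feature flips and the next request reflects the new interest.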

In the video demonstrating live changes in recommendations, we notice User 1003's top recommendations change from Accessories to Home Decor (and Furniture) as a result of repeated interactions with items in the Furniture category.

Demonstration of real-time ranking changes triggered by shifts in user preference signals (video by author)

Note, however, that the top_category feature is a crude approximation of short-term interest used to demonstrate the system's ability to adapt to user behavior in real time. For richer short-term interest modeling, the next iteration of this project would replace the static query tower with a session-based transformer encoder.

Limitations and Future Work

In the current architecture, request-side context, such as device type and timestamp-derived features, is used only by the ranker. This was an implementation choice to keep the retrieval simple, since adding context at retrieval time would require computing additional features during candidate generation. However, if request context influences which items should be retrieved, relevant candidates may be filtered out before the ranker sees them.

A future direction is to add request-side context features to the query tower, so both retrieval and ranking become context-aware. Another direction is to replace the current query tower with a session encoder, which can more faithfully capture short-term user behaviour than the current behavioural feature approximation (i.e., top_category).

Conclusion

This post walked through a multistage multimodal recommender system for an ecommerce use case, deployed on Amazon EKS. The system combines Two-Tower candidate retrieval, context-aware DLRM ranking, and score-based diversity reranking. The system uses tabular user and item features, multimodal embeddings based on product images and text descriptions, and context information.

Cold-start is addressed by feature masking during training, which forces the models to rely on a learned OOV embedding and context signals when the user is new or unknown. This means anonymous and new users receive recommendations that adapt to their device type and the time of their request, rather than a static fallback list. Bloom filters prevent already-seen items from resurfacing across repeated sessions, and in-memory caching of item features helped resolve the latency bottleneck at the item feature lookup stage. Also, real-time adaptation of the system to changing behavioral signals is demonstrated via the top_category feature.

On the MLOps side, two Kubeflow pipelines manage the system lifecycle: one pipeline for full training and deployment, and the other for daily fine-tuning of the query tower and ranker without rebuilding the item embedding index. Karpenter and Kubernetes HPA handle compute scaling in response to request load.

The system shows a production-style recommender system in which a retrieval stage optimized for speed and recall is combined with a ranking stage optimized for precision, and an infrastructure layer designed to keep models updated without full retraining on every cycle. Please find the full code in this repository: MustaphaU/multistage-recommender-system-on-kubernetes

I hope you enjoyed reading this! I look forward to your questions.

Resources

Mustapha Unubi Momoh, Multistage Multimodal Recommender System on Kubernetes, GitHub repository. Available: https://github.com/MustaphaU/multistage-recommender-system-on-kubernetes
Even Oldridge and Karl Byleen-Higley, “Recommender Systems, Not Just Recommender Models,” NVIDIA Merlin (Medium), Apr. 2022. Available: https://medium.com/nvidia-merlin/recommender-systems-not-just-recommender-models-485c161c755e
Radek Osmulski, “Exploring Production-Ready Recommender Systems with Merlin,” NVIDIA Merlin (Medium), Jul. 2022. Available: https://medium.com/nvidia-merlin/exploring-production-ready-recommender-systems-with-merlin-66bba65d18f2
Jacopo Tagliabue, Hugo Bowne-Anderson, Ronay Ak, Gabriel de Souza Moreira, and Sara Rabhi, “NVIDIA Merlin Meets the MLOps Ecosystem: Building a Production-Ready RecSys Pipeline on Cloud,” NVIDIA Merlin (Medium), Feb. 2023. Available: https://medium.com/nvidia-merlin/nvidia-merlin-meets-the-mlops-ecosystem-building-a-production-ready-recsys-pipeline-on-cloud-1a16c156166b
Benedikt Schifferer, “Solving the Cold-Start Problem Using Two-Tower Neural Networks for NVIDIA's E-Mail Recommender Systems,” NVIDIA Merlin (Medium), Jan. 2023. Available: https://medium.com/nvidia-merlin/solving-the-cold-start-problem-using-two-tower-neural-networks-for-nvidias-e-mail-recommender-2d5b30a071a4
Ziyou “Eugene” Yan, “System Design for Recommendations and Search,” eugeneyan.com, Jun. 2021. Available: https://eugeneyan.com/writing/system-design-for-discovery/
Haoran Yuan and Alejandro A. Hernandez, “User Cold Start Problem in Recommendation Systems: A Systematic Review,” IEEE Access, vol. 11, pp. 136958–136977, 2023. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10339320
Jeremy Wortz and Jordan Totten, “Scaling Deep Retrieval with TensorFlow Recommenders and Vertex AI Matching Engine,” Google Cloud Blog, Apr. 19, 2023. Available: https://cloud.google.com/blog/products/ai-machine-learning/scaling-deep-retrieval-tensorflow-two-towers-architecture
Sam Partee, Tyler Hutcherson, and Nathan Stephens, “Offline to Online: Feature Storage for Real-time Recommendation Systems with NVIDIA Merlin,” NVIDIA Technical Blog, Mar. 1, 2023. Available: https://developer.nvidia.com/blog/offline-to-online-feature-storage-for-real-time-recommendation-systems-with-nvidia-merlin/