Storage architecture for distributed training
In distributed training, multiple GPUs on a single node or across multiple nodes work together to train a single model. This requires massive…
7h ago
The art of setting learning rate
The learning rate is a training hyperparameter that has a small positive value between 0.0 and 1.0 (e.g. 1e-5). During training, the…
Nov 29
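To make the excerpt above concrete, here is a minimal PyTorch-style sketch of where the learning rate is set; the tiny model, the random batch, and the choice of AdamW with lr=1e-5 are illustrative placeholders, not details from the article.

import torch

# Hypothetical tiny model and batch, used only to show where the learning rate goes.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr: the learning-rate hyperparameter

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()      # compute gradients
optimizer.step()     # each parameter update is scaled by the learning rate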
Published in Better ML
Training Small Language Models on a Budget
List of training optimizations!
Apr 20
Published in Better ML
Will it scale to N GPUs?
Multi-node training & cluster network bandwidth
Apr 1
Published in Better ML
Perf model cards
Model cards are metadata for trained ML models that provide benchmarked evaluation and performance characteristics. They are an effective…
Feb 12
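As a rough sketch of the idea in the excerpt above, a model card can be thought of as structured metadata attached to a trained model; every field and number below is a hypothetical placeholder, not a value from the article.

# Hypothetical model-card metadata; all fields and values are illustrative only.
model_card = {
    "model_name": "example-classifier-v1",
    "training_data": "illustrative dataset snapshot",
    "evaluation": {"accuracy": 0.91, "f1": 0.88},                   # benchmarked evaluation metrics
    "performance": {"p50_latency_ms": 12, "throughput_qps": 350},   # serving performance characteristics
    "intended_use": "illustration only",
}
print(model_card["evaluation"])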
Published in Better ML
LLM serving challenges
Discussing the unique serving challenges of LLMs
Jan 27