[GUEST ACCESS MODE: Data is scrambled or limited to provide examples. Make requests using your API key to unlock full data. Check https://lunarcrush.ai/auth for authentication information.]
Nathan Barry posts on X most about take the, send this, gpu, and number of. They currently have XXX followers and XX posts still getting attention, totaling XXX engagements in the last XX hours.
Social category influence: currencies, stocks, technology, brands
Social topic influence: take the, send this, gpu, number of, $googl, faster, meta, hardware
Top assets mentioned: Alphabet Inc Class A (GOOGL)
Top posts by engagements in the last XX hours
"Research Log Day 0: DiLoCo Days I decided to a thesis around distributed low-communication training. Essentially how can we train large models efficiently across distributed nodes and not be utterly destroyed by network latency and bandwidth (1/n)"
X Link @nathanbarrydev 2025-10-17T00:25Z XXX followers, XXX engagements
"The main approach currently is Local SGD. We have M distributed workers. Each take H local optimization steps. After each worker finishes their H steps we take the distance of their starting and ending parameter state to get an "outer-gradient". (2/n)"
X Link @nathanbarrydev 2025-10-17T00:28Z XXX followers, XX engagements
"We update the original weights with the outer-gradient using an outer-optimizer and send this updated weight back to each worker. The image in the first tweet is pseudo-code for the Local SGD / DiLoCo algorithm (3/n)"
X Link @nathanbarrydev 2025-10-17T00:29Z XXX followers, XX engagements
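The pseudo-code image referenced in (3/n) isn't reproduced here, so below is a minimal sketch of one Local SGD / DiLoCo round as described in (2/n)-(3/n). It assumes PyTorch; the AdamW inner optimizer, plain-SGD outer step, MSE loss, and all hyperparameters are illustrative assumptions rather than details from the thread.

```python
import copy
import torch
import torch.nn.functional as F

def diloco_round(global_model, workers_data, H, inner_lr=1e-3, outer_lr=0.7):
    """One communication round: each of the M workers takes H local steps,
    then the averaged outer-gradient (start params minus end params)
    is applied to the global weights by the outer optimizer."""
    global_params = [p.detach().clone() for p in global_model.parameters()]
    outer_grad = [torch.zeros_like(p) for p in global_params]

    for worker_batches in workers_data:               # M distributed workers
        local_model = copy.deepcopy(global_model)     # start from the global weights
        inner_opt = torch.optim.AdamW(local_model.parameters(), lr=inner_lr)
        for x, y in worker_batches[:H]:               # H local optimization steps
            inner_opt.zero_grad()
            F.mse_loss(local_model(x), y).backward()  # placeholder loss
            inner_opt.step()
        # outer-gradient: difference between starting and ending parameters,
        # averaged across workers
        for g, p0, p1 in zip(outer_grad, global_params, local_model.parameters()):
            g.add_((p0 - p1.detach()) / len(workers_data))

    # outer optimizer: a plain SGD step here for simplicity
    # (the DiLoCo paper uses Nesterov momentum)
    with torch.no_grad():
        for p, g in zip(global_model.parameters(), outer_grad):
            p.add_(g, alpha=-outer_lr)
    return global_model
```

Here workers_data is assumed to be a list of per-worker lists of (x, y) batches; with an outer learning rate of 1 and no momentum, the update reduces to simply averaging the workers' final parameters.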
"This algorithm was originally developed for Federated Learning with computing using edge devices in mind. Recent work has attempted to get this to work in data center settings to train LLMs where each worker might be a giant GPU cluster (4/n)"
X Link @nathanbarrydev 2025-10-17T00:31Z XXX followers, XX engagements
"Training across data centers using naive DDP isn't possible due to the network latency and bandwidth across geo-distributed clusters. Local SGD / DiLoCo allows us to significantly reduce the total amount of communication substantially (5/n)"
X Link @nathanbarrydev 2025-10-17T00:33Z XXX followers, XX engagements
"There are many variants of DiLoCo currently most which try and improve it by minimizing communication cost even more by masking it with computation. One of the problems is that each method introduces a different kind of staleness which negatively effects convergence (6/n)"
X Link @nathanbarrydev 2025-10-17T00:35Z XXX followers, XX engagements
"I have two current directions. The first is to study the impact of heterogeneous workers. The Async Local SGD paper tested giving each workers a different number of inner steps based on relative processing speed. A worker half as fast takes half as many inner steps (7/n)"
X Link @nathanbarrydev 2025-10-17T00:37Z XXX followers, XX engagements
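A tiny illustration of the scheduling rule described in (7/n): allot inner steps in proportion to each worker's relative processing speed so that all workers finish a round in roughly the same wall-clock time. The speed values and base step count below are made-up examples.

```python
def inner_steps_per_worker(relative_speeds, base_steps=512):
    # the fastest worker gets base_steps; slower workers get proportionally fewer
    fastest = max(relative_speeds)
    return [max(1, round(base_steps * s / fastest)) for s in relative_speeds]

# a worker half as fast as the fastest takes half as many inner steps
print(inner_steps_per_worker([1.0, 0.5, 0.8]))  # [512, 256, 410]
```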
"Research Log Day 2: HALoS Communication between regions is drastically worse than within a region. Most previous DiLoCo variants didn't take this into account. HALoS (Hierarchical Async Local SGD) fixes this by adding local and global parameter servers (1/n)"
X Link @nathanbarrydev 2025-10-17T21:05Z XXX followers, XXX engagements
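A very rough structural sketch of the hierarchy described in (1/n): workers push updates to a local parameter server within their region over fast links, and local servers only occasionally sync with a global parameter server over the slow cross-region links. The class names, merge rule, and sync period are illustrative assumptions, not the actual HALoS update rules.

```python
from dataclasses import dataclass

@dataclass
class GlobalServer:
    params: list  # global parameter state (toy scalar list)

    def merge(self, local_params, weight=0.5):
        # illustrative merge: interpolate toward the local server's state
        self.params = [(1 - weight) * g + weight * l
                       for g, l in zip(self.params, local_params)]
        return list(self.params)

@dataclass
class LocalServer:
    params: list               # region-local parameter state
    updates_since_sync: int = 0

    def apply_worker_update(self, outer_grad, lr=1.0):
        # cheap intra-region communication: apply one worker's outer-gradient
        self.params = [p - lr * g for p, g in zip(self.params, outer_grad)]
        self.updates_since_sync += 1

    def maybe_sync(self, global_server, every=4):
        # expensive cross-region communication happens only every `every` updates
        if self.updates_since_sync >= every:
            self.params = global_server.merge(self.params)
            self.updates_since_sync = 0
```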
"(1/n) A desirable problem to solve is being able to train on heterogeneous hardware. Even within the same generation NVIDIA B300 GPUs are XX% faster than B200s. Companies with many clusters (Meta Google etc) would ideally be able to train a model across their clusters regardless of the underlying hardware"
X Link @nathanbarrydev 2025-10-18T23:20Z XXX followers, XX engagements