mohit (@mohitwt_)'s Twitter Profile
mohit

@mohitwt_

19 • dl • stay hard

ID: 1929833932427874304

Link: https://pluto0.streamlit.app/ · Joined: 03-06-2025 09:35:02

1.1K Tweets

244 Followers

122 Following

mohit (@mohitwt_):

wrote my first CUDA, a simple c = a + b:

what it's doing:
> allocate CPU mem and fill with input data
> allocate GPU mem
> copy input data CPU to GPU
> gpu kernel computes
> copy result back gpu to cpu
> print results from CPU
> free GPU mem
😭
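the steps above map almost one-to-one onto host code. a minimal sketch of that first program (illustrative, not the exact code from the tweet):

```cuda
// Minimal CUDA vector add: c = a + b, one element per thread.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against overshoot
}

int main() {
    const int N = 1 << 10;
    size_t bytes = N * sizeof(float);

    // allocate CPU mem and fill with input data
    float *h_a = new float[N], *h_b = new float[N], *h_c = new float[N];
    for (int i = 0; i < N; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

    // allocate GPU mem
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // copy input data CPU -> GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // gpu kernel computes
    vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);

    // copy result back GPU -> CPU, then print from CPU
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[1] = %f\n", h_c[1]);

    // free GPU mem
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
}
```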
mohit (@mohitwt_):

matrix addition in CUDA, the first picture is launching 1 block of NxN threads.
~ each thread computes one element of the matrix

> dim3 threadsPerBlock(N,N);
> MatAdd<<<1, threadsPerBlock>>>(d_A, d_B, d_C, N);

this works fine for small matrices; i used N=4, which is only 16 threads (a single block caps out at 1024 threads, so large N needs a grid of blocks)
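a sketch of the kernel behind that launch, assuming d_A/d_B/d_C are device buffers already allocated and filled as in the vector-add example (illustrative, not the exact code from the pictures):

```cuda
// C = A + B with one block of N x N threads, one thread per element.
__global__ void MatAdd(const float* A, const float* B, float* C, int N) {
    int col = threadIdx.x;  // x index within the block
    int row = threadIdx.y;  // y index within the block
    C[row * N + col] = A[row * N + col] + B[row * N + col];
}

// launch config from the tweet: a single block of N x N threads
// dim3 threadsPerBlock(N, N);
// MatAdd<<<1, threadsPerBlock>>>(d_A, d_B, d_C, N);
```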
Raj Nair (@rajnair06):

(1/n)
Building Learnflow
-implemented throttling for Free/Premium users
10/day for free and 1000/day for premium (for the sake of testing)
-cached monthly and weekly progress summaries
-benchmark tests for cached vs non-cached responses, insane diff
more info below
Raj Nair (@rajnair06):

Building LearnFlow
-learnt about background jobs and queues
-used Celery as the task queue system and Redis as the message broker
-offloaded the goal-inactivity reminder email task to the background through this
more details below
mohit (@mohitwt_):

A Detailed Explanation of CUDA Thread Hierarchy
(Threads, Blocks, and Grids):

A CUDA thread is the smallest unit of execution on the GPU, similar to a CPU thread but designed for massive parallelism. each thread has its own registers and runs the same kernel code independently,
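the hierarchy shows up in code through three built-in variables: threadIdx (position within the block), blockIdx (position within the grid), and blockDim (threads per block). a tiny illustrative kernel (not from the original thread):

```cuda
// Each thread combines its block and thread coordinates into a unique
// global index, even though every thread runs this same kernel code.
__global__ void whoAmI(int* global_ids) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // unique across the grid
    global_ids[gid] = gid;
}

// e.g. whoAmI<<<4, 256>>>(d_ids); launches a grid of 4 blocks x 256 threads,
// so gid runs from 0 to 1023.
```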
mohit (@mohitwt_):

update on my framework:

> started implementation of the Autograd Engine:
the autograd engine has a few key tasks:
track tensor dependencies, compute gradients automatically, and backpropagate through operations.

> added a computation graph to the framework:
forward pass works for
mohit (@mohitwt_):

A short post on CUDA Programming Model:

the CUDA programming model is the way cuda lets developers actually write programs for massively parallel processors (hardware that runs thousands of threads at the same time)

according to the NVIDIA CUDA c++ guide, cuda is built around three
mohit (@mohitwt_):

made a medium account where i’ll be sharing detailed explanations about my framework. i’ll keep posting small progress updates on X, things like new features i added or quick milestones, but on medium, i’ll go deep into the actual implementation.

i’ll be breaking down how each
mohit (@mohitwt_):

ran a distilgpt2 model locally, tested with different prompts, tweaked generation parameters to observe variations in output. I'll share more on this tmr, gn.