Dr Pawd (@drpawd) Twitter Tweets • TwiCopy

Abhinav Upadhyay

2 years ago

A couple more people started supporting me after this post. I am very grateful. But there were also a few failed attempts. This frequently happens because of Stripe and Indian regulations. There's an alternate way to support my work as well: You can buy me coffee(s) or become a

Likes: 32 | Replies: 2 | Reposts: 5

Abhinav Upadhyay

@abhi9u

2 years ago

I found two optimizations that CPython has made to improve the performance of its bytecode interpreter and to reduce the cost of branch mispredictions when executing bytecode. Every bytecode interpreter (VM) is implemented using a giant switch case inside a loop. The

Likes: 150 | Replies: 5 | Reposts: 20
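To make the "giant switch case inside a loop" concrete, here is a minimal sketch of a toy stack VM. The opcode names and the sample program are invented for illustration, and Python's if/elif chain stands in for a C switch. The tweet is cut off before it names the two optimizations, so this shows only the baseline dispatch loop they would improve on.

```python
# Toy stack-based VM: one loop, one dispatch chain (the "giant switch").
# Opcode names and the program below are made up for illustration.
PUSH, ADD, MUL, HALT = range(4)

def run(program):
    stack = []
    pc = 0  # index of the next instruction
    while True:
        opcode, arg = program[pc]
        pc += 1
        # Every iteration funnels through this single dispatch point.
        if opcode == PUSH:
            stack.append(arg)
        elif opcode == ADD:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == MUL:
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {opcode}")

# Encodes (2 + 3) * 4
program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None), (HALT, None)]
result = run(program)
```

Because every instruction passes through the same dispatch point, the hardware has a single indirect branch to predict for all opcodes, which is the cost the tweet refers to.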

Abhinav Upadhyay

@abhi9u

2 years ago

Admittedly, I hate floating-points. But this really got me curious. If you tried the same tests in C or Java, you would get the expected results. But why does Python fail in the 2nd test? The answer goes back to its implementation details. Languages like C or Java have implicit

Likes: 134 | Replies: 3 | Reposts: 16
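The tests themselves are not shown in this truncated tweet. As an assumption, the sketch below uses a comparison between a large integer and a float, one well-known case where Python's behavior differs from languages that implicitly convert the integer to a double first. The value `2**53 + 1` is chosen for illustration because a double cannot represent it exactly.

```python
big = 2**53 + 1          # 9007199254740993, not representable as a double
as_float = float(big)    # rounds to the nearest double

# Python compares the exact mathematical values of the int and the float.
exact_compare = (big == as_float)

# Converting the int first (what an implicit conversion would do) loses
# the low bit, so both sides become the same double.
lossy_compare = (float(big) == as_float)
```

Here `exact_compare` is False while `lossy_compare` is True, which is the kind of mismatch that can look like a failure in Python.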

Abhinav Upadhyay

@abhi9u

2 years ago

On the topic of profilers, I am doing a live session tomorrow on the internals of remote sampling profilers for a language like Python. blog.codingconfessions.com/p/live-session…

Likes: 22 | Replies: 0 | Reposts: 3

Abhinav Upadhyay

@abhi9u

2 years ago

I made a short video giving an overview of profiling in Python. It covers tracing vs sampling profilers, and also gives a quick demo of a few profilers, including cProfile, py-spy, and perf! blog.codingconfessions.com/p/python-profi…

Likes: 147 | Replies: 0 | Reposts: 33
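For readers who have not used a tracing profiler, here is a minimal sketch of cProfile from the standard library. The `busy` function is a made-up workload, and this is not the demo from the video.

```python
import cProfile
import io
import pstats

def busy(n):
    # Hypothetical workload, present only so there is something to profile.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()       # start recording every call and return
busy(50_000)
profiler.disable()

# Render the five most expensive entries by cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
```

A tracing profiler like this records every call, which gives exact call counts at the price of overhead. A sampling profiler instead inspects the stack at intervals.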

Abhinav Upadhyay

@abhi9u

2 years ago

Last Sunday, we did a live session on the internals of a remote sampling profiler (for Python). These profilers work by attaching to a running process, reading its memory on demand and extracting the stack trace of the currently executing code. We covered the following details:

Likes: 43 | Replies: 0 | Reposts: 8
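A remote profiler reads another process's memory, which is beyond a short example. As a stand-in, the sketch below takes one stack sample of another thread inside the same process using `sys._current_frames()`. The helper names `capture_stack` and `sample_thread` are hypothetical, and this illustrates only the "extract the stack trace" step, not the session's implementation.

```python
import sys
import threading

def capture_stack(frame):
    """Walk a frame chain outward; return function names, innermost first."""
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names

def sample_thread(thread_id):
    """Take one sample of another thread's current stack."""
    frame = sys._current_frames().get(thread_id)
    return capture_stack(frame)

started = threading.Event()
release = threading.Event()

def inner():
    started.set()
    release.wait()  # park here so the sample is predictable

def outer():
    inner()

worker = threading.Thread(target=outer)
worker.start()
started.wait()
sample = sample_thread(worker.ident)
release.set()
worker.join()
```

A real sampling profiler repeats this at a fixed interval and aggregates the samples. A remote one does the same walk by decoding interpreter structures from the target process's memory.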

Abhinav Upadhyay

@abhi9u

2 years ago

Today I published a comprehensive 5000-word article on the design & implementation of the GC in CPython. Took me many weeks to get this out. Here's a summary: CPython primarily uses reference counting for GC. Every object maintains a reference count in its header and the runtime

Likes: 298 | Replies: 7 | Reposts: 49
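Reference counting can be observed from Python itself. The sketch below, which assumes CPython, uses `sys.getrefcount` and a weak reference to show the count rising with a new strong reference and the object being freed as soon as the last one goes away. `Node` and `demo` are illustrative names.

```python
import sys
import weakref

class Node:
    pass

def demo():
    node = Node()
    probe = weakref.ref(node)        # weak references do not bump the count
    before = sys.getrefcount(node)
    holders = [node]                 # one extra strong reference
    after = sys.getrefcount(node)
    del holders, node                # drop every strong reference
    freed_immediately = probe() is None
    return after - before, freed_immediately

delta, freed_immediately = demo()
```

The absolute numbers returned by `sys.getrefcount` vary between CPython versions, so the sketch relies only on the difference between the two readings.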

Abhinav Upadhyay

@abhi9u

2 years ago

In my next live session, I will discuss how hyper-threading works at the microarchitecture level, right from instruction fetch/decode to scheduling and execution. Details here: blog.codingconfessions.com/p/live-session…

Likes: 139 | Replies: 3 | Reposts: 19

Abhinav Upadhyay

@abhi9u

2 years ago

A data structure is not just about the theoretical space and time complexity. To achieve its full potential on a real computer, you also need to implement it with mechanical sympathy for the hardware. Hash tables are a very popular data structure, which power higher level data

Likes: 106 | Replies: 0 | Reposts: 12

Abhinav Upadhyay

@abhi9u

2 years ago

Simultaneous multithreading (SMT) enables the processor to execute instructions for two threads simultaneously. But why was it needed and how does it work? It was needed to improve the resource utilization of the processor. Processors are capable of executing many instructions

Likes: 189 | Replies: 5 | Reposts: 31

Abhinav Upadhyay

@abhi9u

2 years ago

Many people pointed out that function call overhead is the reason for this. That is true. Function calls are expensive because they require setting up a stackframe in the interpreter, and passing the arguments. However, Python has had many performance improvements in recent

Likes: 314 | Replies: 10 | Reposts: 34
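The benchmark that prompted this thread is not shown in the truncated tweet. As a rough sketch of how function call overhead can be measured, the code below times the same loop with and without a call, using `timeit`. The functions and the loop size are made up for illustration.

```python
import timeit

def add(a, b):
    return a + b

def with_call(n):
    total = 0
    for i in range(n):
        total = add(total, i)   # one interpreter-level call per iteration
    return total

def inlined(n):
    total = 0
    for i in range(n):
        total = total + i       # same work, no call
    return total

N = 10_000
call_time = timeit.timeit(lambda: with_call(N), number=20)
inline_time = timeit.timeit(lambda: inlined(N), number=20)
```

The gap between `call_time` and `inline_time` depends heavily on the CPython version, so compare the two numbers on your own interpreter rather than assuming a fixed ratio.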

Abhinav Upadhyay

@abhi9u

2 years ago

As promised, I wrote an analysis of the cost of function calls, builtin calls and inlined code in Python using microbenchmarks. I explain in detail what recent changes in CPython have improved the perf in these areas and how. I try to connect the dots between the slow parts

Likes: 288 | Replies: 3 | Reposts: 46

Abhinav Upadhyay

@abhi9u

2 years ago

The Linux kernel's implementation of a context switch between two threads on x86:
1. Save registers on the previous task's stack
2. Switch stack pointers
3. Restore registers from the new task's stack
Interesting to see the code to prevent attacks due to return stack buffer (RSB)

Likes: 324 | Replies: 2 | Reposts: 44
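The three steps can be modeled in a few lines. This is a toy simulation, not kernel code: `Task`, `Cpu`, and the register names are illustrative, a Python list stands in for each task's kernel stack, and the RSB mitigation is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    stack: list = field(default_factory=list)  # stands in for the kernel stack

class Cpu:
    def __init__(self, task, registers):
        self.current = task
        self.registers = dict(registers)

    def switch_to(self, nxt):
        prev = self.current
        # 1. Save registers on the previous task's stack.
        prev.stack.append(dict(self.registers))
        # 2. Switch stack pointers: from here on the CPU runs on nxt's stack.
        self.current = nxt
        # 3. Restore registers from the new task's stack.
        self.registers = nxt.stack.pop()

a = Task("a")
b = Task("b", stack=[{"rbx": 7, "rbp": 70}])  # b was switched out earlier
cpu = Cpu(a, {"rbx": 1, "rbp": 10})

cpu.switch_to(b)
regs_in_b = dict(cpu.registers)
cpu.registers["rbx"] = 8                      # b runs and changes a register
cpu.switch_to(a)
regs_back_in_a = dict(cpu.registers)
```

Task `a` resumes with exactly the register values it had when it was switched out, while `b`'s modified state waits on `b`'s own stack.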

Abhinav Upadhyay

@abhi9u

2 years ago

Beautiful implementation of partially ordered sets in the Go compiler. This is why you study discrete math!

Likes: 528 | Replies: 4 | Reposts: 45
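For readers unfamiliar with the idea, here is a minimal sketch of a strict partial order kept as a directed acyclic graph, with transitivity answered by reachability. It is not the Go compiler's implementation; the `Poset` class and its method names are hypothetical.

```python
from collections import defaultdict

class Poset:
    """Strict partial order stored as a DAG of known 'a < b' facts."""

    def __init__(self):
        self._greater = defaultdict(set)  # a -> nodes recorded as directly > a

    def ordered(self, a, b):
        """True if a < b follows from the recorded facts by transitivity."""
        stack, seen = [a], set()
        while stack:
            node = stack.pop()
            for nxt in self._greater.get(node, ()):
                if nxt == b:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    def set_order(self, a, b):
        """Record a < b; refuse facts that would contradict the order."""
        if a == b or self.ordered(b, a):
            return False
        self._greater[a].add(b)
        return True

p = Poset()
added_xy = p.set_order("x", "y")
added_yz = p.set_order("y", "z")
x_below_z = p.ordered("x", "z")      # derived, never stated directly
contradiction = p.set_order("z", "x")
unrelated = p.ordered("x", "w")
```

Elements such as "x" and "w" stay incomparable, which is what separates a partial order from a total one.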

Abhinav Upadhyay

@abhi9u

2 years ago

My new article on the design & implementation of the CPython VM is out. It is my most comprehensive article yet, at 5500 words and 17 code listings, such as this: The VM is the most central piece of any interpreted language because this is how your code eventually executes. As a

Likes: 730 | Replies: 6 | Reposts: 111
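The code listing the tweet refers to is not included here. To see what the VM actually executes, the standard `dis` module can list the bytecode of a function. The `add` function below is only an example, and exact opcode names differ between CPython versions.

```python
import dis

def add(a, b):
    return a + b

# Collect the opcode names the VM will dispatch on for this function.
opnames = [ins.opname for ins in dis.get_instructions(add)]
```

Calling `dis.dis(add)` prints the same instructions in a readable table, including their arguments and offsets.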

Abhinav Upadhyay

@abhi9u

2 years ago

In my latest article I do a survey of speculative decoding techniques which are used widely to increase LLM inference efficiency and cut costs. Inspired by how CPUs do speculative execution of instructions to increase instruction throughput and to execute programs faster,

Likes: 101 | Replies: 3 | Reposts: 19
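A minimal sketch of the core idea, assuming the simplest greedy variant: a cheap draft model guesses several tokens ahead, and the expensive target model keeps guesses only while it agrees. The two "models" are toy functions invented for illustration, and a real system would verify all draft positions in one batched forward pass rather than one call per token.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    tokens = list(prompt)
    limit = len(prompt) + max_new
    while len(tokens) < limit:
        # Draft phase: the cheap model guesses up to k tokens ahead.
        draft = []
        for _ in range(min(k, limit - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # Verify phase: keep guesses only while the target agrees.
        for guess in draft:
            expected = target_next(tokens)
            tokens.append(expected)   # the target's token is always kept
            if guess != expected:
                break                 # discard the rest of the draft
    return tokens

def target_next(tokens):
    # Toy "large model": counts upward modulo 10.
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    # Toy "small model": agrees with the target except right after a 4.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

output = speculative_decode(target_next, draft_next, prompt=[0], max_new=8)
```

The output matches what the target model would produce on its own. The speedup in practice comes from accepting several draft tokens per target pass.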

Abhinav Upadhyay

@abhi9u

a year ago

I write "Confessions of a Code Addict", so here is a confession: Even though I make fun of LLMs every now and then, I've actually been using AI coding assistants from the early days. I've used GitHub Copilot since its beta release, and Cursor from the initial releases.

Likes: 73 | Replies: 2 | Reposts: 8

Abhinav Upadhyay

@abhi9u

a year ago

Just noticed that my latest article, which is the first in a series on the internals of context switching in Linux, is on the front page of HN. The article covers the core data structures for the process and memory state, and covers details which are critical for saving and

Likes: 412 | Replies: 5 | Reposts: 36

Abhinav Upadhyay

@abhi9u

a year ago

How do you fit a 250kB dictionary in 64kB of RAM and do lookups? For reference, even gzip -9 cannot compress this file beyond 85kB. In the 1970s, Douglas McIlroy at AT&T had the same challenge when implementing the spell checker for Unix. Instead of relying on generic

Likes: 3.3K | Replies: 45 | Reposts: 460
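The tweet is cut off before it names the technique, so the following is only an assumption about the general direction: a sketch of one domain-specific alternative to generic compression, where each word is hashed to a fixed-width code, the codes are sorted, and only the gaps between neighbours are stored. The hash function, the 27-bit width, and the word list are illustrative choices. A further entropy-coding pass over the gaps is omitted, and lookups can report false positives when two words share a code.

```python
import hashlib

HASH_BITS = 27  # illustrative width; an assumption, not taken from the tweet

def word_hash(word):
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - HASH_BITS)

def build(words):
    """Return the first sorted code followed by gaps between neighbours."""
    codes = sorted({word_hash(w) for w in words})  # assumes a non-empty list
    return [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]

def contains(gaps, word):
    """Replay the gaps, stopping once the running code passes the target."""
    target, running = word_hash(word), 0
    for gap in gaps:
        running += gap
        if running == target:
            return True
        if running > target:
            return False
    return False

words = ["kernel", "stack", "thread", "cache", "branch"]
gaps = build(words)
found = [contains(gaps, w) for w in words]
```

The gaps between sorted codes are much smaller numbers than the codes themselves, which is what makes a follow-up compact encoding effective.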

Abhinav Upadhyay

@abhi9u

a year ago

This is an Apidog appreciation post. I've built APIs all my life, and I know how painful and frustrating it can get when working with multiple teams. Keeping documentation, spec and code in sync, tracking dependencies and coordinating with multiple teams is not fun when

Likes: 60 | Replies: 1 | Reposts: 4