Paweł Dziepak (@paweldziepak) 's Twitter Profile
Paweł Dziepak

@paweldziepak

ID: 738457735105220608

Link: https://paweldziepak.dev
Joined: 02-06-2016 19:50:18

260 Tweets

472 Followers

183 Following

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

Isn't it great when string_view::substr doesn't get inlined and is 31 bytes? That's -O3 LTO, but callers are cold, which is supposed to be "optimise for size", but is more like "don't inline stuff". There's also pos bound checking, which C++ for some reason requires for substr.
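A minimal sketch of the bounds check mentioned in the tweet above: `std::string_view::substr(pos, count)` throws `std::out_of_range` when `pos > size()` and otherwise clamps `count` to the remaining length. The helper names `substr_throws` and `substr_unchecked` are hypothetical, introduced only for illustration; the unchecked variant assumes the caller has already guaranteed `pos <= sv.size()`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Returns true if sv.substr(pos) throws std::out_of_range.
// The standard requires this when pos > sv.size().
bool substr_throws(std::string_view sv, std::size_t pos) {
    try {
        (void)sv.substr(pos);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Hypothetical unchecked variant (not a standard function).
// Precondition: pos <= sv.size(). No exception path, so there is
// less code to emit when the call is not inlined.
constexpr std::string_view substr_unchecked(
    std::string_view sv, std::size_t pos,
    std::size_t count = std::string_view::npos) noexcept {
    const std::size_t remaining = sv.size() - pos;
    const std::size_t rlen = count < remaining ? count : remaining;
    return std::string_view(sv.data() + pos, rlen);
}
```

Note that `pos == size()` is valid and yields an empty view; only `pos > size()` throws.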

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

Now that CUDA11 is out, some more questions about Ampere are answered. The ISA is the same as Volta and Turing (plus the extensions for new features). At first glance it also looks like there are only minor changes to fixed latencies (and penalties).

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

I've been catching up on 400Gb/s ethernet phy and the main observation is that 802.3-2018 is a nearly 100MB PDF that still contains things like 2BASE-TL. I'm worried that further evolution is going to be hampered by the fact that no PDF reader will be able to open the spec.

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

I'm generally not very interested in old papers that didn't lead anywhere (so far?) – I'm not a historian. However, I must admit, "A subnanosecond HEMT 1Kb SRAM" is quite an impressive title for a paper published in 1984.

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

The Dune books have some strong opinions regarding atomic instructions: “Use of atomics [...] shall be cause for planetary obliteration.” Unclear whether this applies to LL/SC as well, but I’m inclined to agree regardless.

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

It appears that NOP on Cortex-A72 has throughput of 1, which seems rather slow considering that the front end can deliver 3 µops per cycle. On a more positive note, looks like GCC for aarch64 knows that spilling GP registers to the FP ones is an option.

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

Apparently, there is NUMA node interleaving with cache line granularity. As wrong as it feels, it sort of makes sense if the DRAM throughput is all that matters. The question still remains if it is not a serious design flaw when a single socket needs so much bandwidth.
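As a rough illustration of what cache-line-granularity interleaving means, the sketch below maps a physical address to a NUMA node by rotating through the nodes one cache line at a time. This is a simplified model under stated assumptions: the 64-byte line size, the plain modulo mapping, and the name `node_for_address` are all hypothetical, and real hardware may well hash address bits instead.

```cpp
#include <cstdint>

// Assumed cache line size in bytes; not taken from the tweet.
constexpr std::uint64_t kCacheLineBytes = 64;

// Simplified model: consecutive cache lines are assigned to nodes
// in round-robin order. Requires num_nodes > 0.
constexpr unsigned node_for_address(std::uint64_t phys_addr,
                                    unsigned num_nodes) {
    return static_cast<unsigned>((phys_addr / kCacheLineBytes) % num_nodes);
}
```

Under this model a sequential stream touches every node's memory controller in turn, which spreads the DRAM load evenly but makes most accesses remote.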

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

Pro tip: don't make typos in an inline asm that's inside a loop unrolled 10000 times. It's going to be the assembler that reports the errors. All 10000 of them.

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

I really need to take into account how shaky my hands are next time I’m designing a PCB with lots of 0402s that I need to assemble myself. Well, at least I didn’t use any 0201.

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

(Almost) everything is a matrix multiplication. That includes a few transmission lines and some switches. I've been playing around with 2x2 unitary matrix-complex vector multiplication using 5.8GHz microwave signals. Finally, wrote a post about this. paweldziepak.dev/2021/05/16/mat…
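For reference, the operation named in the tweet above (a 2x2 unitary matrix applied to a complex 2-vector) can be sketched numerically as follows. The example matrix is an assumption chosen for illustration, an ideal symmetric 50/50 coupler, and is not taken from the linked post; the type and function names are likewise hypothetical.

```cpp
#include <array>
#include <cmath>
#include <complex>

using Complex = std::complex<double>;
using Vec2 = std::array<Complex, 2>;
using Mat2 = std::array<std::array<Complex, 2>, 2>;

// y = M * x for a 2x2 complex matrix and a complex 2-vector.
Vec2 multiply(const Mat2& m, const Vec2& v) {
    return {m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]};
}

// Sum of squared magnitudes; a unitary matrix leaves this unchanged.
double power(const Vec2& v) {
    return std::norm(v[0]) + std::norm(v[1]);
}

// Assumed example: (1/sqrt(2)) * [[1, i], [i, 1]], which is unitary.
Mat2 example_unitary() {
    const double s = 1.0 / std::sqrt(2.0);
    Mat2 m{};
    m[0][0] = Complex(s, 0.0);
    m[0][1] = Complex(0.0, s);
    m[1][0] = Complex(0.0, s);
    m[1][1] = Complex(s, 0.0);
    return m;
}
```

Feeding a unit signal into the first input splits it equally between the two outputs with a 90 degree phase difference, and the total power stays the same, which is the property that makes a passive lossless network a candidate for implementing such a matrix.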

Andreas Abel (@uops_info) 's Twitter Profile Photo

Today, I released uiCA, the "uops.info Code Analyzer". uiCA is based on data from uops.info, combined with a new detailed pipeline model. An online version (that also supports other tools) is available at uica.uops.info (1/3)

Paweł Dziepak (@paweldziepak) 's Twitter Profile Photo

SMT solvers are great but it is way too easy to go from a problem that takes a day to solve to a question that won’t be answered before Earth is destroyed by the expanding Sun. Running it under ‘screen’ won’t help with the latter.

Pete Cawley (@corsix) 's Twitter Profile Photo

Given:
1. crc32 has throughput 1 on port 1
2. pclmulqdq has throughput 1 on port 5
3. pclmulqdq+pxor can emulate crc32

It seems that fastest crc32 code should divide input in half and issue a crc32 _and_ a pclmulqdq every cycle. Code and numbers at corsix.org/content/fast-c…

Tanel Poder 🇺🇦 (@tanelpoder) 's Twitter Profile Photo

Very cool article and I learned multiple new details! Discovering Hard Disk Physical Geometry through Microbenchmarking blog.stuffedcow.net/2019/09/hard-d…
