Sherin Jacob (@jcsherin) Twitter Tweets • TwiCopy

Andrew Lamb

5 months ago

It is a common misconception that Apache Parquet files are restricted to basic statistics. Footer metadata and offset-based addressing permit user-defined index structures today. Latest ApacheDataFusion blog from Qi Zhi, Jigao Luo and myself explains how datafusion.apache.org/blog/2025/07/1…

Azim Afroozeh

@afroozeh3

5 months ago

I'm excited to share that our paper (in collaboration with Peter Boncz) has been accepted at VLDB 2025 in London and will be presented there: The FastLanes File Format. In this paper, we introduce the FastLanes file format with Expression Encoding, a new way to define and combine

Andrew Lamb

@andrewlamb1111

5 months ago

Multi-level merge sort queued up for DataFusion 50.0.0 next month: github.com/apache/datafus… Thanks to Raz Luvaton ⬢ and Yongting You

Maximilian Kuschewski

@maxikuschewski

5 months ago

Andrew Lamb One improvement regarding benchmaxxing is having thousands of diverse benchmark queries instead of dozens. Plugging the new SQLStorm paper below ;)

Stefan Marr

@smarr

4 months ago

How can you slow down a program? And perhaps more importantly, why would you? Blog post on our upcoming paper at the VMIL Workshop at SPLASH: stefan-marr.de/2025/08/how-to… The research was led by Humphrey Burchell.

Peter Boncz

@peterabcz

4 months ago

Tobias Schmidt (TUM) at VLDB 2025 🇬🇧 presented SQLStorm, which uses LLMs to generate a huge number of large queries. SQLStorm now has 18K different complex queries and runs on a large real-world dataset (Stack Overflow). paper: vldb.org/pvldb/vol18/p4… code: github.com/SQL-Storm/SQLS…

Andrew Lamb

@andrewlamb1111

3 months ago

Recording of "Introduction to Variant in Apache Parquet": youtube.com/watch?v=nlOJD7… Here are the slides: docs.google.com/presentation/d…

Andrew Lamb

@andrewlamb1111

3 months ago

Dynamic Filters for TopK and Join queries landing in DataFusion 50.0.0: datafusion.apache.org/blog/2025/09/1…

Xuanwo

@onlyxuanwo

3 months ago

People asked me how OpenDAL makes money: the answer is it doesn’t. OpenDAL is a public good; it helps you access storage services and make money 🫡

Andy Pavlo (@andypavlo.bsky.social)

@andy_pavlo

3 months ago

The sordid backstory is that there was a collaboration attempt to unify on a single format with CMU, Tsinghua, Meta, CWI, Voltron, Nvidia, and SpiralDB. The plan was to create a consortium and start with Meta's Nimble. But then lawyers got involved and it all fell apart.

Andy Pavlo (@andypavlo.bsky.social)

@andy_pavlo

3 months ago

So instead of working together, everyone (including us) released their own format: → velox-lib Nimble: github.com/facebookincuba… → CWI DA FastLanes: github.com/cwida/FastLanes → Spiral Vortex: vortex.dev

Andrew Lamb

@andrewlamb1111

2 months ago

Our new Thrift parser in the Rust Apache Parquet implementation is a 🎁 that keeps on giving performance-wise 🚀 github.com/apache/arrow-r… We are also working on a blog post that has a deeper explanation

Sherin Jacob

@jcsherin

2 months ago

New post -- A B+Tree Node Underflows: Merge or Borrow? jacobsherin.com/posts/2025-08-… An interesting engineering trade-off I stumbled upon while implementing a concurrent B+Tree from scratch, where production databases diverge from textbook algorithms, and each does it its own way.

v

@iavins

2 months ago

We use asserts all the time in Turso DB and also in the Turso Server. They're in release builds and shipped to production. And yes, they could crash the server. Asserts are my favorites, and I use them whenever possible. Just yesterday I merged a PR that contained asserts and

Angelo 🇵🇷

@ngeloxyz

2 months ago

First one is: "Speedrunning the lakehouse" by Jacopo Tagliabue (CTO of Bauplan) He asks: What if we started from scratch? Building a lakehouse infrastructure from scratch. Hilarious, funny, and informative youtube.com/watch?v=dvBRC9…

Wes McKinney

@wesmckinn

2 months ago

Excited to announce a new side project, a power user terminal UI for your personal finances: moneyflow.dev For years I've used personal finance tools like Mint and now Monarch. The data cleaning can be slow and tedious, so I made this to speed that up!

Andrew Lamb

@andrewlamb1111

2 months ago

ApacheDataFusion's policy for AI-assisted contributions: AI is great, but not AI dumps: maintainers could finish the task faster by using AI directly, and the submitters gain little knowledge when acting as a pass-through AI proxy. datafusion.apache.org/contributor-gu…

Xuanwo

@onlyxuanwo

2 months ago

For everyone interested in data infra who wants to get a quick sense of how big data works, how data systems are designed, and what the tradeoffs are: start with this share from Xiangpeng Hao. Really nice intro! intro-data-system.xiangpeng.systems

Jasim

@jasim_ab

2 months ago

Been working on a tiny LLM service to help me write prompts just like regular well-typed application code. Here's a sample use case - map freeform text to an address form:

samlaf

@samlafer

2 months ago

New paper by Nancy Lynch summarizing her career's influence on the field of distributed computing. arxiv.org/pdf/2502.20468 If you don't know who she is, she's the L in FLP and DLS. Marc Brooker has a good summary article: brooker.co.za/blog/2014/05/1…
