Sherin Jacob
@jcsherin
Programmer
ID: 3515788533
https://protoship.io/ 01-09-2015 20:01:16
264 Tweet
226 Followers
153 Following
It is a common misconception that Apache Parquet files are restricted to basic statistics. Footer metadata and offset-based addressing permit user-defined index structures today. Latest ApacheDataFusion blog from Qi Zhi, Jigao Luo and myself explains how datafusion.apache.org/blog/2025/07/1…
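A minimal sketch of the general idea in that post, assuming a toy container format rather than real Parquet: the writer appends a user-defined index after the data, records its byte offset and length in footer key-value metadata, and the reader seeks straight to it. The magic bytes, the JSON footer, and the names `write_file`, `read_footer`, `may_contain`, and `distinct_index_offset` are all hypothetical and are not taken from the Parquet specification or the blog post.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # hypothetical magic bytes for this toy format, not Parquet's


def write_file(rows, buf):
    """Write data, then a custom index, then a footer that locates both."""
    buf.write(MAGIC)
    data_offset = buf.tell()
    data = "\n".join(rows).encode("utf-8")
    buf.write(data)

    # User-defined index: the sorted distinct values in the data section.
    index_offset = buf.tell()
    index = json.dumps(sorted(set(rows))).encode("utf-8")
    buf.write(index)

    # The footer records offsets, so readers that do not know about the
    # index can ignore it and still find the data.
    footer = json.dumps({
        "data_offset": data_offset,
        "data_length": len(data),
        "key_value_metadata": {
            "distinct_index_offset": index_offset,
            "distinct_index_length": len(index),
        },
    }).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # 4-byte little-endian footer length
    buf.write(MAGIC)


def read_footer(buf):
    """Locate the footer from the last 8 bytes: footer length plus magic."""
    buf.seek(-8, io.SEEK_END)
    tail = buf.read(8)
    (footer_len,) = struct.unpack("<I", tail[:4])
    if tail[4:] != MAGIC:
        raise ValueError("not a toy file")
    buf.seek(-(8 + footer_len), io.SEEK_END)
    return json.loads(buf.read(footer_len))


def may_contain(buf, value):
    """Answer a membership query from the index alone, without reading the data."""
    kv = read_footer(buf)["key_value_metadata"]
    buf.seek(kv["distinct_index_offset"])
    distinct = json.loads(buf.read(kv["distinct_index_length"]))
    return value in distinct
```

For example, after `write_file(["a", "b", "a"], buf)` on an `io.BytesIO()`, a call to `may_contain(buf, "z")` reads only the footer and the index bytes, never the data section.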
I'm excited to share that our paper (in collaboration with Peter Boncz) has been accepted at VLDB 2025 in London and will be presented there: "The FastLanes File Format". In this paper, we introduce the FastLanes file format with Expression Encoding, a new way to define and combine
Multi-level merge sort queued up for DataFusion 50.0.0 next month: github.com/apache/datafus… Thanks to Raz Luvaton ⬢ and Yongting You
Andrew Lamb One improvement regarding benchmaxxing is having thousands of diverse benchmark queries instead of dozens. Plugging the new SQLStorm paper below ;)
How can you slow down a program? And perhaps more importantly, why would you? Blog post on our upcoming paper at the VMIL Workshop at SPLASH. stefan-marr.de/2025/08/how-to… The research was led by Humphrey Burchell.
Tobias Schmidt (TUM) at VLDB 2025 🇬🇧 presented SQLStorm, which uses LLMs to generate a huge number of large queries. SQLStorm now has 18K different complex queries and runs on a large real-world dataset (Stack Overflow) paper: vldb.org/pvldb/vol18/p4… code: github.com/SQL-Storm/SQLS…
Recording of "Introduction to Variant in Apache Parquet": youtube.com/watch?v=nlOJD7… Here are the slides: docs.google.com/presentation/d…
Our new thrift parser in the Rust Apache Parquet implementation is a 🎁 that keeps on giving performance-wise 🚀 github.com/apache/arrow-r… We are also working on a blog post that has a deeper explanation
ApacheDataFusion's policy for AI-assisted contribution: AI is great, but not AI dumps: maintainers could finish the task faster by using AI directly, and the submitters gain little knowledge when acting as a pass-through AI proxy. datafusion.apache.org/contributor-gu…
For everyone interested in data infra who wants to get a quick sense of how big data works, how data systems are designed, and what the tradeoffs are: start with this share from Xiangpeng Hao. Really nice intro! intro-data-system.xiangpeng.systems
New paper by Nancy Lynch summarizing her career's influence on the field of distributed computing. arxiv.org/pdf/2502.20468 If you don't know who she is, she's the L in FLP and DLS. Marc Brooker has a good summary article: brooker.co.za/blog/2014/05/1…