Luca Canali (@LucaCanaliDB)

🚀 New Blog Post: 'Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization.' Get tips, tools, and short demos to boost your Spark performance. Ideal for developers! 🛠️ #ApacheSpark #Performance

Read more: db-blog.web.cern.ch/node/195

Coiled (@CoiledHQ)

How does @dask_dev compare to @ApacheSpark? We’ve re-run the TPC-H benchmarks and noticed some interesting results 🧵

docs.coiled.io/blog/spark-vs-…

No Priors (@NoPriorsPod)

New 🔥 Ep #11: Sarah Guo (Conviction) & Elad Gil talk to Matei Zaharia, founder of Databricks, creator of Apache Spark, and Stanford University CS professor:
- Dolly, betting on small models
- scaling asymptotes
- LLMs in the enterprise
- going from academic to founder/CTO of a $1B+ revenue company
🎙no-priors.com

Kyle Weller (@KyleJWeller)

Amazon revealed the data architecture of their package delivery platform. Since working from home, I witness a steady stream of packages on my porch and I'm starting to wonder how many GBs of data my spouse has contributed to this dataset... 💸
#apachehudi #apachespark

🧵 link below 👇

Matthew Powers (@neapowers)

Did you know that you can query PySpark DataFrames with SQL now without creating a temporary table/view?

This is a huge quality-of-life improvement for #pyspark users and shows how @ApacheSpark is continuously improving.

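A minimal sketch of the feature, assuming Spark 3.4 or later, where spark.sql() accepts DataFrames as keyword arguments:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 28)], ["name", "age"])

# Since Spark 3.4, spark.sql() accepts DataFrames as keyword arguments,
# so {df} is substituted directly -- no temporary view required.
spark.sql("SELECT name FROM {df} WHERE age > 30", df=df).show()
```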
Khuyen Tran (@KhuyenTran16)

Retrieving all rows from a large dataset into memory can cause out-of-memory errors. An #ApacheSpark DataFrame delays computation until an action like collect() is called, allowing rows to be reduced through filtering or aggregation first.

This results in more efficient memory usage.

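A minimal sketch of that lazy-evaluation behavior, assuming a local Spark session:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# A wide range; nothing is computed when the DataFrame is defined.
df = spark.range(100_000_000)

# Transformations are lazy too: Spark only records the query plan.
small = df.filter(F.col("id") % 1_000_000 == 0)

# collect() is the action that triggers execution; only the 100
# filtered rows reach driver memory, not all 100 million.
rows = small.collect()
print(len(rows))
```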
Matei Zaharia (@matei_zaharia)

One of my favorite announcements: English SDK for @ApacheSpark! No more need to remember weird syntax, just chain transformations in natural language with the familiar Spark API. So many fun examples.
databricks.com/blog/introduci…

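For a rough idea of what this looks like, here is a sketch based on the pyspark-ai package; the exact API may vary by version, and an LLM backend (e.g., an OpenAI key) is assumed to be configured:

```python
from pyspark.sql import SparkSession
from pyspark_ai import SparkAI

spark = SparkSession.builder.getOrCreate()

spark_ai = SparkAI()  # assumes an LLM backend (e.g., OPENAI_API_KEY) is set up
spark_ai.activate()   # adds the .ai accessor to DataFrames

df = spark.createDataFrame(
    [("India", 1_428), ("US", 340), ("Japan", 123)],
    ["country", "population_millions"],
)

# Describe the transformation in natural language instead of Spark syntax.
df.ai.transform("keep only countries with more than 500 million people").show()
```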
Khuyen Tran (@KhuyenTran16)

In PySpark, parametrized queries enable the same query structure to be reused with different inputs, without rewriting the SQL.

Additionally, they safeguard against SQL injection attacks by treating input data as parameters rather than as executable code.

#ApacheSpark
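
A minimal sketch, assuming Spark 3.4+ where spark.sql() accepts an args mapping for named parameter markers:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("laptop", 1200), ("mouse", 25), ("monitor", 300)], ["item", "price"]
)

# :min_price is bound via args, so the value is treated as data --
# it is never spliced into the SQL string as executable code.
spark.sql(
    "SELECT item, price FROM {products} WHERE price > :min_price",
    products=df,
    args={"min_price": 100},
).show()
```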
Jacek Laskowski (@jaceklaskowski)

And we know what's coming in #ApacheSpark 4.0.0. This version surely makes all of us long-time Spark users feel soooo OLD! 😆

And I wouldn't be surprised if some of my tricks have already become outdated 😉

Named parameters in SQL statements are already available since 3.5.

Khuyen Tran (@KhuyenTran16)

Duplicated code in #SQL queries can lead to inconsistencies if changes are made to one instance of the duplicated code but not to others.

@ApacheSpark UDFs can help address these issues by encapsulating complex logic that is reused across multiple SQL queries.

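A minimal sketch of the pattern, with hypothetical normalize_email logic standing in for the shared code: register the function once as a UDF, then reuse it from any SQL query.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Shared logic lives in exactly one place.
def normalize_email(email: str) -> str:
    return email.strip().lower() if email else None

# Register it once so every SQL query can call it by name.
spark.udf.register("normalize_email", normalize_email, StringType())

df = spark.createDataFrame([("  Alice@Example.COM ",)], ["email"])
df.createOrReplaceTempView("users")

spark.sql("SELECT normalize_email(email) AS email FROM users").show()
```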
Anna Geller (@anna__geller)

New blog post: a deep dive into dataframes and table abstractions featuring @DataPolars, @duckdb, @pandas_dev, @getdbt, @ApacheSpark, @dask_dev, @ponderdata, @fugue_project, ... — when to use which framework and how they compare or integrate with each other

Khuyen Tran (@KhuyenTran16)

Spark enables scaling of your pandas workloads across multiple nodes. However, learning PySpark syntax can be daunting for pandas users.

Pandas API on Spark enables leveraging Spark's capabilities for big data while retaining a familiar pandas-like syntax.

#apachespark #pandas
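
A minimal sketch of the pandas API on Spark (shipped with Spark since 3.2):

```python
import pyspark.pandas as ps

# pandas-like syntax, but execution is distributed by Spark.
psdf = ps.DataFrame({"category": ["a", "b", "a"], "value": [1, 2, 3]})

# Familiar pandas idioms work unchanged on big data.
print(psdf.groupby("category")["value"].sum())
```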
Khuyen Tran (@KhuyenTran16)

#ApacheSpark 3.5 added new array helper functions that simplify working with array data. Below are a few examples showcasing these functions.

🚀 View other array functions: bit.ly/4c0txD1
⭐️ Bookmark this post: bit.ly/3TnNCM3

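A quick sketch of a few of these helpers; as far as I recall, array_append and array_compact landed in 3.4, with array_prepend following in 3.5:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([([1, 2, None, 3],)], ["nums"])

df.select(
    F.array_compact("nums").alias("no_nulls"),     # drop NULL elements
    F.array_append("nums", 4).alias("appended"),   # add to the end
    F.array_prepend("nums", 0).alias("prepended"), # add to the front
).show(truncate=False)
```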
Eléa (@EleaPetton)

Full house at #VeryTechTrip for Claire and Adrien's talk on distributed image processing in the service of #IA (AI)! A real time-saver for data processing 😉... @OVHcloud_Tech

#apachespark #dataprocessing