If you’ve seen a demo, read a blog, or consumed any content about Snowpark for Python, you’re probably aware that, syntactically, it’s incredibly similar to PySpark. In many ways, that’s a blessing. Upskilling in Snowpark for Python is simple for PySpark users, and migrating from PySpark to Snowpark is easier because of it. However, the similarity has given rise to the notion that they’re the same thing. Don’t let syntactical similarity deceive you; there are plenty of meaningful differences between the two and many reasons why Snowflake customers could benefit from migrating from PySpark to Snowpark for Python.
PySpark is a Python library used to develop Python solutions with the aid of Apache Spark, which is an open-source engine for performing distributed data transformation, data processing, and machine learning across a cluster. Basically… instead of doing lots of data work on one machine, you can use a cluster of machines to run that work at scale.
Typically, the management of that cluster falls upon the user, which is a demanding task, but solutions such as AWS Glue and Databricks exist to ease that burden.
Snowpark for Python Infrastructure
Snowpark for Python is a Python library for developing Python solutions in Snowflake. Snowflake’s platform is a unified architecture that allows you to integrate data from a wide range of data sources and has processing that can elastically scale to deal with even the biggest and scariest workloads. Unlike Spark or managed Spark solutions, developers and end users don’t have to deal with job tuning or scaling. With Snowflake, everything is wholly managed, including metadata management, partitioning, clustering, and much more, especially around security and governance.
User Defined Functions (UDFs) Comparison
PySpark UDFs Challenges
Often, built-in transformations in any solution don’t do the niche and specific things our end users require. In these circumstances, the answer is to program a UDF to execute those transformations. If you’re a PySpark user, you’ll know that’s easier said than done.
First of all, you need to pick the type of UDF; there are standard UDFs, partial functions, and Pandas UDFs… each has its advantages, but in general, you’re advised to avoid them altogether. UDFs are a black box as far as Spark is concerned, so using them means you lose the optimization advantages Spark offers for built-in functions. A standard Python UDF also has to move every row between the JVM and a Python worker process, which adds serialization overhead on top.
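To make the trade-off concrete, here is a minimal sketch that applies a similar clean-up once with a built-in column function and once with a standard Python UDF. The column name `name` and the helper names are assumptions made for illustration, and the PySpark imports are deferred into the functions so the plain Python logic can be tested without a cluster.

```python
def title_case(value):
    """Plain Python logic. Once wrapped as a UDF, Spark cannot look inside it."""
    return value.title() if value is not None else None


def clean_with_builtin(df):
    """Built-in column function: Spark can plan and optimize this expression."""
    from pyspark.sql import functions as F  # deferred import; requires pyspark

    return df.withColumn("name_clean", F.initcap(F.col("name")))


def clean_with_udf(df):
    """Standard Python UDF: each row is handed to a Python worker process."""
    from pyspark.sql import functions as F  # deferred import; requires pyspark
    from pyspark.sql.types import StringType

    title_case_udf = F.udf(title_case, StringType())
    return df.withColumn("name_clean", title_case_udf(F.col("name")))
```

Where a built-in function such as `initcap` covers the requirement, it is usually the better choice; the UDF route is for logic that has no built-in equivalent.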
Snowpark UDFs Advantages
Snowpark – and I don’t think a big enough deal is made of this – doesn’t have this problem. The documentation doesn’t delve into the reasons why Snowpark holds this advantage; I suspect it has to do with the ease with which Snowflake’s cloud services layer (the brains of the operation) and its virtual warehouses (the muscle) interact, and with the way Snowflake performs micro-partitioning. Speculation aside, what we do know is that UDFs don’t slow Snowpark down in the way they slow Spark down.
Another point to mention here is that UDFs can be pre-registered in your Snowflake account, meaning they don’t need to be declared and installed as part of your ETL/ELT job, which can be a huge time-saver.
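As a rough sketch of that workflow, the snippet below registers a permanent UDF once and then calls it by name from a separate job. The UDF name `normalize_sku`, the stage `@udf_stage`, and the column `sku` are hypothetical, and the Snowpark import is deferred so the handler can be unit-tested locally.

```python
def normalize_sku(raw: str) -> str:
    """Upper-case a product code and drop anything that is not a letter or digit."""
    return "".join(ch for ch in raw.upper() if ch.isalnum())


def register_once(session, stage_location="@udf_stage"):
    """Run once (for example at deploy time) to store the UDF in the account.

    `session` is assumed to be a snowflake.snowpark.Session; the stage name is
    a placeholder. Argument and return types are inferred from the type hints.
    """
    return session.udf.register(
        normalize_sku,
        name="normalize_sku",
        is_permanent=True,          # persists beyond the current session
        stage_location=stage_location,
        replace=True,
    )


def use_in_job(df):
    """Inside an ETL/ELT job: call the already-registered UDF by name."""
    from snowflake.snowpark.functions import call_udf, col  # deferred import

    return df.with_column("sku_clean", call_udf("normalize_sku", col("sku")))
```

The job itself only needs `use_in_job`; nothing has to be declared or uploaded at run time.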
Using Python Libraries
The process for using other Python libraries alongside PySpark is difficult when done on a cluster you manage and only slightly easier when done on a more managed service. In Snowpark, it’s relatively easy. As part of Snowflake’s partnership with Anaconda, they have a curated list of ready-to-go Python libraries you can use straight away by declaring them as package dependencies for your session or UDF.
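A minimal sketch of what that can look like, assuming a `snowflake.snowpark.Session` is passed in as `session` and using `numpy` purely as an example package; the UDF name `log1p_amount` is hypothetical.

```python
def log1p_amount(amount: float) -> float:
    """Handler that depends on a third-party library."""
    import numpy as np  # resolved from Snowflake's Anaconda channel at run time

    return float(np.log1p(amount))


def add_session_packages(session):
    """Option 1: declare packages once for every UDF created in this session."""
    session.add_packages("numpy")


def register_with_packages(session):
    """Option 2: declare packages for a single UDF at registration time."""
    return session.udf.register(
        log1p_amount,
        name="log1p_amount",
        packages=["numpy"],
        replace=True,
    )
```

Either option only names the package; no wheel files or cluster init scripts are involved for libraries on the curated list.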
And the dependency management is made a lot easier thanks to the integrated Conda package manager. The Conda package manager is embedded into the UDF creation flow behind the scenes to reduce time spent in “dependency hell,” the scourge of any PySpark developer.
You can follow the process outlined in the Snowflake documentation to bring your own Python libraries.
Snowflake Features in Snowpark
Let’s remember that Snowflake comes with many features besides just Snowpark, which can be taken advantage of while doing Snowpark development.
In Snowpark, your session object can, for example:
- Execute session commands – Session.use_schema(), Session.use_role(), Session.use_warehouse() are all available functions
- Execute context functions – Session.get_current_account(), Session.get_current_role() etc…
- Perform admin tasks, such as performing file operations on stages ( Session.file.get() / Session.file.put() ), setting query tags (the Session.query_tag property), and registering stored procedures ( Session.sproc.register() )
- Work with semi-structured data, for example VARIANT flattening with Session.flatten()
- Interact with files staged in external stages directly with DataFrameReader objects
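The list above can be sketched in a few lines. This is an illustration under assumptions rather than a reference implementation: `session` is taken to be a `snowflake.snowpark.Session`, all role, warehouse, schema, and stage names are supplied by the caller, and the query tag is set by assigning to the `query_tag` attribute.

```python
def prepare_session(session, role, warehouse, schema, tag):
    """Switch context, tag subsequent queries, and report where we ended up."""
    session.use_role(role)
    session.use_warehouse(warehouse)
    session.use_schema(schema)
    session.query_tag = tag  # attribute assignment, not a method call

    return {
        "account": session.get_current_account(),
        "role": session.get_current_role(),
        "schema": session.get_current_schema(),
    }


def upload_and_read_csv(session, local_path, stage_path, schema):
    """Upload a local file to a stage, then read the staged data as a DataFrame.

    `schema` is assumed to be a snowflake.snowpark.types.StructType describing
    the CSV columns; `stage_path` is something like "@my_stage/incoming".
    """
    session.file.put(local_path, stage_path, auto_compress=False, overwrite=True)
    return session.read.schema(schema).csv(stage_path)
```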
Snowflake already does a lot of cool stuff, so it’s only natural that Snowpark, as the more tailored solution, brings a lot of that Snowflake functionality with it. Most of the above can still be done from PySpark, but it all involves writing, at times, convoluted SQL.
Given the early days of this technology, there are few reliable published figures comparing Snowpark like-for-like with other technologies. Still, customers are reporting 2-3x performance improvements at 30-50% of the cost. At the favorable end of both ranges (3x the performance at 30% of the cost), that translates into a 10x price-performance gain; at the conservative end (2x at 50%), it is still 4x.
And this makes sense. Snowflake virtual warehouses use vectorized execution engines, the benefits of which are well-documented. That’s a challenging thing to set up yourself. Moreover, Snowflake offers Snowpark-optimized virtual warehouses with significantly more memory available for those really memory-intensive workloads, such as machine learning (ML).
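For reference, switching a job onto a Snowpark-optimized warehouse might look like the sketch below. The warehouse name and size are placeholders, `session` is assumed to be a `snowflake.snowpark.Session`, and the role in use is assumed to have the privilege to create warehouses.

```python
def build_create_warehouse_sql(name, size="MEDIUM"):
    """Build the DDL for a Snowpark-optimized warehouse.

    `name` and `size` are interpolated directly, so they are assumed to come
    from trusted configuration rather than user input.
    """
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WITH WAREHOUSE_SIZE = '{size}' "
        f"WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'"
    )


def use_snowpark_optimized_warehouse(session, name, size="MEDIUM"):
    """Create the warehouse if needed and make it current for this session."""
    statement = build_create_warehouse_sql(name, size)
    session.sql(statement).collect()  # collect() triggers execution
    session.use_warehouse(name)
    return statement
```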
Occam’s Razor: Simplifying Your Architecture
The 14th-century friar William of Occam gets a hard time from philosophers of knowledge for formulating the principle that, when faced with two competing theories, one should opt for the simpler because probability dictates that more can be wrong with the more complex theory.
In engineering, though, Occam has a point. When designing our ETL/ELT pipelines, the more tools we introduce, the higher the chance of everything breaking. If your data resides in Snowflake (along with places easily accessed by Snowflake), why not introduce greater simplicity to your architecture?
One of the significant causes of pipeline breakage is bugs in the communication between two or more tools. Version upgrades can play havoc with transfers that have been working oh so well, right up until that dreaded monitoring email hits your inbox.
After all, pluralitas non est ponenda sine necessitate – plurality should not be posited without necessity.
Wouldn’t you love to quote that to your CTO?
Exploring Snowpark for Python with Our Video Tutorial
I’ve recently recorded a YouTube video titled “Snowpark for Python | Snowflake Tutorial” to provide you with a hands-on demonstration of Snowpark’s capabilities in Python. The video offers insight into Snowpark, an API on top of Snowflake that allows developers to transform data, design models, and create data-centric applications using programming languages such as Scala, Java, and Python.
In the tutorial, we delve into creating data frames, understanding the concept of lazy execution, and utilizing methods such as .save(), .collect(), and .join(). By the end of the video, you’ll have a solid foundation in using Snowpark to query and process data in Python, enhancing your data engineering skills.
This video tutorial is designed for everyone, from expert data engineers to those just beginning to learn about data. Snowpark programming is an essential skill in today’s data-driven world, and our video provides a comprehensive introduction to get you started.
Interested in Migrating from PySpark to Snowpark for Python?
If you’re interested in moving from PySpark to Snowpark for Python, we here at Aimpoint Digital can help. Not only can we help assess the pros and cons of migration, but if you go ahead, we have a team of Snowflake, PySpark, and Snowpark developers who can help you convert your PySpark jobs to Snowpark. So, if this sounds like something you might want to look into, don’t hesitate to reach out through the form below.