Dataiku and Snowflake: A Joint AI Solution

December 19, 2023


Ben Gardner-Moss


Ready to unlock the full potential of your analytics journey? Together, Dataiku and Snowflake form a complementary analytics platform that pairs Snowflake’s computational power and processing flexibility with Dataiku’s strengths in visual data ingestion, transformation, machine learning, and model management.

There are 4 key benefits to using Dataiku and Snowflake together:

Simplicity: Dataiku provides a user-friendly interface for accessing and analyzing data in Snowflake, making it simple to operationalize data transformation pipelines and machine learning models.
Performance: Dataiku and Snowflake easily scale to tackle big data, providing elastic computing performance that matches growing data and analytics requirements.
Operationalization: Dataiku and Snowflake support access to the same data by multiple collaborating groups simultaneously without diminishing performance.
Scalability and Cost Control: Dataiku integrates with Snowflake for computation, allowing customers to use highly scalable cloud computing on Snowflake and only pay for the computation they use.

In this blog, we’ll address the 5 technical components within Dataiku that allow users to achieve these benefits:

Connections
Datasets
Recipe Computation
Snowpark Integration
SQL Pipelines

Connections

Connections allow users to interact with cloud-based data sources from within Dataiku.

The basic parameters for connecting to Snowflake should be easily recognizable to any Snowflake user: you can set the ‘Host,’ ‘Database,’ ‘Warehouse,’ ‘Role,’ and ‘Schema’ you wish to connect to. All but the ‘Host’ option can be left empty to use the default objects for the user interacting with that connection.

Variables can also be leveraged to dynamically control the Warehouse and Role used by a project when interacting with that connection.

This is valuable when your organization wants to attribute query execution costs to the appropriate business unit. For example, a project variable could indicate that the “Employee Retention” flow should use the “HR” warehouse, while the “Month End Reconciliation” flow should use the “Finance” warehouse.
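As a rough illustration of the idea, the sketch below mimics variable expansion in plain Python. It is not Dataiku's implementation: the placeholder names (snowflake_warehouse, snowflake_role), the project keys, and the role names are all hypothetical and chosen only to mirror the example above.

```python
from string import Template

# Hypothetical connection settings: the Warehouse and Role fields hold
# ${...} placeholders instead of fixed object names.
CONNECTION_PARAMS = {
    "warehouse": "${snowflake_warehouse}",
    "role": "${snowflake_role}",
}

# Hypothetical per-project variables, one entry per project.
PROJECT_VARIABLES = {
    "EMPLOYEE_RETENTION": {
        "snowflake_warehouse": "HR",
        "snowflake_role": "HR_ANALYST",
    },
    "MONTH_END_RECONCILIATION": {
        "snowflake_warehouse": "FINANCE",
        "snowflake_role": "FINANCE_ANALYST",
    },
}


def resolve_connection(project_key: str) -> dict:
    """Expand the placeholders using the variables of the given project."""
    variables = PROJECT_VARIABLES[project_key]
    return {
        name: Template(value).substitute(variables)
        for name, value in CONNECTION_PARAMS.items()
    }


print(resolve_connection("EMPLOYEE_RETENTION"))
print(resolve_connection("MONTH_END_RECONCILIATION"))
```

The point of the pattern is that a single shared connection can serve both projects, while each project's queries run on, and are billed to, its own warehouse.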

Beyond the basic connection parameters, users can configure the ‘Fast Write’ functionality, which bulk loads data into Snowflake via an external stage such as Amazon S3, Azure Blob, or Google Cloud Storage.

From an authentication perspective, businesses can use OAuth, global credentials, or per-user credentials. The connection can then be made available to all analysts or a selected user group.

Above, we have highlighted some key features of the connection experience; however, it is worth exploring Dataiku’s documentation to understand the further parameters that users can control when connecting with Snowflake.

Datasets

Datasets provide the primary method of extracting data from and loading data into Snowflake from within Dataiku.

When using Snowflake as an input source for your flow, users can either visually select the dataset they wish to read or drop in an SQL script. They can also choose to activate partitioning, a Dataiku concept that allows users to process chunks of data in an isolated fashion. Note that Dataiku’s partitioning feature is unrelated to Snowflake’s.

When using Snowflake as an output dataset for any recipe, users can designate whether they wish to append to or overwrite the target table. Pre- and post-SQL statements can also be applied to customize the output further. A simple example might be a post-SQL statement that sets appropriate permissions on the table being created.
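To make the permissions example concrete, here is a minimal sketch of a helper that builds such a statement. The helper itself, along with the database, schema, table, and role names, is hypothetical; the statement it returns uses Snowflake's standard GRANT syntax and could be pasted into a post-SQL field.

```python
def grant_select_statement(database: str, schema: str, table: str, role: str) -> str:
    """Build a Snowflake GRANT statement for use as a post-write SQL statement."""

    def quote(identifier: str) -> str:
        # Double-quote the identifier and escape any embedded double quotes.
        return '"' + identifier.replace('"', '""') + '"'

    qualified = ".".join(quote(part) for part in (database, schema, table))
    return f"GRANT SELECT ON TABLE {qualified} TO ROLE {quote(role)};"


# Hypothetical object names, for illustration only.
print(grant_select_statement("ANALYTICS", "HR", "EMPLOYEE_RETENTION_SCORES", "HR_ANALYST"))
```

Because an overwrite may recreate the target table, re-applying the grant after each write is one way to keep downstream readers from losing access.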

Recipe Computation

In Dataiku, you can control the execution engine for your recipes. When your source dataset for a recipe is a dataset stored on Snowflake, you can choose either the ‘In-database (SQL)’ execution engine or the ‘DSS’ execution engine.

When the ‘In-database (SQL)’ recipe engine is available, it will typically provide the user with faster computation.

This computation engine translates your recipe into a SQL script which is then executed in the database where your source dataset exists; this means that Dataiku isn’t moving data outside of Snowflake, thus minimizing the time associated with read/write processes between these platforms.

	Small Dataset	Medium Dataset	Large Dataset
DSS	12s	312s	> 172,800s*
In-database (SQL)	5s	6s	24s

* the DSS recipe engine did not finish processing within 48 hours
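To illustrate what translating a recipe into SQL means, the toy function below turns a simplified description of a grouping recipe into a single SQL statement. This is only a sketch of the general technique, not Dataiku's actual translator; the function name, its arguments, and the table and column names are hypothetical.

```python
def group_recipe_to_sql(source_table: str, keys: list[str], aggregates: dict[str, str]) -> str:
    """Translate a toy 'group by' recipe description into one SQL statement.

    aggregates maps a column name to an aggregate function name,
    for example {"SALARY": "AVG"}.
    """
    select_parts = list(keys)
    for column, function in aggregates.items():
        select_parts.append(f"{function}({column}) AS {function}_{column}")
    return (
        f"SELECT {', '.join(select_parts)} "
        f"FROM {source_table} "
        f"GROUP BY {', '.join(keys)}"
    )


# Hypothetical table and columns, for illustration only.
print(group_recipe_to_sql("EMPLOYEES", ["DEPARTMENT"], {"SALARY": "AVG"}))
```

Only the generated statement travels to Snowflake and only the result is stored there, which is why the data itself never has to leave the warehouse.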

Not all recipe components are supported for ‘In-database (SQL)’ computation. In such cases, users can either write their desired transformation as an SQL code recipe (if they still wish to benefit from the ‘In-database (SQL)’ computation engine) or fall back to the DSS recipe engine.

It’s also important to highlight that Snowflake credits will be consumed, because the In-database (SQL) recipe engine passes the compute load to Snowflake.

Snowpark Integration

Snowpark is a developer framework that brings native SQL, Python, Java, and Scala support to Snowflake. In other words, you get Snowflake’s elastic and secure data processing when working in these languages.

In the video below, our team member, Data Engineer Alaisha Alexander, demonstrates how to integrate Snowpark with Dataiku via the Python Code recipe.

In this video, Alaisha highlights:

Snowpark effectively creates an In-database (SQL) recipe engine for your Python recipe, resulting in significant performance gains.
You can initiate Snowpark sessions using your existing Dataiku Snowflake connections.
Working in Snowpark can streamline your Python code as you don’t have to consider the implications of working with large datasets in memory.
Using Snowpark within the Dataiku platform will consume Snowflake credits.
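For readers who prefer text to video, the following is a minimal sketch of what a Snowpark-based Python recipe can look like. It assumes the DkuSnowpark helper from Dataiku's Python API and input and output datasets that both live on the same Snowflake connection; the function name, dataset names, and column names (DEPARTMENT, SALARY) are hypothetical, and the exact setup should be checked against Dataiku's documentation for your version.

```python
def build_department_summary(input_name: str, output_name: str) -> None:
    """Aggregate a Snowflake-backed dataset with Snowpark and write the result.

    Intended to run inside a Dataiku Python recipe. The imports are deferred
    so the function can be defined where these packages are not installed.
    """
    import dataiku
    from dataiku.snowpark import DkuSnowpark
    from snowflake.snowpark.functions import avg, col

    dku_snowpark = DkuSnowpark()

    # Builds a lazy Snowpark DataFrame on top of the existing Dataiku
    # Snowflake connection; no rows are pulled into Python memory here.
    employees = dku_snowpark.get_dataframe(dataset=dataiku.Dataset(input_name))

    # The aggregation is translated to SQL and executed inside Snowflake.
    # DEPARTMENT and SALARY are assumed column names.
    summary = employees.group_by("DEPARTMENT").agg(
        avg(col("SALARY")).alias("AVG_SALARY")
    )

    # Write the result to the output dataset, taking its schema from the DataFrame.
    dku_snowpark.write_with_schema(dataiku.Dataset(output_name), summary)
```

Inside a recipe, this would be called with the recipe's own input and output dataset names, for example build_department_summary("employees", "department_summary").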

	Small Dataset	Medium Dataset	Large Dataset
Python	13s	159s	4288s
Snowpark	5s	6s	24s

SQL Pipelines

SQL pipelines provide a mechanism for increasing the computational speed of your flow by minimizing the required read/write steps. With SQL pipelining turned on, Dataiku will, in effect, combine a series of consecutive recipe steps into a single SQL query for Snowflake to process.

The two images above visually represent this feature; the second image, on the right, shows that the intermediary datasets are no longer created.

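Whether an intermediary dataset is skipped or still materialized is controlled by the build virtualization option in the settings for that dataset: only datasets that allow build virtualization are skipped, so leave the option off for any intermediary dataset whose data you want written out.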
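The effect of pipelining can be sketched as query composition: instead of each step writing a table that the next step reads, the steps are folded into one statement. The function below does this with common table expressions. It is an illustration of the idea only; the SQL that Dataiku actually generates may be structured differently, and the function name, step names, and table names are hypothetical.

```python
def pipeline_to_sql(source_table: str, steps: list[tuple[str, str]]) -> str:
    """Chain (name, sql_template) steps into one query using CTEs.

    Each template contains '{input}', which is replaced by the previous
    step's name (or by the source table for the first step).
    """
    if not steps:
        return f"SELECT * FROM {source_table}"

    ctes = []
    previous = source_table
    for name, template in steps:
        ctes.append(f"{name} AS ({template.format(input=previous)})")
        previous = name
    return f"WITH {', '.join(ctes)} SELECT * FROM {previous}"


# Two consecutive recipe steps, expressed as hypothetical SQL templates.
steps = [
    ("filtered", "SELECT * FROM {input} WHERE ACTIVE = TRUE"),
    ("grouped", "SELECT DEPARTMENT, COUNT(*) AS N FROM {input} GROUP BY DEPARTMENT"),
]
print(pipeline_to_sql("EMPLOYEES", steps))
```

In this sketch the intermediate result named filtered exists only inside the query, which corresponds to an intermediary dataset that is never written to Snowflake.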

Maximize your investment in Snowflake and Dataiku

We have a robust team of top data scientists, data engineers, management consultants, and data experts who develop, refine, and deploy end-to-end analytics applications using cutting-edge techniques.

Both partners acknowledged our team’s skillset in 2023 when we achieved Snowflake Elite Partner status and were named Dataiku’s Partner of the Year.

Beyond taking advantage of the technical benefits of Snowflake and Dataiku, Aimpoint can support your organization in adopting analytic capabilities to help you further realize the full value proposition of these tools.

If you want to learn more about how Snowflake and Dataiku can help your business, complete the form below.
