Secure Data Sharing with Snowflake Data Clean rooms

In the age of data-driven digital transformation, ensuring the security of your business data has never been more important. While the value of data has long been understood, the need to share it safely has not always received the same attention. The growing tendency to share data has been driven by a broader movement toward openness in research, particularly where it is publicly funded, as well as by funders who require evidence of a better return on investment. For any business, data is a priceless resource that is generated, acquired, stored, and exchanged. By safeguarding it from internal and external malfeasance and unauthorized access, a corporation can avoid financial loss, reputational damage, a decline in consumer trust, and brand erosion. In this post, we will take a close look at how Snowflake provides secure data-sharing capabilities.

With Secure Data Sharing, you can share selected objects (tables, views, and user-defined functions) from databases in your account with other Snowflake accounts. Every shared database object is read-only across accounts: the objects cannot be modified or deleted, and table data cannot be added or changed. With Secure Data Sharing, no actual data is copied or transferred between accounts. All sharing is accomplished through Snowflake’s unique services layer and metadata store. This is an essential concept because it means that shared data does not take up any storage in a consumer account and, therefore, does not contribute to the consumer’s monthly data storage charges. The only costs to consumers are for the compute resources (i.e., virtual warehouses) used to query the shared data. Also, shared databases cannot be re-shared with other accounts.

Data sharing in Snowflake is live: the instant data is populated in a shared table, it is visible to the Snowflake account to which the share and the table have been granted. A Row Access Policy on the shared publisher table ensures that the live content is viewable only by approved queries. Sharing is not simply sending data across; it ensures that the data is shared with the right person, applied to the right role, and used for the right use case. Snowflake also supports data sharing across clouds and regions. Data remains well governed, and access can be revoked at the end of a project or contract, or whenever circumstances require.

To understand this process, we first need to get familiar with ‘Shares’. Shares are named Snowflake objects that encapsulate all of the information required to share a database. Each share consists of:

  • The privileges that grant access to the database and the schema containing the objects to share.
  • The consumer accounts with which the database and its objects are shared.
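As a sketch of the provider side, a share might be created and populated like this (the database, schema, table, and account names below are hypothetical):

```sql
-- Provider account: create the share and grant access to the objects in it.
create share sales_share;
grant usage on database sales_db to share sales_share;
grant usage on schema sales_db.public to share sales_share;
grant select on table sales_db.public.orders to share sales_share;

-- Add the consumer account(s) allowed to import the share.
alter share sales_share add accounts = myorg.consumer_acct;
```

The grants mirror ordinary role-based access control, except the grantee is a share rather than a role.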

Once a database is created (in a consumer account) from a share, all the shared objects are accessible to users in the consumer account. Shares are secure, configurable, and controlled 100% by the provider account. New objects added to a share become immediately available to all consumers, providing real-time access to shared data. Access to a share (or any of the objects in a share) can be revoked at any time.
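On the consumer side, importing a share is a single statement. A sketch, with hypothetical share and account identifiers:

```sql
-- Consumer account: create a read-only database from the provider's share.
create database sales_shared from share provider_org.provider_acct.sales_share;

-- Querying uses the consumer's own virtual warehouse (the only cost incurred).
select count(*) from sales_shared.public.orders;
```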

Now, if the objects to be shared reside in a single database, a share can be created for that database and access granted to all of its objects. But if the objects span multiple databases, we can create a secure view that references them and then create a share for that view; in this way, objects from multiple databases can be shared. Shared databases are read-only: users in a consumer account can view and query data, but cannot insert or update data, or create any objects in the database.
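For the multi-database case, a secure view can combine objects from several databases so they travel through a single share. A sketch with hypothetical names, using the REFERENCE_USAGE grant that shared secure views referencing other databases require:

```sql
-- Provider account: a secure view joining tables from two databases.
create secure view sharing_db.public.customer_orders_v as
  select c.customer_id, o.order_total
  from crm_db.public.customers c
  join sales_db.public.orders o on o.customer_id = c.customer_id;

-- The underlying databases need REFERENCE_USAGE; the view itself is shared.
grant reference_usage on database crm_db to share sales_share;
grant reference_usage on database sales_db to share sales_share;
grant select on view sharing_db.public.customer_orders_v to share sales_share;
```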

The following actions are not supported:

  • Creating a clone of a shared database or any schemas/tables in the database.
  • Time Travel for a shared database or any schemas/tables in the database.
  • Editing the comments for a shared database.
  • Re-sharing (forwarding) a shared database or any of the objects in the database with other accounts.

Initially, only the role used to create a database from a share can use that database. However, access can be granted to other roles through a special data-sharing-specific privilege named IMPORTED PRIVILEGES.

A role can grant IMPORTED PRIVILEGES on an imported database only when it either:

  • Owns the imported database (i.e. has the OWNERSHIP privilege on the database).
  • Was granted the MANAGE GRANTS global privilege.
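Granting access to another role in the consumer account is then a one-liner (the role and database names here are hypothetical):

```sql
-- Consumer account: let ANALYST_ROLE query everything in the imported database.
grant imported privileges on database sales_shared to role analyst_role;
```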

Creating Snowflake streams on shared objects (secure views or tables) enables you to track data manipulation language (DML) changes made to those objects. This functionality is similar to creating and using streams on “local” objects (i.e., objects in the same account as the stream).
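A sketch of a stream on a shared table, with hypothetical names; note that the provider must have enabled change tracking on the underlying table for this to work:

```sql
-- Consumer account: track DML changes made to a shared table.
create stream orders_stream on table sales_shared.public.orders;

-- Returns only the rows changed since the stream's current offset.
select * from orders_stream;
```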

Snowflake does not place any hard limits on the number of shares you can consume from data providers; however, you can only create one database per share. Access to this database is configurable using the same, standard role-based access control that Snowflake provides for all objects in the system.

Now, a situation may arise in which a company wants to share data with vendors who are not Snowflake customers. Reader accounts (formerly known as “read-only accounts”) enable providers to share data with such consumers without requiring them to become Snowflake customers themselves.

A reader account enables data consumers to access and query data shared by the provider of the account, with no setup or usage costs for the consumer, and no requirements for the consumer to sign a licensing agreement with Snowflake.

The reader account is created, owned, and managed by the provider account, which assumes all responsibility for credit charges incurred by users in the reader account. Similar to standard consumer accounts, the provider account uses shares to share databases with reader accounts; however, a reader account can only consume data from the provider account that created it. Adding new data to the account and/or updating shared data in the reader account is not supported.
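Creating a reader account is done from the provider account; the names and credentials below are placeholders:

```sql
-- Provider account: create a managed reader account for a non-Snowflake vendor.
create managed account vendor_reader
  admin_name = vendor_admin,
  admin_password = 'Str0ngPassw0rd!',
  type = reader;
```

The provider then shares databases with this account exactly as it would with any standard consumer account.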

Sharing data while adhering to privacy regulations has always been challenging. For example, if we know which customers saw our advertisement and later bought products, we can understand whether we are getting a return on our marketing spend. While these types of attribution analytics have been performed in the past, the California Consumer Privacy Act (CCPA) and the General Data Protection Regulation (GDPR) now block or complicate this analysis. By using data clean rooms, however, it is possible to collaborate on data in a secure manner that aligns with privacy rules.

Data clean rooms control what data comes in, how the data is joined to other data sets in the clean room, what types of analytics each party can perform, and what data can be exported. Traditional clean rooms require physical infrastructure; others rely on a trusted third party. But modern data clean rooms are not physical spaces and don’t require moving data into a different system or environment. In fact, that is one of the advantages of Snowflake. Your data doesn’t need to be moved; it can remain secure and governed within your Snowflake environment.

The Snowflake Data Clean Room is a framework, or design pattern, for secure multi-party collaboration built on key Snowflake capabilities. To put it another way, if you have a Business Critical or Enterprise edition Snowflake account, you already have everything you need to set up a Snowflake Data Clean Room.

Let us try to understand with the help of an example. In 1982, computer scientist and computational theorist Andrew Yao posed a secure multi-party computation problem known as Yao’s Millionaires’ Problem. It concerns two millionaires, Alice and Bob, who want to know which of them is richer without revealing their actual wealth. They want to share data to learn the underlying answer, but neither wants to disclose how much wealth they hold. In other words, Alice cannot ask what Bob’s wealth value is.

In a SQL environment, that disallowed question might look like this query:

select wealth from bob.wealth_table;

But since Bob has not allowed this type of query to run on his data, it returns no result. Alice can only ask how the two values relate to each other and get the answers Bob has specified. Before setting up the data clean room, the two parties discussed the matter, and Bob finalized the query below as an approved query:

select case
  when bob.wealth > alice.wealth then 'bob is richer'
  when bob.wealth = alice.wealth then 'neither is richer'
  else 'alice is richer'
end
from alice.wealth_table alice cross join bob.wealth_table bob;

When Alice runs this approved query, she gets a result.

This enables Alice’s data to be compared to Bob’s but without her sharing her wealth value and without Bob sharing his. If Alice tries to ask anything other than the allowed questions, she will get no results. This is precisely what a data clean room is.

Of course, companies do not want to compare their wealth, but there are other use cases. The capabilities of distributed data clean rooms are especially beneficial to advertisers and the media industry as we continue moving to a cookieless future. The most popular use case of data clean rooms is to link anonymized marketing and advertising data from multiple parties for attribution. Data clean rooms don’t allow data points that could be tied back to a specific user to leave the environment, giving organizations the ability to adhere to privacy laws.

Brands can leverage their growing first-party customer data with third-party enrichment to refine targets for marketing campaigns. Media companies can securely match their first-party subscriber data, across their own platforms, with brands’ customer data to improve ad placement. Agencies, or the advertisers or media companies themselves, can combine campaign logs, identity, attribution, and sales data to measure the performance of an ad campaign. Lookalike analysis finds lookalike segments to grow target audiences. And there are many more such use cases across multiple industries.

Now, it’s important to distinguish between traditional data clean rooms and distributed data clean rooms. With traditional data clean rooms, all data is stored in a single physical location, limiting how the data can be shared. With developments in cloud technology, distributed data clean rooms eliminate the need to move data from one location to another, since the data can live in the cloud. This allows each partner to control its own data while enabling governed analytics with another partner, or even with multiple other partners, simultaneously.

Now, let’s discuss some features of data clean rooms (DCRs) that make them useful. Snowflake’s data clean room can be multi-party, spanning different Snowflake accounts. This supports a larger ecosystem of participants: in advertising campaigns in particular, there is not just a single brand or media outlet but a complete ecosystem of agencies, partners, and other third-party data providers that come together to plan and analyze the data.

These DCRs can also be cross-region, so collaboration can be executed across regions. Snowflake is multi-cloud, meaning the participating parties in a Snowflake data clean room can join from a mix of AWS, Azure, and GCP platforms. This functionality is extended using Snowflake data replication between platforms but still follows the same data clean room deployment. A clean room can also be cross-instance. This gives us a lot of flexibility in how we structure the data clean room and how we execute the collaboration.

Snowflake features that help build a DCR are:

  • Row Access Policies - a Row Access Policy (RAP) is a schema-level object that determines whether a given row in a table or view can be queried by a Snowflake role or account. We will use RAPs to secure publisher user data by way of an approved query list (a mapping table); only an approved query from the consumer will return query results from the publisher.
  • Time Travel - Snowflake’s time travel functionality allows an approved statement to refer to a particular point in time, so that results cannot be narrowed around specific customers or records while still giving a good idea of the actual numbers.
  • Snowpark and Stored Procedures help to generate and validate query requests.
  • A Stored Procedure enables users to create modular code that can include complex business logic by combining multiple SQL statements with procedural logic.
  • Secure Data Sharing, for automatically and securely sharing tables between multiple Snowflake accounts without the need for data movement outside of Snowflake.
  • Streams & Tasks - a stream object records data manipulation language (DML) changes made to tables, and a task object is scheduled SQL code that runs on a time-based schedule. We will use the combination on a table shared by the consumer to poll for new entries in that table.
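To make the Row Access Policy idea concrete, here is a simplified sketch of a policy keyed to an approved-query mapping table. All names are hypothetical, and a production policy would typically match the full query signature rather than just the calling account:

```sql
-- Provider account: allow a row through only for the clean-room admin role, or
-- for a consumer account that appears in the approved-query mapping table.
create row access policy governance.policies.dcr_rap
  as (customer_id varchar) returns boolean ->
  current_role() = 'DATA_CLEAN_ROOM_ADMIN'
  or exists (
    select 1
    from governance.policies.approved_query_requests
    where consumer_account = current_account()
  );

-- Attach the policy to the publisher table being shared.
alter table publisher_db.public.customers
  add row access policy governance.policies.dcr_rap on (customer_id);
```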

Let us now summarize the quick steps that we need to follow for building a data clean room.

  • Get a Snowflake account for each company or group (or have an existing Snowflake customer provide a secure sub-account of their account).
  • Load data into Snowflake (done by each company or group).
  • Establish a Snowflake Private Data Exchange between the participants.
  • Configure secure functions and secure joins to protect the data.
  • Perform analysis of the joint data using standard analysis tools, while respecting each party’s privacy policies and boundaries.
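The fourth step above can be illustrated with Yao’s example: a secure SQL function hides its logic from the other party while exposing only the agreed comparison. An illustrative sketch, not a full clean-room setup:

```sql
-- A secure function: consumers can call it, but cannot inspect its definition.
create secure function compare_wealth(alice_wealth number, bob_wealth number)
returns varchar
as
$$
  case
    when bob_wealth > alice_wealth then 'bob is richer'
    when bob_wealth = alice_wealth then 'neither is richer'
    else 'alice is richer'
  end
$$;
```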

The architecture pattern is repeatable, meaning that if there is a database party 2 wants to share with party 1, a consumer of a data clean room can become a provider by implementing the same pattern in reverse. Ultimately, the responsibility for ensuring that approved queries do not reveal sensitive third-party data lies with the publisher, in collaboration with the consumer. This can be tested by the publisher and templatized for use by the respective consumer.

BluePi is one of the top providers of Snowflake services in India, offering businesses a wide range of Snowflake services and solutions, including data lake, data warehousing, data analytics, and data engineering. With its expertise in Snowflake, BluePi helps organizations leverage the full potential of the platform to drive digital transformation through data-driven insights. The company provides businesses with a comprehensive suite of Snowflake services and solutions, including Snowpark services, to help them manage and analyze their data more effectively.

Written by
Ayushi Kaushik

Associate Technical Lead
