When data is loaded into Snowflake, it is automatically divided into contiguous micro-partitions. Groups of rows in tables are mapped into individual micro-partitions and organized in a columnar structure. Tables are transparently partitioned using the ordering of the data as it is inserted or loaded. Because micro-partitions are small relative to the tables they compose, very large tables, which may consist of millions or even hundreds of millions of micro-partitions, can be pruned at an extremely fine granularity.
Additional Information on Snowflake Micro-Partitions can be found at:
- https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions
- https://community.snowflake.com/s/article/understanding-micro-partitions-and-data-clustering
We cover many of the various benefits associated with micro-partitioning in our Meetup event below:
Transcript from the Meetup event:
Thank you, everybody, for joining the Snowflake Micro-Partitions Carolinas Snowflake Meetup Group event. I am Mike Jelen. As is tradition here, we record these sessions and make them available within a couple of days on our YouTube channel, which you can access via our website, or you can obviously go directly to the source itself.
And so for today's introductions, we've got myself, one of the co-founders of AICG, and Christian Screen, who is also a co-founder. And we have Heather Pierson, who is a project lead and director of marketing for us.
So when you think of Snowflake, for some of you, Snowflake is near and dear to you. You’ve been on your Snowflake journey for a while. Others are either just getting started or perhaps you’re thinking about getting started.
Of course, everyone is welcome, always welcome. Thank you again for joining us. When you think of Snowflake as a whole, lots and lots of different data sources out there. Transactional applications, they could sit up in the cloud, they could be on-premise, you have a combination of them.
You can have lots of data out there. You might be an IoT organization, you might be an organization that is cranking out tons of log data and you’re trying to make sense of that log data from an application or perhaps even from networking equipment.
And you're just trying to centralize all that data into the Snowflake Data Cloud. That integration could be data streaming, that is, real-time. Or you might be saying, hey, I only want to bring over the data that has changed.
I don't want to replicate the full source every time. And then you're transforming that data to fit your different consumers. You may have consumers doing operational reporting because they're hitting a data warehouse of some sort, or maybe a data mart as well.
And you’re also looking to share that data with a wider audience, much like the operational reporting. You may have data scientists that want to look at more raw data because they are going to come to their own conclusions.
They don't want aggregated data, they don't want massaged data that has had business rules applied. They want to go and try to find the anomalies or patterns that they are looking for at the end of the day.
So all those different user groups, Snowflake caters to all of them. But naturally, the most important part is that you control what data gets accessed by whom, and you also control how many resources they are able to use when consuming that data.
So when you think of a micro-partition, what is it, why is it important? Should I care? Should I not care? Well, that's what today's topic is. A micro-partition is a way to organize data at the very root level of how the data is stored behind the scenes in Snowflake.
When you think of Snowflake, Snowflake at the end of the day is essentially a service that sits on top of highly compressed and encrypted columnar files, similar in spirit to Parquet. And the way that Snowflake accesses those files is that it breaks things up into small chunks based upon how the data has been loaded and then how the data is used.
And those small chunks are the micro partitions. And so that enables you to very quickly and cost-effectively access large amounts of data, especially when you have a query or an analysis that is not necessarily doing a select all from these giant tables.
You’re looking for very specific KPIs or information that you’re teeing up in a digestible format, in a human-readable format. So to show all the data across a very wide table doesn’t make sense, that’s not helpful for the user.
So micro-partitions help from that storage perspective, and really with accessing that storage. Micro-partitions are automatically created for all tables upon creation and as data is loaded into them.
And really it's all about the sequencing of the data as it arrives. This is the default behavior that Snowflake micro-partitioning has. And so we'll talk about what to do if the default behavior isn't exactly what you're looking for.
Well great, that is also handled here. When you think of how to apply your own clustering on top of micro-partitions, you have the CREATE OR REPLACE statement within a Snowflake worksheet, for example, where you're defining the clustering key, as in the sketch below.
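A minimal sketch of what a statement like that looks like (the table and column names here are hypothetical, not the ones on the slide):

```sql
-- Hypothetical example: define the clustering key at table creation time
CREATE OR REPLACE TABLE sales (
    sale_id     NUMBER,
    customer_id NUMBER,
    region      VARCHAR,
    sale_date   DATE
)
CLUSTER BY (sale_date, region);  -- overrides the default load-order micro-partitioning
```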
So this is where you are dictating what that clustering should be, setting it yourself instead of using the default based upon how the data is entered into or loaded into the table. So let's go down even further and get into the brass tacks of what's really happening to the data.
So you have a logical structure when you think of a particular row, where the Type column has a value of 4, Name has a value of C, Country is SP, and the date is 11/2. Then we break down how these partitions store that data.
It’s breaking things up and really referencing data elements, especially data elements that are repeated within how the data is actually physically stored. So again, going back to how the data is stored in those parquet files and what is the reference to those data elements through the use of micro partitions.
This is really the brains of what’s happening behind the scenes where generally it’s invisible to your users. And the reason that we say generally invisible is that when you think of all right, here’s how Snowflake works, Snowflake automatically handles all of that for you.
This isn’t always the case for other databases.
If you're new to Snowflake, you may be coming from a legacy database technology. You might be thinking, oh my goodness, I've had to add all of these different partitions, or just partitioning in general. We had to tell the database exactly how we wanted things to be, and so we had to write all those SQL statements to get even default behavior, because the default behavior was to do nothing and just leverage the hardware, the RAM and CPU.
That's not the case with Snowflake. You have a lot more options at hand. And so when we think of some of these different use cases, we mentioned, hey, I'm querying a large table. I don't need all of the maybe hundreds of columns of a very wide table.
For whatever reason, and no judgment on why you have a table that wide. But if you do, okay, a large table, well, that table could be wide. It also could have hundreds of millions, billions, even trillions of records in there.
And you of course are not looking at all the data at once. You’re not going to scan all the data at once either. You have a very particular lens as to the data that you’re looking at. Typically you think of a fact table as being very large.
The queries that you send to the fact table are naturally going to have dimensional types of attributes. And so when you think of how this data should be partitioned or clustered, you can leverage the default behavior; but if you're seeing issues with that, or if you know issues are going to come about and you want to get ahead of the game, you can go ahead and apply your own clustering based upon how your users are really going to consume that data.
Because naturally, that will increase performance, giving them the fastest possible queries at the lowest cost. Especially when you think of it from a time-series perspective, you're looking at quarter to date, month to date, year to date, maybe year over year, or quarter over quarter.
You're trying to look at some of that trend data, maybe multi-year or multi-quarter. You want that data to come back subsecond, and even though that time-series data is being generated on the fly, your micro-partitions really help enable that and delight your users from a performance perspective, as in the sketch below.
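As a hedged illustration (hypothetical table and column names), a large fact table that is mostly filtered by date can be clustered on a date expression so those quarter-to-date and year-over-year queries prune well:

```sql
-- Hypothetical fact table clustered for time-series filters
ALTER TABLE fact_orders CLUSTER BY (TO_DATE(order_ts));

-- A typical query that benefits: only micro-partitions covering Q1 2024 need to be scanned
SELECT DATE_TRUNC('month', order_ts) AS order_month,
       SUM(order_amount)             AS total_amount
FROM   fact_orders
WHERE  order_ts >= '2024-01-01' AND order_ts < '2024-04-01'
GROUP BY 1
ORDER BY 1;
```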
So think of some of the benefits of this. One of those benefits is that it's automatically derived. We talked about how there is a default as part of just creating the table; you don't have to specify any micro-partitioning on there.
It will be based upon the order and the sequence of the data, and it's automatically generated for you. You can, of course, recluster. That's where some of those SQL statements we saw earlier allow you to apply a very specific clustering approach, which naturally leads to faster queries and more efficient storage of those micro-partitions.
Once you apply that clustering, whether you're scanning individual columns or looking at multiple columns, all that data is always compressed. It's compressed and encrypted at rest. But then when you look at individual columns, they too are compressed within those micro-partitions.
When you think of other cloud data warehouses, we mentioned a little bit earlier that with traditional legacy technology, this is all very manual. Typically there's a point of pain: my users are complaining about slow performance, or maybe I notice slow performance, or jobs are taking a very long time to run, and I need to do something. So this is part of the troubleshooting process, and, well, it's great that other data warehouses let you feel that pain so that you can go address it.
But it'd be much nicer to not have that pain at all, because your cloud data warehouse automatically applies this tuning to help mitigate some of the performance issues. On the flip side, perhaps performance is okay, but you notice your costs are increasing.
And this is for other cloud data warehouses that we typically have come across where we see, oh boy, our cloud compute for this particular vendor’s data Cloud is really expensive. I don’t know what happened.
Our users ran some crazy queries, or maybe they’re inefficient queries. How do we go and tune that? Well, you use that particular other vendor’s tuning capabilities and you apply the appropriate resources.
You're already on the hook to pay that larger bill with that other cloud data warehouse. That ship has sailed; well, great, let's not make that mistake a second time. But you're going to see the dollar amounts stack up that first time, which forces you into a more reactive posture.
The Snowflake Data Cloud has that automatic partitioning so that you don't have the experience of, oh my goodness, I consumed too many resources and now it's gotten really expensive. At the end of the day, from a concurrency perspective, those other cloud data warehouses end up reducing concurrency because there's just a lot of activity going on from inefficient queries running without partitioning, which of course leads to a lack of flexibility and a poor end-user experience.
Snowflake flips the script and lets you avoid all of that, and then it provides even more ways to give you the flexibility and performance that you need at the end of the day.
So, Mike, we have a quick question in the Q&A, and I think Christian is going to give a quick demo, too. The question is: when existing data is updated, does Snowflake update partitions accordingly?
Yes, it does. I’ll give an answer, and then, Christian, you can give an answer too, here. Yes, it does. Upon loading data into the table, and each time data is loaded, the partition gets updated. Now, the statement back here that we have right here, you’re able to set your own partition.
So what this would do, if you're replacing the table, is drop all of the micro-partitions and then recreate them based upon the CLUSTER BY that you've added to the table. Christian, anything to add to that?
No, I mean, that's spot on. Right. And one interesting thing about Snowflake, and really any SaaS application or platform as a service, is that Snowflake is in complete control of how this works. So if you follow the release notes, there are some things they've changed over the last five years as it relates to micro-partitioning and automatic versus manual clustering.
So for the most part, things work just like you said, Mike. And then, of course, as compute and systems improve, one year from now it could be something a little different. Right. But currently, that's how we've seen it working with customers and with our systems and solutions as well.
Perfect. I think there’s another question out there, if I’m not wrong, Heather, but just real quick, let me see if I can share my desktop. We’ll see if we can just kind of do a super high-level walkthrough.
Yeah. So when you think about micro-partitioning and clustering, like Mike said, the micro-partitioning is pretty much happening by default, by the Snowflake engine, if you will. So let's just kind of step through this.
I’m just going to go ahead and kind of walk through what I’m doing here. So just a very basic demo here, just picking my role and my warehouse and the database. And I’m going to go and clean up here the schema that I was working in.
And so here I'm just going to create a super basic table called Mygeo, which kind of aligns with the diagram that was in one of the earlier slides. Very basic structure, about four columns, very basic data types.
And the data types can be really any data type. Obviously, something like a variant data type is probably not the best for clustering by, but again, always consult the documentation and so forth. So we’re just using some very simple data types here.
So let's go and create that basic table. And then I'm just going to add about three rows to it. And so the micro-partitioning concept is already in play. A lot of it is just really behind the scenes.
Actually, pretty much everything’s behind the scenes. But by using a few commands, we get insight into what’s going on with the clustering to some degree with the partitioning, but we’ll just kind of keep it simple for this particular walkthrough.
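A minimal version of that demo table might look like this (the Mygeo column names follow the earlier description; the data types and sample values are assumptions):

```sql
-- Basic demo table, clustered on a couple of columns
CREATE OR REPLACE TABLE mygeo (
    type    NUMBER,
    name    VARCHAR,
    country VARCHAR,
    my_date DATE
)
CLUSTER BY (my_date, country);

-- A few sample rows; micro-partitioning happens behind the scenes on load
INSERT INTO mygeo (type, name, country, my_date) VALUES
    (2, 'A', 'US', '2023-11-01'),
    (4, 'C', 'SP', '2023-11-02'),
    (1, 'B', 'UK', '2023-11-03');
```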
So here we’ll just check out the descriptions on the table. So a couple of things you can notice here, obviously we have the description of the table and then you’ll notice here it actually provides us with the clustering and really the clustering type.
But again, not going into great detail. But you see we actually have a column called Type, if that’s not confusing, name, country and My date. And then of course we’ve just got some basic things and you can kind of see here already, right we’ve got automatic clustering.
Turned on basically by default. We love to look at these tables. I recommend anybody who’s a Snowflake user, especially someone who’s technical and has to answer lots of technical questions. You really kind of do investigate some of the show descriptive features of Snowflake as often because it’s really interesting to see it change from a version to version, even not even a major version, like a quarter version, when you don’t think it would change.
So things like change tracking, right? So when is that going to become a full-fledged feature? Obviously, we already know about search optimization and some other things, and obviously the internal versus external flag for the table type.
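The descriptive commands from this part of the walkthrough are roughly the following (the exact set of output columns, such as cluster_by, automatic_clustering, and change_tracking, can vary between Snowflake releases):

```sql
-- Table-level metadata, including the clustering key and automatic clustering flag
SHOW TABLES LIKE 'MYGEO';

-- Column-level detail for the table
DESCRIBE TABLE mygeo;
```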
So then if we wanted to just kind of look at a few things that are kind of descriptive about the clustering itself, right? So if we’re kind of thinking about the average depth of how the table is being clustered based on the specified columns, right, or kind of like the clustering key.
And so it's really, if you think about it, the average depth of the clustering based on the population of the table. When you're looking at a clustering depth and it gives you a smaller number, it typically means the table is clustered better. Now, there were some other concepts like clustering ratio; we could go into those and where they stand today, but let's just look at these two, right?
So let's look at clustering depth. We know we have a cluster and it's a very simple table, so I have a one, right? It's always going to be at least one for a populated table. But again, the smaller the average depth, the better clustered the table is, right?
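The depth check itself is a one-liner against the demo table:

```sql
-- Average clustering depth for the table's defined clustering key;
-- the floor is 1 for a populated table, and smaller generally means better clustered
SELECT SYSTEM$CLUSTERING_DEPTH('mygeo');
```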
And so we'll come back to, well, what does it mean if it's a higher number, and what do you do about it? All right, so then let's look at clustering information. There are some other parameters we could pass into this, for the actual columns. So you could pass in, for example, the schema-qualified table name, and then the next parameter could be a column name, or a list of columns.
But what we’re looking at here is we get the average clustering depth. We get a lot of other data in JSON format. So that’s really cool and interesting to think about. Right. It’s coming in JSON format, so there’s a lot you can do with that from like an object parsing perspective.
But we’ll keep it high level for now because this could be a deep topic. And so all this information allows us to then understand the table with greater insight. We can kind of see over here, we can kind of see that full breakout of what we have available to us, right?
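The information call looks roughly like this; the optional second argument lets you ask about an explicit column list rather than the defined clustering key:

```sql
-- JSON describing clustering for the defined clustering key
SELECT SYSTEM$CLUSTERING_INFORMATION('mygeo');

-- Same function, evaluated for a specific set of columns
SELECT SYSTEM$CLUSTERING_INFORMATION('mygeo', '(my_date, country)');
```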
We could actually build out a full histogram if we wanted to look at average depth. We can look at partition count and some other details just related to this particular clustering key. And so the question might be, well, if we do have a certain depth in our cluster, what will we do about it?
And so the answer to that, as far as I know anyway, is to kind of take a look at the concept of reclustering. So why is reclustering important? Well, when you create the cluster on the table, the data will continue to be incrementally added to the table.
And so the micro partitions are happening automatically. The clustering is enabled, and so the data rows will continue to increment in your table. So let’s say, for example, you’re adding data with dates so you could enter a row with the year 2007 and then one with 2001 and then one with 2020.
And so you need that data in your table, and your cluster is still there. But when you recluster, you're giving the micro-partitions the ability to organize the data in an ordered fashion. So that's one of the benefits of using the RECLUSTER option in the ALTER TABLE statement.
And so it’s very interesting, we’d have to do probably a little bit larger of a sample set to actually see that in action. But the core of the concept is there, and it’s really kind of measured by the average depth, the clustering depth, if you think about it.
So that's a high level of both clustering and kind of how micro-partitioning works. And then, of course, how do we analyze that over time as we put more and more data into that clustered table? And, for lack of a better term, how do we repartition or recluster that table so that the performance we get from that clustering is maintained as the table grows?
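The manual statement being referenced is sketched below; note that manual reclustering has since been deprecated in favor of Automatic Clustering on newer accounts, so check the current documentation for what applies to yours:

```sql
-- Manual recluster of the demo table (deprecated in favor of Automatic Clustering)
ALTER TABLE mygeo RECLUSTER;

-- With Automatic Clustering, you can instead suspend or resume the background service
ALTER TABLE mygeo SUSPEND RECLUSTER;
ALTER TABLE mygeo RESUME RECLUSTER;
```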
Okay, so I’ll wrap that up there and turn it back to you, Mike. All right, perfect. Does anyone else have any additional questions for today? So we’ve got about four questions in the queue. Oh, nice.
Right now, one is about cost. So there's a question here. Well, I'm sorry, I'm skipping one. What are the best ways to select the key, or keys, for a partition, or rather for a cluster? Adding on to that question next.
It's a great question. I'll kind of take the first stab at that one. So there's a couple of different ways you can go about it. Obviously, there's trial and error, which many people have done with varying degrees of success.
And then of course, there's another way that's actually recommended by Snowflake. And we'll put the link, if we can, into the webinar notes, if not this actual chat window here. But there's basically a standard SELECT statement that you can run against the table, and that's going to help you identify the best keys to actually cluster by, so that you can then determine if you need to drop one of the keys or columns from the cluster or add another one.
And again, also reverting back to the clustering information: the SYSTEM$CLUSTERING_INFORMATION function will then also help validate that as your data grows, or as you do, let's call it, a trial run against that particular set of clustering keys that you've applied.
And like I said, we’ll find that link to the documentation and put it in here as well. Mike, anything to add to that one? No, that was good. Yeah, I actually found it over here on the side here. In general, it’s all about the filter, right?
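As a hedged illustration (this is not the specific query from the Snowflake docs mentioned above), you can at least compare how the table would cluster on a candidate filter column, and sanity-check its cardinality, using the demo's Mygeo columns:

```sql
-- Clustering quality if we clustered on a candidate filter column instead
SELECT SYSTEM$CLUSTERING_INFORMATION('mygeo', '(country)');

-- Cardinality check: extremely low or extremely high distinct counts
-- usually make for poor clustering keys
SELECT COUNT(DISTINCT country) AS distinct_countries,
       COUNT(*)                AS total_rows
FROM   mygeo;
```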
So when I said select statement, it's all about the filtering. The next one is the cost implications of partitions. So what are the cost implications of micro-partitions? Mike, would you take a stab at that one?
Yeah, I can definitely give a generalization on the cost implications when you add clustering yourself. Naturally, your costs will be even more optimized than they were with the default micro-partitions.
But when you think of, oh, is there a cost to the data that’s stored, it’s so statistically insignificant. But really there’s the cost of the time. If you don’t like the default micro partition and you want to add your own or add your own cluster, taking time to figure out what that is and applying that, that’s where your cost of time will come in for that.
So, I don't know, Christian, is there anything in addition that you would add to that one? Yeah, I mean, Snowflake is all about consumption-based pricing, right? I think we all know this, it's no shock.
So pretty much anything you do consumes credits. Because, again, if you think about it: you create a table like normal, and when you want the micro-partitioning to have an impact through clustering, you're either going to create the table with the cluster keys, or you're going to have an existing table and say, hey, this table could be more performant.
And then you're going to add the cluster keys, so you'll run an ALTER TABLE and add them, right? That's typical. There are lots of things you can do in Snowflake; you can do your COPY and everything, but it's standard SQL you're going to run.
You create your table with the cluster keys, or you create your table, find out you need cluster keys, and run an ALTER TABLE statement to add them. That's the standard operation; call it the first step.
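A sketch of those two paths, reusing the hypothetical sales table from earlier:

```sql
-- Path 1: define the clustering key up front (see the earlier CREATE OR REPLACE example)

-- Path 2: the table already exists, so add or change the clustering key afterwards
ALTER TABLE sales CLUSTER BY (sale_date);

-- And if a key turns out not to help, it can be removed entirely
ALTER TABLE sales DROP CLUSTERING KEY;
```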
Obviously, there are some credits consumed in just going through that process of having the table, getting the data in the table, and then once you then recluster, that’s going to consume other credits.
When I think of micro-partitioning, I just think of the clustering concept, because the micro-partitions, like Mike said, are really just, for the most part, and I'm doing air quotes here, automatic, right?
It’s the reclustering that helps kind of organize those micro partitions and such. So anything you do, there’s going to be a cost impact, whether it’s the compute or obviously there’s a storage implication there as well.
So there are some costs there for storage also. Mike, another question here: do we need to update statistics on the partitions, or are they done automatically? Statistics? Somebody must be coming from an Oracle background.
Yes, definitely Oracle, or even Netezza. Definitely. Statistics do exist within Snowflake, though they're handled automatically for you. A lot of that is done within the cloud services layer of the Snowflake architecture, but it's not something that you need to update yourself.
Micro partitions, those are created by default. If you add additional clustering there at that point, you definitely are updating the metadata, but you’re purposely updating that for your individual situation.
But there's nothing else to do outside of what I mentioned there. It's not something that you interact with in order to say, oh, what are the statistics? It's a concept that really isn't exposed in the classic manner the way it is in Netezza or Oracle.
Christian, anything to add to that one? Yeah, that's good. That's spot on. Yeah. He says he is from an Oracle background. Perfect. No worries. Great question. Christian and I each have a very deep Oracle background as well, before we got into Snowflake six years ago.
Next one up. So he says he clustered a table with an INSERT OVERWRITE with an ORDER BY, and it decreased the total number of partitions. Is that because the partitions were overlapping in the data being stored?
That’s a good question. I don’t know if you understood the question, Mike, but I think I read it correctly. I think that the gist is there was clustering on a table insert override with an order by, and it decreased the total number of partitions.
Is that because the partitions were overlapping in the data being stored? I don't know if I could answer that one well enough to say that that's factually what was happening, but it could be, right? Because the whole idea, especially if you reclustered, which I think is what he said, the whole idea of a recluster is that it's going to clean things up. I'm using that as a high-level term, not a very technical one. It cleaned up your micro-partitions so that the ordering, based on this kind of idea of a min and a max per partition, could have been consolidated and increased performance that way.
So I'm not sure I'm answering that question or understanding it fully. But yeah, there are some benefits there. I think, just in general, if there's a table that anyone looks at, and let's say it's got a few columns or it's a large table, and upon querying it, particularly with a filter, right, a WHERE clause with some sort of predicate, you kind of look at that table and say, I don't know why this table is performing slowly, even on a small warehouse. Then, as a Snowflake practitioner, one of the first steps should be, hey, let's look at clustering keys on this thing and see what it'll do.
And then obviously, once you do the clustering and data starts flowing in, your depth is going to change. So your clustering depth needs to be monitored and then adjusted accordingly. So there's maintenance there.
I think one of the cool things, and we were going to try to put this into the demo, I think we just ran out of time, but maybe a future session was to look at the Snowflake tasks. And so you would have a table that’s incrementally adding rows to the table.
Then you have a task that’s running every whatever, probably for the demo, every 1 minute. But let’s just say every hour in a real-world situation, maybe every day, and it’s looking at that table, it’s running the clustering depth.
If your depth is greater than whatever, that’s your threshold. Let’s say the depth can range between one to ten, and you look at your clustering depth and it’s at a five, then the task automatically runs the recluster for you.
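A rough sketch of that pattern, with a hypothetical threshold and warehouse name, using a Snowflake Scripting procedure for the conditional (the recluster statement itself is the manual form discussed earlier, which may be deprecated on your account):

```sql
-- Procedure: recluster the demo table only when its average depth crosses a threshold
CREATE OR REPLACE PROCEDURE recluster_if_needed()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  depth FLOAT;
BEGIN
  depth := (SELECT SYSTEM$CLUSTERING_DEPTH('mygeo'));
  IF (depth > 5) THEN                 -- hypothetical threshold
    ALTER TABLE mygeo RECLUSTER;      -- manual recluster, as discussed above
    RETURN 'Reclustered, depth was ' || depth;
  END IF;
  RETURN 'No action, depth is ' || depth;
END;
$$;

-- Task: run the check on a schedule (daily here; every minute would suit a demo)
CREATE OR REPLACE TASK recluster_check_task
  WAREHOUSE = demo_wh                 -- hypothetical warehouse name
  SCHEDULE  = '1440 MINUTE'
AS
  CALL recluster_if_needed();

-- Tasks are created suspended, so resume it to start the schedule
ALTER TASK recluster_check_task RESUME;
```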
And so on a daily basis, your table is automatically kind of cleaning itself up. All right, let’s see. And I think one of the last questions here is, Mike, how do these partition keys, how are they managed in terms of updates and deletes?
When you say, just like, what happens to the keys when you add more? When you add more, it automatically gets handled; that's just a built-in feature. I can only go on the theoretical, presuming the deletes get removed.
That's a very general, vague way to answer the question. Yeah, yeah, same. You know, if you drop a key or add a key, obviously the Snowflake engine has basically taken over at that point.
I'm not even sure if there's any history. These are great questions; I've never tracked the history to see, because we've done some of this where we drop a key, we add a key, but for the most part we kind of move on, right?
So this is part of Snowflake where I’d say, well, it’s automatically done, but it is a bit magical because well, can I view or see what’s done? There is no way to get to that uber-level of granularity.
It just happens. And Snowflake would say, oh, that's part of our self-tuning, simple-to-use data warehouse platform. And so there's a bit of a leap of faith to say, okay, I'm trusting that it's happening.
I know that isn't a warm fuzzy, but that's maybe one of those catch-22s at times with something that is so self-tuning and self-managed. Yes. And there are small things like that, right? Because I think there's somewhere in the Snowflake docs where it talks about how, even when you change the cluster key, let's say you're going to add a new key, you drop one key and you add another.
It doesn’t affect your table as quickly. It’s not immediate. Right? Immediate could vary for Snowflake, right? Because you could be in a particular Snowflake account where you do this and 5 seconds later your whole table is refreshed and it’s impacted and you could be in another account and it could take a bit longer.
I don’t know what longer is because it’s all behind the scenes, right? But, in the Snowflake engine. So that’s something also as well. If you did drop one and you added one or two and you’re expecting faster results immediately, you might have to give it a few moments and then make sure that those keys are actually added properly and affect the rows in the table.
And I think the only way you’ll see that is through the actual querying. Or you could check the clustering information as well, I think. All right. And then will there be any downtime or performance issues while reclustering?
Yeah, I didn’t even see that question, but I think I kind of answered that a little bit. The Snowflake engine is automatically doing this kind of auto clustering thing behind the scenes, kind of at all times, right, to some degree.
And I think I read somewhere a while back also that if you add a cluster key, but the Snowflake clustering engine doesn't think it's going to help the table at all, it has the potential to not have an impact on the micro-partitions.
So I think there are just some things there that are going to be your-mileage-may-vary type things, right? There will be some trial and error, and there might be a point where you're not going to get that table to perform any better through clustering, or through micro-partitioning, or auto-clustering. You're going to have to go with an increase in warehouse size or something like that. But it's definitely a great solution, as we're talking about here.
But again, your mileage might vary, and again, definitely refer to the docs on some of those finer details, I would suggest. I agree with that. That’s good. Yeah. Got you. All right. Well everybody, fantastic questions here.
Thank you so much for asking them, as always. Let's get back in and close out the presentation. Some of you may have additional questions that have come up or will come up; as always, we'd love to help, and we love to chat.
Feel free to ping us through our website or our social media. It's always good to chat with others. For some of this stuff there isn't always a perfect manual; there are definitely some situational things.
So that's why we offer up the free time to get to know you and your situation and provide our advice and thoughts on that. Now, thinking of some other upcoming things Snowflake has this time of year: the Data for Breakfast events, held all around the globe in lots of different places, major cities and major locations where Snowflake has a people presence, which at this point seems to be just about everywhere, given how large of an organization they are.
But definitely check out Data for Breakfast. You get to meet with other people in your local area, and naturally those other people are going to be current Snowflake customers, along with Snowflake technical and salespeople.
And then you also get to hear from other Snowflake customers. Usually, there’s one headline Snowflake customer that is talking about their Snowflake journey. So it’s all about them and their Snowflake journey.
It's not listening to a sales pitch. You're really going to find a lot of value in it, especially when you get to meet with your peers and do some networking. That opens up lots of doors for you from both a company and an individual perspective as well.
Roughly a month from now, we have another webinar, talking about sending emails and communications out via Snowflake. This is not an uncommon area for questions: how to, how should I, and how can I do certain things with bursting out communications based upon certain events within Snowflake itself.
So head to our website, or I should say, head to the meetup group. You’ll find the registration information out there and we look forward to seeing you there at the latest, if not sooner. So thank you, everybody, for joining us today.
We’ll go ahead and make this available via our YouTube channel in the next day or so, but otherwise, feel free to reach out to us. And thank you everybody and have a good night.