Database Sharding: Part 1

Oracle University Podcast

Aug 20 2024 • 22 mins

In this two-part episode, hosts Lois Houston and Nikita Abraham are joined by Ron Soltani, a Senior Principal Database & Security Instructor, to discuss the ins and outs of database sharding. In Part 1, they delve into the fundamentals of database sharding, including what it is and how it works, specifically looking at Oracle Database Sharding and its benefits. They also explore the architecture of a sharded database, examining components such as shards, shard catalogs, and shard directors.
Oracle University Learning Community: https://education.oracle.com/ou-community
Special thanks to Arijit Ghosh, David Wright, and the OU Studio Team for helping us create this episode.
--------------------------------------------------------
Episode Transcript:

00:00

Welcome to the Oracle University Podcast, the first stop on your cloud journey. During this series of informative podcasts, we’ll bring you foundational training on the most popular Oracle technologies. Let’s get started!

00:26

Nikita: Hello and welcome to the Oracle University Podcast. I’m Nikita Abraham, Principal Technical Editor with Oracle University, and with me is Lois Houston, Director of Innovation Programs.

Lois: Hi there! The last two weeks of the podcast have been dedicated to all things database security. We discussed why it’s so important and looked at all the new features related to database security that have been released in Oracle Database 23ai, previously known as 23c.

00:55

Nikita: Today’s episode is also going to be the first of two parts, and we’re going to explore database sharding with Ron Soltani. Ron is a Senior Principal Database & Security Instructor with Oracle University. We’ll ask Ron about what database sharding is and then talk specifically about Oracle Database Sharding. We’ll look at the benefits of it and also discuss the architecture.
Lois: All this will help us to prepare for next week’s episode when we dive into each 23ai new feature related to Oracle Database Sharding. So, let’s get to it. Hi Ron! What’s database sharding?

01:32

Ron: This is basically an architecture to allow you to divide data for better computing and scaling across multiple environments instead of having a single system performing the work. So this allows you to do hyperscale computing and other different technologies that are included that will allow you to distribute your queries and all other requests across these multiple components to be able to get a very fast response.
Now many times with this distributed segment across each kind of database that is called a shard allow you to have some geographical location component while you are not really sharing any of the servers or the components. So it allows you separation and data management for each of the shards separately. However, when it comes to the application, the sharded database is totally invisible. So as far as the application is concerned, they connect to a global service, submit their statements. Everything else is managed then by the sharded database underneath.

With sharded tables, basically it gets distributed across each shard. Normally, this is done through horizontal partitioning. And then the data depending on the partitioning scheme will be distributed across like server A, server B, server C, which are independent servers that are running independent databases.

03:18

Nikita: And what about Oracle Database Sharding specifically?

Ron: The Oracle Database Sharding allows you to automate how the data is distributed, replicated, and maintain the kind of a directory that defines the complete sharding scheme, while everything is distributed across many servers with no sharing whether the hardware or software. It allows you to have a very good scaling to be able to scale based on this partitioning across all of these independent servers.

And based on the subset and the discrete data configuration, you can go ahead and distribute this data across these components where each shard is an independent data location or data component, a subset of data that can be used, whether individually on its own or globally across all of the shards together. And as we said to the application, the Oracle Database Sharding also looks as a single component.

04:35

Lois: Ron, what are some of the benefits of Oracle Database Sharding?

Ron: With Oracle Database, you basically have linear scaling capability across as many shards as you like. And all of the different database configurations are supported with this. So you can have rack databases across the shards, Oracle Data Guard, GoldenGate. So all of the different components are still used to give you all of the high availability and every other kind of functionality that we generally used to having a single database with.

It provides you with fault toleration. So each component could be down. It could have its own replicated data. It doesn't affect other location and availability of the data in those other locations.

And finally, depending on data sovereignty and configuration, you could actually distribute data geographically across the different locations based on requirements and also data access to provide a higher speed for local data management.

05:46

Lois: I’d like to understand more about the architecture of Oracle Database Sharding. Ron, can you first give us a broad overview of how Oracle Database Sharding is structured?

Ron: When it comes to dealing with Oracle Database architecture, the components include, first, your shards. The shards-- each one is an independent Oracle Database depending on the partitioning you decide on a partition key and then how the actual data is divided across those shards.

06:18

Nikita: So, these shards are like separate pieces of the database puzzle…Ok. What’s next in the architecture?

Ron: Then you have shard catalog. Shard catalog is a catalog of your sharding configuration, is aware of all of the components in the shard, and any kind of replicated object that master object exists in the shard catalog to be maintained from there.
And it also manages the global queries acting as a proxy. So queries can be distributed across multiple shards. The data from the shards returned back to the catalog to group together and then sent back to the client.

Now, this shard catalog is basically another version of an Oracle Database that is created independently of the shards that include the actual data, and its job is to maintain this catalog functionality.

07:19

Nikita: Got it. And what about the shard director?
Ron: The shard director is like another form of a global service manager.

So it understands the sharding by being able to access the catalog, knows where everything exists. The client connection pool will hit the shard director. In general, communication and then whether it's being distributed to the shard catalog to be able to proxy it, or, if the key is available, then the director can send the query directly to the shard based on the key where the data exists. So the shard can then respond to the client directly. So all of the connection pool and the components for global administration, generally managed by the shard director.

08:11

Nikita: Can we dive into each of these components in a little more detail? Let’s go backwards and start with the shard director.

Ron: The shard director, as we said, this is like a global service manager. It acts as a regional listener where all of the connection requests will be coming to the shard director and then distributed from that depending on the type of connection that is being used.

Now the director understands the topology--maintains the complete understanding of the mapping of the data against the shards. And based on the shard key, if the request are specified on the specific key, it can then route the connection request directly to the shard that is appropriate where the data resides for the direct response.

09:03

Lois: And what can you tell us about the shard catalog?

Ron: The shard catalog, this is another Oracle Database that is created for special purpose of holding the topology of the sharded database. And have all of the centralized information metadata about your sharded database. It also act as a proxy.
So, if a client request comes in without providing a shard key, then the request would go to the catalog. It can be distributed to all of the shards. So the shards that you actually have the data can respond, but the data can then be combined and sent back to the client. So, it also creates the master copy of all the duplicate tables that are created in the shard database.

09:56

Lois: Ok. I’ve got it. Now, let’s talk more about the shards themselves.

Ron: Each shard is basically a database. And data is horizontally partitioned to be placed on each of these shards. So, this physical database is called the shard. And depending on the topology of your sharding, there could be user sharding, for example, where multiple keys are in a single shard or could be a system sharding that based on the hash value data is distributed whether singly or multiple data components across each shard.
Now, this is completely transparent to the application. So, as far as application is concerned, this is a single database and the response everything that they do is generally just operating as a single database interaction.

However, when it comes to the administrators, each shard is a separate database. Each shard can be managed independently and can have its own standby and other components that is then set up for high availability and management of the data operations.

11:21

Do you have an idea for a new course or learning opportunity? We’d love to hear it! Visit the Oracle University Learning Community and share your thoughts with us on the Idea Incubator.

Your suggestion could find a place in future development projects! Visit mylearn.oracle.com to
get started.

11:41

Nikita: Welcome back! Let’s move on to global services and the various sharding methods. Ron, can you explain what global services are and how they function in a sharded database?
Ron: Global services is generally the service that is used for the application to be able to connect to the sharded database. This is provided and supported through the shard director. So clients are routed using this global service.

12:11

Lois: What are the different sharding methods that are available?

Ron: When it comes to sharding methods that were available, originally we started with the system sharding, which is a hash partition, basically data is distributed evenly across the shards. Then we needed to allow for the user-defined sharding because sometimes it's not about just distributing the data evenly, it's also about controlling where the data goes to be able to control individual query execution based on the keys. And even for data sovereignty and position of the data itself.

And then a composite sharding, which provides you kind of a combination of the user-defined sharding and the system hash sharding that gives you a little bit of a combination of the two to better distribute your data across the shard.

And finally, sub-partitioning all types of sub-partitionings are supported to provide a better structure of the data depending on the application schema design.

13:16

Nikita: Ron, how do clients typically connect to a sharded database?

Ron: When it comes to the client connections, all the client connections are generally routed to the director and then managed from there. So there are multiple ways that clients can connect. One could be a direct connect. With a direct connect, they're providing the shard key in the request. Therefore, the director knowing the topology can route the client directly to the shard that has the data.

The proxy routing is done by the catalog. This is when generally a shard key is not provided or data is requested from many shards. So data will then request is then sent to the catalog. The catalog database will then distribute the query to the shards, collects the results, and then combine sending it back to the client acting as a proxy sitting in the middle.

And the middle tier routing, this is when you can expose the middle tier to the structure of your sharding. So when the middle tier send the request, the request identifies which shard the data is going to. So take advantage of that from the middle tier. So the data is then routed properly. But that requires exposing the structures and everything in the middle tier.

14:40

Lois: Let’s dive a bit deeper into direct routing. What are the advantages of using this method?

Ron: With the client request routing, as we talked about the direct routing, this allows the applications to get very quick data access when they know the key that is used for the distribution of the data. And that is used to access the data from the shard.
This provides you a direct connection to a shard from the shard director. And once the connection is established, then the queries can get data directly across the shard with the key that is supplied. So the RAC respond for that particular subset of data with the data request.

Now with the direct routing again, you get some advantages. The advantage is you have much better performance for capturing subset of the data because you don't have to wait for every shard to respond for a particular query.

If you want to distribute data geographically or based on the specific key, of course, all of that is perfectly supported. And kind of allows you to now distribute your query to actually the location where the data exists.

So for example, data that is in Canada can then be locally accessed in Canada through this direct access. And of course, when it comes to management of your client connection, load balancing of those connections. And of course, supporting all types of queries and application requests.

16:18

Nikita: And what about routing by proxy?

Ron: The proxy routing is when queries do not supply the actual sharding key, where identifies which shard the data reside. Or the actual routing cannot be properly identified. Then the shard director will send the request to the catalog performing the work as the proxy.

So proxy will then send a request to all of the shards. If any shards can be eliminated, would be. But generally all of the shards that could have any portion of the data will then get the request. The requests are then sent back to the proxy. And then the proxy will then coordinate the data going back and forth between the client.

And the shard catalog basically hands this type of data access to the catalog to act as the proxy. And then the catalog is-- the shard director is no longer part of the connection management since everything is then handled by the shard catalog itself.

17:37

Lois: Can you explain middle tier routing, Ron?

Ron: This generally allows you to use the middle tier to define which shard your data is being routed into. This is a type of routing that can be used where the data geographically have some sovereignty or the application is aware of the structure.
So the middle tier is exposed to the sharded database topology. So understand exactly what these components are based on the specific request on the shard key, then the middle tier can then route the application to the appropriate location for the connection.

And then the middle tier, and then the either one shard or the subset of shard will maintain those connections for the data access going back and forth since the topology is now being managed by the middle tier. Of course, all of the work that is done here still is known in the catalog, will be registered in the catalog. So catalog is fully aware of any operations that are going on, whether connection is done through middle tier or through direct routing.

18:54

Nikita: Ron, can you tell us how query execution and DDL operations work in a sharded database?

Ron: When it comes to the query execution of the application, there are no changes, no requirement for identifying specifically how the data is distributed. All of that is maintained behind the scene based on your sharding topology.

For the DDL, most of your tables, most of the structures work exactly the same way as it did before. There are some general structures that are associated to the sharded database that we will originally create and set up with mapping. Once the mappings are configured, then the rest of the components are created just like a regular database.

19:43

Lois: Ok. What about the deployment process? Is it complicated to set up a sharded database?

Ron: The deployment for the sharded database is fully automated using Terraform, Kubernetes, and scripts that are put together. Basically what you do is you provide some of your configuration information, structure of your topology through an input file, like a parameter file type of a thing.

And then you execute the scripts and then it will build everything else based on the structure that you have provided.

20:19

Nikita: What if someone wants to migrate from a non-sharded database to a sharded database? Is there support for that?

Ron: If you are going to migrate from a regular database to a sharded database, there are two components that are fully shard aware.

First, you have the Shard Advisor. This can look at your current structure, the schema, how the data is distributed. And the workload and how the data is used to give you recommendation in what type of sharding would work best based on the workload.

And then Data Pump is fully aware of the sharding component. Normally, we use Data Pump and load into each of the databases individually on its own. So instead of one job having to read all the data and move data across many shards, data can be loaded individually across each shard using Data Pump for much faster operations.

21:18

Lois: Ron, thank you for joining us today. Now that we’ve had a good understanding of Oracle Database Sharding, we’ll talk about the new 23ai features related to this topic next week.

Nikita: And if you want to learn more about what we discussed today, visit mylearn.oracle.com and search for the Oracle Database 23ai New Features for Administrators course. Until next week, this is Nikita Abraham…

Lois: And Lois Houston signing off!

21:45

That’s all for this episode of the Oracle University Podcast. If you enjoyed listening, please click Subscribe to get all the latest episodes. We’d also love it if you would take a moment to rate and review us on your podcast app. See you again on the next episode of the Oracle University Podcast.