Google’s Planet Scale Distributed Storage Patents

This past February, Google filed for a number of patents that describe aspects of how it might share information from one data center to another, and some of the challenges that entails. Google’s Yonatan Zunber, who revealed on his blog that over the past few months that he was the chief architect for Google’s social systems, including Google Plus, is one of the inventors listed on a number of the patents.

Just how are the nuts and bolts of Google’s data architecture pieced together, from its Web index to storing emails and photos, from user profiles and posts and responses in Google Plus to maps and photos and streetview images in Google Maps and Google Earth? Google has a number of data centers around the globe. How does the search giant efficiently move data from one data center to another? How does it back up the files and indexes that it uses? Where does all the user data go that Google collects when people search and browse the Web?

Google has shared some information about how they store and access data over the years in papers and articles like:

The patents provide some insights into how Google might manage moving large amounts of data from one location to another. For example, instead of sending a large amount of data from one data center to another, the system described in the patents might track changes to individual pieces of data as “deltas,” and those deltas are transmitted from one data center to another instead of sending all of the data.

If you’re interested in how Google may be attempting to handle the flow of data from one data center to another, you may want to spend some time with these patent filings.

Method and System for Efficently Replicating Data in Non-Relational Databases
Invented by Yonatan Zunger
US Patent Application 20110196827
Published August 11, 2011
Filed: February 9, 2010

Abstract

A method replicates data between instances of a distributed database. The method identifies at least two instances of the database at distinct geographic locations. The method tracks changes to the database by storing deltas.
Each delta has a row identifier that identifies the piece of data modified, a sequence identifier that specifies the order in which the deltas are applied to the data, and an instance identifier that specifies where the delta was created. The method determines which deltas to send using an egress map that specifies which combinations of row identifier and sequence identifier have been acknowledged as received at other instances. The method builds a transmission matrix that identifies deltas that have not yet been acknowledged as received.

The method then transmits deltas identified in the transmission matrix. After receiving acknowledgement that transmitted deltas have been incorporated into databases at other instances, the method updates the egress map.

Executing Replication Requests for Objects In A Distributed Storage System
Invented by Alexander Kesselman
US Patent Application 20110196836
Published August 11, 2011
Filed: February 9, 2011

Abstract

A system and method for executing replication requests for objects in a distributed database is provided. A plurality of replication requests for objects in a distributed storage system is received. The replication requests are partitioned into one or more replication queues. A respective replication queue includes replication requests that have a respective replication key. The respective replication key includes information related to at least a respective source storage device at which a respective object is located and a respective destination storage device to which the respective object is to be replicated.

For each respective replication queue, the replication requests in the replication queue are sorted based on priorities of the replication requests. Commands to execute a highest priority request are issued in each respective replication queue. When a respective replication request is completed, the respective replication request is deleted from the replication queue.

Storing Replication Requests for Objects In A Distributed Storage System
Invented by Alexander Kesselman;
US Patent Application 20110196834
Published August 11, 2011
Filed: February 9, 2011

Abstract

A system and method for storing replication requests for objects in a distributed storage system is provided. A plurality of replication requests for objects stored on storage devices in a distributed storage system is received. Respective row keys are generated for respective replication requests in the plurality of replication requests based on parameters of the respective replication requests.

The respective row keys include respective globally-determined priorities for the respective replication requests that facilitate sorting of the respective replication requests by priority. The respective replication requests are stored in respective records of a distributed database using the respective row keys, wherein the respective records of the distributed database are distributed across a plurality of nodes of the distributed database.

Method and System for Dynamically Replicating Data Within A Distributed Storage System
Invented by Alexandre Drobychev, Alexander Kesselman, Rebekah C. Vickrey, Frank C. Dachille, George Datuashvili
US Patent Application 20110196828
Published August 11, 2011
Filed: February 7, 2011

Abstract

A server computer at a first storage sub-system of a distributed storage system receives from a client a first client request for an object. If the object is not present in the first storage sub-system, the server computer identifies a second storage sub-system of the distributed storage system as having a replica of the requested object, the requested object including content and metadata.

The server computer submits an object replication request for the requested object to the second storage sub-system and independently receives the content and metadata of the requested object from the second storage sub-system. The server computer generates a new replica of the object at the first storage sub-system using the received metadata and content and returns the metadata of the new replica of the object to the client.

System and Method for Replicating Objects In A Distributed Storage System
Invented by Alexander Kesselman
US Patent Application 20110196873
Published August 11, 2011
Filed: February 7, 2011

Abstract

A system and method for inserting an object into a distributed database is provided. An object to be inserted into a priority queue is received, wherein the object includes a unique identifier and a priority. Next, an index for the object is generated. A row name for the object is then generated based on the index, the priority of the object, and the unique identifier of the object, wherein a lexicographical order of the row name for a higher priority object is smaller than the lexicographical order of the row name for a lower priority object. The object is then inserted into a row of a distributed database using the row name.

System and Method for Managing Replicas of Objects In A Distributed Storage System
Invented by Yonatan Zunger, Alexandre Drobychevm Alexander Kessleman, Rebekah C. Vickrey, Frank C.Dachille, and George Datuashvili
US Patent Application 20110196830
Published August 11, 2011
Filed: February 7, 2011

Abstract

A system and method for generating replication requests for objects in a distributed storage system is provided. Replication requests for objects in a distributed storage system are generated based at least in part on replication policies for the objects and a current state of the distributed storage system, wherein a respective replication request for a respective object instructs a respective instance of the distributed storage system to replicate the respective object so as to at least partially satisfy a replication policy for the respective object, wherein a respective replication policy includes criteria specifying at least storage device types on which replicas of object are to be stored. At least a subset of the replication requests is then distributed to the respective instances of the distributed storage system for execution.

Method and System for Managing Weakly Mutable Data In A Distributed Storage System
Invented by Yonatan Zunger, Alexandre Drobychevm Alexander Kessleman, George Datuashvili, and Zia Syed
US Patent Application 20110196838
Published August 11, 2011
Filed: February 9, 2011

Abstract

A method for managing multiple generations of an object within a distributed storage system is implemented at a computing device. The computing device receives metadata and content of a first generation of an object from a first client connected to the distributed storage system and stores the first generation’s metadata and content within a first storage sub-system. The computing device receives metadata and content of a second generation of the object from a second client connected to the distributed storage system and stores the second generation’s metadata and content within a second storage sub-system. The computing device independently replicates the first generation’s metadata and content from the first storage sub-system to the second storage sub-system and replicates the second generation’s metadata and content from the second storage sub-system to the first storage sub-system such that both storage sub-systems include a replica of the object’s first and second generations.

Executing Prioritized Replication Requests for Objects In A Distributed Storage System
Invented by Alexander Kesselman
US Patent Application 20110196835
Published August 11, 2011
Filed: February 9, 2011

Abstract

A system and method for executing replication requests for objects in a distributed storage system is provided. A replication queue is identified from a plurality of replication queues corresponding to a replication key. The replication key includes information related to at least a source storage device in a distributed storage system at which objects are located and a destination storage device in the distributed storage system to which the objects are to be replicated. A distributed database is scanned using an identifier of the replication queue to produce a list of replication requests corresponding to the replication queue. The records of the distributed database are distributed across a plurality of nodes of the distributed database. The replication requests in the list of replication requests are executed in priority order. Replication requests are deleted from the distributed database only when the replication requests are complete.

Operating On Objects Stored In A Distributed Database
Invented by Alexander Kesselman
US Patent Application 20110196882
Published August 11, 2011
Filed: February 9, 2011

Abstract

A system and method for operating on objects stored in a distributed database is provided. Rows of a distributed database that correspond to an index are identified. The identified rows are sorted lexicographically based on row names of the identified rows. The sorted rows are ordered by priorities of objects corresponding to the sorted rows. The objects corresponding to the sorted rows are operated on in priority order. In some embodiments, the objects are replication requests for replicating data in a distributed storage system, and operating on the objects corresponding to the sorted rows in priority order includes executing the replication requests in priority order to replicate data in the distributed storage system.

Method and System for Providing Efficient Access to a Tape Storage System
Invented by Rebekah C. Vickrey, Frank C.Dachille, Stefan V. Gheorghita, and Yonatan Zunger
US Patent Application 20110196829
Published August 11, 2011
Filed: February 8, 2011

Abstract

A method for asynchronously replicating data onto a tape medium is implemented at one or more server computers associated with a distributed storage system and connected to a tape storage system. Upon receiving a first request from a client for storing an object within the tape storage system, a server computer stores the object within a staging sub-system of the distributed storage system and provides a first response to the requesting client. If a predefined condition is met, the server computer transfers objects from the staging sub-system to the tape storage system. For each transferred object, the server computer adds a reference to the object to a tape management sub-system, identifies a corresponding parent object associated with the object and its metadata within a parent object management sub-system of the distributed storage system, and updates the parent object’s metadata to include the object’s location within the tape storage system.

Storage of Data In A Distributed Storage System
Invented by Alexandre Drobychev, Alexander Kesselman, Rebekah C. Vickrey, Frank C. Dachille, and George Datuashvili
US Patent Application 20110196900
Published August 11, 2011
Filed: February 8, 2011

Abstract

A distributed storage system stores data for files. A first blob (binary large object) of data is received. The first blob is split into one or more first chunks of data. Content fingerprints for the first chunks of data are computed. The first chunks of data are stored in a chunk store while and their content fingerprints are stored in a store distinct from the chunk store. A second blob of data is received. The second blob is split into one or more second chunks of data. Content fingerprints for the second chunks of data are computed. Then for a second chunk of data whose content fingerprint matches a content fingerprint of a first chunk of data, a second reference to the corresponding first chunk of data that has a matching content fingerprint is stored, but the second chunk of data is not stored.

Storage of Data In A Distributed Storage System
Invented by Alexandre Drobychev, Alexander Kesselman, Rebekah C. Vickrey, Frank C. Dachille, and George Datuashvili
US Patent Application 20110196833
Published August 11, 2011
Filed: February 8, 2011

Abstract

A distributed storage system has multiple instances. There is a plurality of local instances, and at least some of the local instances are at physically distinct geographic locations. Each local instance is configured to store data for a non-empty set of blobs in a plurality of data stores having a plurality of distinct data store types. In addition, each local instance stores metadata for the respective set of blobs in a metadata store distinct from the data stores. There is also a plurality of global instances. Each global instance is configured to store data for zero or more blobs in zero or more data stores and store metadata for all blobs stored at any local or global instance. The system selects one global instance to run a replication module that replicates blobs between instances according to blob policies. Some systems also include dynamic replication based on user needs.

Method and System For Uploading Data Into A Distributed Storage System
Invented by Yonatan Zunger, Alexander Kesselman, and Alexandre Drobychev
US Patent Application 20110196822
Published August 11, 2011
Filed: February 8, 2011

Abstract

A method for uploading an object into a distributed storage system is implemented at a computing device The computing device splits an object into one or more chunks and uploads the one or more chunks into the distributed storage system. For each uploaded chunk, the computing device receives a write token from the distributed storage system, inserts an entry into an extents table of the object for the chunk in accordance with the received write token and the chunk ID, chunk offset, and chunk size of the chunk, generates a digest of the extents table, the digest representing the one or more chunks that the client expects to be within the distributed storage system, and sends the digest of the extents table to the distributed storage system. The distributed storage system is configured to use the digest to determine whether it has each of the one or more client-expected chunks.

System and Method for Determining the Age of Objects in the Presence of Unreliable Clocks
Invented by Alexander Kesselman, Alexandre Drobychev, and Daniel J. Ford
US Patent Application 20110196901
Published August 11, 2011
Filed: February 7, 2011

Abstract

A system and method for determining an age of an object is provided. A first index for a timestamp entry in a sequence of timestamps corresponding to a time at which an object was created is identified. At least one subsequence of timestamps from the sequence of timestamps having indexes for entries in the sequence of timestamps that are between the first index in the sequence of timestamps and a last index for a last timestamp entry in the sequence of timestamps is identified, wherein the at least one subsequence of timestamps conforms to a function of a time interval between storage of consecutive current timestamps reported by clock of the computer system. Timestamps from the sequence of timestamps that are not included in the at least one subsequence of timestamps are removed. An age of the object is determined based on the at least one subsequence of timestamps.

Location Assignment Daemon (LAD) For A Distributed Storage System
Invented by Yonatan Zunger, Alexander Kesselman, Alexandre Drobychev, Rebekah C. Vickrey, Frank C. Dachille, and George Datuashvili
US Patent Application 20110196832
Published August 11, 2011
Filed: February 7, 2011

Abstract

A system and method for generating replication requests for objects in a distributed storage system is provided. For a respective object in a distributed storage system the following is performed. Replication policies for the object that have not been satisfied are determined. Replication requests are ranked for the object whose replication policies have not been satisfied based on a number of replicas of the object that need to be created in order to satisfy the replication policies for the object. Replication requests are generated for the object based at least in part on the replication policies for the object that have not been satisfied and on a current state of the distributed storage system. At least a subset of the replication requests for the objects in the distributed storage system are distributed to respective instances of the distributed storage system corresponding to the replication requests for execution.

Location Assignment Daemon (LAD) Simulation System and Method
Invented by Yonatan Zunger, Alexander Kesselman, Alexandre Drobychev, Rebekah C. Vickrey, Frank C. Dachille, and George Datuashvili
US Patent Application 20110196664
Published August 11, 2011
Filed: February 7, 2011

Abstract

A system and method for simulating a state of a distributed storage system is provided. A current state of a distributed storage system and replication policies for the objects in the distributed storage system is obtained. Proposed modifications to the current state of the distributed storage system are received. The state of the distributed storage system is simulated over time based on the current state of the distributed storage system, the replication policies for the objects in the distributed storage system, and the proposed modifications to the current state of the distributed storage system. Then reports relating to the time evolution of the current state of the distributed storage system are generated based on the simulation.

Pruning of Blob Replicas
Invented by Yonatan Zunger, Alexander Kesselman, Alexandre Drobychev, Rebekah C. Vickrey, Frank C. Dachille, and George Datuashvili
US Patent Application 20110196831
Published August 11, 2011
Filed: February 7, 2011

Abstract

A system and method generating and distributing replica removal requests for objects in a distributed storage system is provided. Replica removal requests for objects in a distributed storage system are generated based at least in part on replication policies for the objects. A respective replica removal request instructs a respective instance of the distributed storage system to remove a respective replica of the respective object so as to at least partially satisfy replication policies for the respective object. Then the replica removal requests for the objects in the distributed storage system are distributed to respective instances of the distributed storage system corresponding to the replica removal requests for execution.

Share

4 thoughts on “Google’s Planet Scale Distributed Storage Patents”

  1. When you think of how much data Google has, it is insane. The amount of research compiled in billions of images, search patterns/usage, text, geographic data, etc. They are certainly the leader in user data and how people interact on the web, we’ll see where a lot of their social media heads in the future. It’s going to be very interesting!

  2. Hi Mason,

    I’ve been wondering for a few years how Google might transfer it’s index and updates and information from different services from one data center to another, and hadn’t really considered the need to do so as quickly as possible so that people can connect to that information from the data center closest to them.

    The video I linked to in the post, “Transactions Across Datacenters (and Other Weekend Projects)” is definitely worth watching when it comes to learning more about the real life challenges in sending data on a very large scale to different places across the planet.

  3. Rather selfishly I’ve never really given much thought to the way in which Google transfers data across it’s data centres. It’s certainly an interesting proposition though – I can imagine the need to continually develop different ways in which to make the process more efficient and I can certainly sympathise with the engineers responsible for delivering the system!

    The prioritised replication requests patent sort of reminds me of what i think was called Quantum Field Theory – the replication of particle movement over long distances – but I guess we’re a long way from harnessing that one!

    Thanks for the post Bill – you’ve managed to get me thinking about everything from Quantum Physics to the computational benefits of arrays!

    Tom

  4. Hi Tom,

    I was reading a series of blog posts from someone who was upset because Comcast suspended his internet access account a few weeks ago because of the very high levels of bandwidth he was using. He shared his internet access with a number of roommates, and they did use streaming music and videos services frequently, but not to the point where he thought it was outrageous. Then he started looking closer at some of the other things that he was doing. He had started storing all his music files on the Web, and his digital photographs, in an uncompressed RAW format. He was also using a cloud backup service to backup files on his hard drive on a regular basis. All the transfers of data were counting in his bandwidth usage – using the Web to store all that data instead of his hard drive, or in addition to it was something that he hadn’t considered being the kind of traffic that Comcast might be monitoring.

    I’ve taken it for granted how much data might have to be moved from one data center to another by Google, and the costs that might be associated with that. I know that Google has been investing in fiber for years, to make the movement of data affordable, but until I read these patents, I really didn’t think much about that either.

    I’ve added to my vocabulary of terms regarding multi data center sized storage systems with this one too. Search for blobmaster at Google, and it wants to show results for blob monster today. A few days ago, it was trying to show me search results for blog master instead.

    Patents like these are interesting to me because they make me think of Google as just another company trying to run a website, albeit with some special and unique problems that they need to overcome.

Comments are closed.