Algorithms and Data Structures: TheAlgorist.com
System Design: DistributedComputing.dev
Low Level Design: LowLevelDesign.io
Frontend Engineering: FrontendEngineering.io
General advice for System Design Interview:
Be a Consultant – imagine your interviewer is a customer you’ll be building something for, consider a tangible system, from requirements to post-production, and what that means.
- Consider the customer – is this a good experience? What purpose does it serve?
Drive the conversation by asking great questions and validating assumptions.
Know Why – Defending design tradeoffs is critical.
Interesting Discussion – show advanced or unique edge cases – validate your assumptions and
ask the right questions.
In this chapter we will learn how to generate Unique IDs at scale in a distributed system.
Some very important applications of Unique ID Generators are:
Key Generators (Key-Gen) for generating activation code for software like
Anti-virus Software, Microsoft Office Subscription Key etc.
Twitter, Discord, Instagram and other social networking systems need to create unique IDs for tweets, posts, users,
messages and all other objects that need identification.
- Generating unique IDs is a requirement in many distributed systems. Generating bank account numbers, image ID, file ID, and user ID for large scale distributed systems will need a random ID generator. URL shorteners, such as TinyURL, also use a random ID generator to assign unique IDs to create each short URL.
IDs must be unique.
IDs may contain both numeric values and alphabets.
The system should be able to generate 1 Million IDs per second.
Our service needs to be very low latency, we are talking in millis (milliseconds)
if not micros (microseconds).
- We need to support 1 Million concurrent requests.
Algorithms to generate Unique IDs:
Approach #1: Single Database Server Auto-increment:We can use a centralized auto_increment feature in a single database server to create unique IDs.
This approach is very easy to implement and it works for small to medium scale applications.
This design is a bad idea due to single point of failure introduced in this design for having
dependency on only one database server.
One single database system will not be able to
support 1 Million concurrent connections.
- Depending on one database server to create all IDs will affect the latency.
Approach #2: Multi-instance Database Server Auto-incrementWe can try to address the SPOF (Single Point of Failure) by adding several instances of Database servers. Say, we have N database servers. In this approach instead of increment auto_increment id by 1 we will increment auto_increment id by N. This solves some scalability issues because IDs can scale with the number of database servers, but this approach has some major drawbacks as deiscussed below:
- Hard to scale with multiple data centers.
- It does not scale well when a server is removed or a new server is added.
Approach #3: UUID or GUIDA universally unique identifier (UUID) is a 128-bit label used for information in computer systems. The term globally unique identifier (GUID) is also used. While the probability that a UUID will be duplicated is not zero, it is generally considered close enough to zero to be negligible. After generating 1 Billion UUIDs per second for 85 years, probability of creating a single duplicate reaches 50%. Thus, anyone can create a UUID and use it to identify something with near certainty that the identifier does not duplicate one that has already been, or will be, created to identify something else. Information labeled with UUIDs by independent parties can therefore be later combined into a single database or transmitted on the same channel, with a negligible probability of duplication. Adoption of UUIDs is widespread.
Exmple of a UUID or GUID is 123e4567-e89b-12d3-a456-426614174000.
Unlike Approach #2, this solution also works well when our ID Generator app is deployed in multiple data centers.
This is a simple yet scalable solution, if we are okay with the below constraints:
- UUIDs or GUIDs are pretty long (128 bits long).
- If your requirement is to have only numbers, this solution will not work since it contains alphanumeric. This works for us since we are okay with having both numbers and alphabets in our IDs.
- If your requirement is such that IDs should go up with time, then this solution will not work. This is not a problem for us since we do not have such requirement.
Approach #4: Decoupled, Highly Scalable ID Generator System
ID generators and Range Generators: In this approach we will have multiple instances of ID generator and each of these instance will be able to generate IDs in a certain range, for example: say we have 2 instances: instance1 generates IDs in the range, say 1 - 10000, instance2 generates IDs in the range of 10001 - 20000. Once instance1 exhausts the range 1 - 10000, and assuming instance2 still hasn't exhausted the range 1 - 10000, instance1 gets assigned the range 20001 - 30000. While generating IDs, each ID Generator instance starts from the low end of the assigned range and increments by 1 each time it has to create a new ID.
The last assigned range are stored in a highly available and durable cache. Any time an instance needs a new range, it does a look-up in the cache to see what the last assigned range was, and then self-assigns the next range, and updates the cache with the range it self-assigned as the updated last assigned range. We can just store the end of the range. If the self-assigned range is [20,001 - 30,000], we store 30,000. Now, if you think deeply you will figure that this whole process is not trivial in large system where there are multiple ID Generator instances and more than one instances are doing a look-up at the same time. Looking up the last used range, self-assigning a range and update the cache with the last used range should be one atomic transaction. This may cause race condition since multiple ID Generator instances can try to these operations at the same time. It would make sense to assign single responsibility to each component and decouple Range generation and ID generation. So we create a decoupled Range Generator component which has a single responsibility of range generation. Also, keep in mind that the ID Generator component will get huge number of requests and will be busy catering to the ID generation requests. It does not make sense to get the system overloaded by adding range generation reponsibility to it, when we are building a system of this large of a scale, keeping high availability and low latency in mind. Range Generator will get a request only when an ID Generator instance needs a new range. So this component will get way less requests than the ID Generator component. Since these two components are decouples, they both can scale independently as per their own scale need.
Whenever an ID Generator instance needs a new range, it calls Range Generator API and get a new range.
One more thing to consider here is whether each ID Generator instance should store the last generated ID in cache or not. This will help if a particular ID Generator instance goes down, it will know what was the last generated id so that it can increment it by one and gets the next ID and so on. However, to make a cache API call to store the generated ID for every ID Generation Request would be very expensive and will impact the latency. So it would be better to not worry about storing the last generated ID. So what are the cons for this? The drawback would be if an ID Generator instance goes down there will be no way to know about the unused IDs from the assigned range and those IDs will be lost forever and there will beno way to recover them. Now, keep in mind that we are working with Trillions of numbers here. If one range of say 10,000 numbers (or whatever the lengths of the ranges are) get lost out of Trillions of numbers, that is equivalent to one drop of water from an ocean. Also, instances going down is not a frequent incident. In a stable, well-architected, well-maintained system it should not happen frequently.
Approach #5: Highly Scalable ID generation Solution inspired by Twitter Snowflake approach
We can have multiple instances of ID Generators and each instance will create an ID using the following schema:
-------------------------------------------------------- timestamp | datacenter id | machine id | local id --------------------------------------------------------
timstamp is the current timestamp when an ID generator instance gets an id generation request.
datacenter id is identifier of the data center where the id generator instance is located.
machine id is the machine identifier of the instance. Every machine has its own identifier.
local id is an id that the instance creates locally for every request. This can be as simple as incrementing by one with every request.
Every instance can have a range and once the generated id reaches the higher end of the range it can reset to the lower end of the range.
The length of the range needs to be greater than the number of requests received per timestamp.
If you have any feedback, please use this form: https://thealgorists.com/Feedback.