Lets learn "About TinyURL/bitly system design"

In this tutorial, we'll learn about the system design for a URL shortener. A URL shortener is a service that basically shortens the URL to a fixed size. There are different services available on the internet to do that like tinyURL or bitly.

At first thought, it seems it is very simple. We basically have one DB containing a mapping of INPUT URL -> SHORTENED URL. When the user requests to convert a URL to a shorted URL it hits the service and the service converts it to a short URL and stores the mapping to DB and when the user provides a shortened URL and service will lookup in DB and provides the actual URL. The high-level idea is correct but there are some issues with this approach on the scale. In this tutorial, we'll discuss the scalable system design for URL shortener service.

APIs needed for URL shortener

1. /createShortURL - This API will take a long URL as an input and returns the short URL.

2. /getLongURL - This API will take a short URL as an input and returns the corresponding long URL.

Assumption

There are a few assumptions we are making before proceeding to the design.

Our service should be work with the daily traffic of 1M (1 million) requests.
The shortened URL is of form ll.com/<7 chars>.

With 7 alphanumeric characters in the URL, the number of shortening URLs we can generate is 62^7 = 3.5 Trillion unique characters. Why 62? Because every character can be A-Z or a-Z or 0-9 ( 26 + 26 + 10 = 62 ).

Our service is horizontally scalable.

Data Model

Let's learn about the data model, how each entry in DB will look like.

For each entry in DB, we'll be reserving this much space

<INPUT URL> - 2Kb ( 2048 chars )

<SHORT URL> - 7b (we'll only be storing last 7 chars, from ll.com/<7chars>

<CREATED ON> - 7b ( Each byte consists of 2 packed decimal digits. The first 4 bytes represent the date, the next 3 bytes the time, we won't be storing fraction of second to save space )

<EXPIRED ON> - 7b

TOTAL = 2.021Kb

Deciding the database type

Now that we have designed our DB, let see how much space we need in a year based on the above assumption.

1Million entries per day -> 30Million entries per month

So 30Million x 2.021 = 60.63Gb/month

So 60.63 x 5 = 3.03Tb/year of storage we need.

We have two choices either we can use RDBMS or we can use NoSQL.

RDBMS:

Advantages:

A - Atomicity

C - Consistency

I - Isolation

D - Durability

Disadvantages:

Not scalable. Though sharding is possible but it's complicated

NoSQL

Advantages:

Eventually Consistency

Easily scalable

High availability

Redundancy

From the above theory, the main requirement is our design should be scalable so we'll be using the NoSQL database as our database type.

Unique <7Chars> ID generation

Let's discuss the different approaches to generate the 7 characters long unique alphanumeric id from a long URL. The approach we follow should ensure that

7 characters should be unique for each long URL.
There must be no conflicts.

Approach 1 ( Checksum )

The first thing that can come into anyone's mind is to use a checksum (Say md5) of the input URL. But the problem is md5 output 16 chars long. One can argue that we'll only be using the first 7 characters of md5 checksum. But the problem here is md5 checksum ensures the whole output cannot conflict for two different input but the first 7 characters can be the same for two different inputs. Hence this approach doesn't respect the second condition which is " THERE MUST BE NO CONFLICTS "

Approach 2 ( Base 62 encoding )

As we have discussed before that in the 7 alphanumeric characters we need to generate, every character can have one out of 62 possible values. Why 62? Because every character can be A-Z or a-Z or 0-9 ( 26 + 26 + 10 = 62 ). And the number of such possible combinations is 62^7 = 3.5 Trillion.

What we can do is generate a random number for each URL and perform base62 encoding of that and then store the mappings. Below is the code to do base62 encoding.

#include <iostream>
#include <string>
#include <map>
using namespace std;

map<long long int, char> charMap;

string base62_encode(long long int num) {
    string encoded_string = "";
    for(int i=0;i<7;i++) {
        encoded_string = charMap[num%62] + encoded_string;
        num = num/62;
    }
    return encoded_string;
}


int main()
{
    // 0-25 (a-z) 26-51 (A-Z) 52-61 (0-9)
    for(int i=0; i<62; i++){
        if(i < 26) {
            charMap[i] = 'a'+i;
        } else if(i < 52) {
            charMap[i] = 'A'+i-26;
        } else{
            charMap[i] = '0'+i-52;
        }
    }
    
    for(long long int i=0;i<20;i++){
        cout<<base62_encode(i)<<endl;
    }

    return 0;
}

But the problem with this approach is that what if two requests come at the same time and we end up generating the same random number for two URLs. In that case, we'll be having the same shortened URL for two different URLs that we don't want. One can argue that to solve this issue while saving to DB we can use COMMITIFNOTSET which means save only if there is another entry available with the same value (i.e shortened URL) But the thing is this feature is available in RDBMS only, and since we'll be using NoSQL. Hence this approach does not satisfy the condition " THERE MUST BE NO CONFLICTS ".

Approach 3 ( Base62 encoding with counter ) [ PREFERED ]

Basically, this approach is similar to the 2nd approach but the difference is we'll be using a global counter instead of choosing a random number. This counter is synchronized and it will ensure that for every request coming to a new counter value will be generated and we'll use that base62 encode that counter value to generate the last 7 characters of the shortened URL. This approach works great.

An example of how this work:

1. Assume counter values in the current state is 150.

2. A URL shortener request comes to App server 1, It contacts the counter service to give a counter value. Our counter will return 151. App Server 1 will base62 encode 151 and store the URL mapping to encoded value in DB.

3. A new URL shortener request comes to App server k, It contacts the counter service to give a counter value. Our counter will return 152. App Server 1 will base62 encode 152 and store the URL mapping to encoded value in DB.

4. And so on...

But the problem here is Counter server is a single point of failure. If counters service fails, then our URL shortener service will go down. Hence we need to improve our counter service. Let's discuss how.

At first thought, what comes to mind to solve this problem is to keep multiple counters service. We'll split the possible values that counter service can return across different counter service and each set of app servers will contact one of the counter services to get the next counter.

But the problem with this approach that if the traffic on some of the app servers connecting to one of the counter services is more compared to others, then that counter server can go out of its limit and there is no way to reset the counters. Also what if a new app server is added or existing app server is removed. How we'll be doing the re-assignments.

The way to solve the above issues is to use Zookeeper.

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Zookeeper is basically coordination service. When a new node (app server is added ) it automatically detect that and assign it an ID.

Zookeeper can easily detect any app server going down or and new app server getting added.

The above image explains our final approach using zookeeper. The above approach will give a high performance and collision-free URL shortener.

Details of the above approach.

We will be having a zookeeper cluster keeping all the free slots available { used: false } means its free.
When our application starts, it registers itself with the zookeeper cluster and requests for the free counter slot. Zookeeper will provide the free slot available and set { used: true } for that particular slot. Now the app server will save the counter locally for incoming URL shortener requests.
If an app server goes out of limit it's local counter it can request a zookeeper cluster for the new slot.
This approach will take care of the new app server addition since the zookeeper is responsible for providing a new slot to the newly added app server.
After getting the counter value, the app will base62 encode the counter value and store the mapping to DB as well as cache.
The cache is basically to reduce the DB reads.
Now because we know that the counter values are unique, we can directly put data in DB without checking if that shortened URL is already present which gives us high performance and collision-free URL shortener.

THANKS FOR READING THIS TUTORIAL. FEEL FREE TO COMMENT BELOW IF YOU HAVE ANY DOUBTS. AND STAY TUNED FOR MORE TUTORIALS :)

Search This Blog

Lets learn