Let's learn "System design for Pastebin (or any text sharing website)"

 

In this tutorial, we'll learn about the system design for Pastebin. If you don't know what Pastebin is, here is its definition.

Pastebin is a type of online content hosting service where users can store text to share with others over the internet.

Pastebin system design requirements 

Functional

  1. Basic - Users should be able to paste text, generate a link, and share that link with other users.
  2. There should be a maximum size for the text that our system supports (say 10 MB).
  3. Users should be able to provide a custom URL path for the link. For example, if a user wants the link www.pb.com/letlearn, they can supply this path and Pastebin should create the shareable link with it.
  4. Each stored paste should have an expiry feature. There can be a default expiry time, say 1 year, but it should be configurable.
  5. Personalization - Users should be able to log in to track and manage all the pastes they have created.
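Requirements 2, 3 and 4 can be sketched as a small validation step. This is only an illustration: the names validate_paste and expiry_time, the alias rule (3 to 32 letters, digits, '-' or '_'), and reading the size limit as 10 megabytes are assumptions, not part of the original design.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_PASTE_BYTES = 10 * 1000 * 1000   # size limit from requirement 2, read as 10 MB
DEFAULT_TTL = timedelta(days=365)    # default expiry of 1 year (requirement 4)
# Assumed rule for custom URL paths: 3-32 letters, digits, '-' or '_'.
ALIAS_PATTERN = re.compile(r"[A-Za-z0-9_-]{3,32}")


def validate_paste(text: str, alias: Optional[str] = None) -> None:
    """Raise ValueError if the paste breaks the size or custom-path rules."""
    if len(text.encode("utf-8")) > MAX_PASTE_BYTES:
        raise ValueError("paste exceeds the maximum supported size")
    if alias is not None and not ALIAS_PATTERN.fullmatch(alias):
        raise ValueError("custom URL path has an unsupported format")


def expiry_time(created_on: datetime, ttl: Optional[timedelta] = None) -> datetime:
    """Return the expiry timestamp, falling back to the default TTL."""
    return created_on + (ttl if ttl is not None else DEFAULT_TTL)


# Usage: a paste with the custom path from the example above.
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
validate_paste("some shared text", alias="letlearn")
expires = expiry_time(created)                        # default: one year later
short_lived = expiry_time(created, timedelta(hours=1))
```

A real service would also have to check that the custom path is not already taken, which needs a lookup in the DB.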

Non-Functional

  1. Durability - Once data is written, it should remain available until its expiry time.
  2. Redundancy - Data should be stored redundantly so that if one storage device fails, we do not lose the data stored on it.
  3. Latency - When a URL is hit to fetch a paste, the response should reach the end user quickly.

Assumptions

Before designing, let's make some assumptions on which we'll base our Pastebin design.

Reads and Writes

  1. Number of pastes written per day - 100K
  2. Number of pastes read per day - 10 x number of pastes written per day = 1000K (our system is read-intensive)
  3. Max size of each paste - 10 MB (average size can be taken as 100 KB)

Number of requests exchanged with our system via the network

  1. Average pastes to write per second - 100K/24/3600 ≈ 1.16 pastes/sec
  2. Average pastes to read per second - 1000K/24/3600 ≈ 11.6 pastes/sec

Data Storage and Bandwidth

  1. Number of bytes written per second = Average pastes to write per second x Worst case size of paste = 1.16 x 10 MB ≈ 11.6 MB/sec ≈ 1 TB/day = 365 TB/year
  2. Number of bytes read per second = Average pastes to read per second x Worst case size of paste = 11.6 x 10 MB ≈ 116 MB/sec

We'll need around 365 TB of storage per year under the above assumptions. This is a worst-case upper bound; with the 100 KB average paste size, the figure drops to about 3.65 TB per year.

From the above assumptions, it is clear the read-to-write ratio is 10:1. 
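The back-of-the-envelope arithmetic above can be reproduced with a short script. It is a sketch using only the stated assumptions (100K writes per day, 10 reads per write, 10 MB worst-case paste size) and decimal units (1 TB = 1,000,000 MB); the variable names are illustrative. The script prints unrounded values.

```python
SECONDS_PER_DAY = 24 * 3600

writes_per_day = 100_000
reads_per_day = 10 * writes_per_day      # read-to-write ratio of 10:1
max_paste_mb = 10                        # worst-case paste size in MB

writes_per_sec = writes_per_day / SECONDS_PER_DAY
reads_per_sec = reads_per_day / SECONDS_PER_DAY

# Worst-case bandwidth: every paste is assumed to have the maximum size.
write_mb_per_sec = writes_per_sec * max_paste_mb
read_mb_per_sec = reads_per_sec * max_paste_mb

# Worst-case storage per year, in TB (1 TB = 1,000,000 MB).
storage_tb_per_year = writes_per_day * max_paste_mb * 365 / 1_000_000

print(f"writes/sec:       {writes_per_sec:.2f}")
print(f"reads/sec:        {reads_per_sec:.2f}")
print(f"write bandwidth:  {write_mb_per_sec:.1f} MB/sec")
print(f"read bandwidth:   {read_mb_per_sec:.1f} MB/sec")
print(f"storage per year: {storage_tb_per_year:.0f} TB")
```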

Database Table Structure

A paste can be as large as 10 MB, so we need not store the full content in one blob. We can choose a block size and split the paste content into blocks. On a read request, we return the first block immediately as a preview and load the remaining blocks in the background, which gives a better user experience. Only the first block is stored in the DB; the rest of the content is saved on a storage server.
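The block-splitting idea can be sketched as follows. The 64 KB block size and the function names are assumptions made for illustration; the design above leaves the block size open.

```python
from typing import List, Tuple

BLOCK_SIZE = 64 * 1024   # assumed block size (64 KB); the design does not fix a value


def split_into_blocks(content: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split the paste content into consecutive blocks of at most block_size bytes."""
    return [content[i:i + block_size] for i in range(0, len(content), block_size)]


def preview_and_rest(content: bytes, block_size: int = BLOCK_SIZE) -> Tuple[bytes, List[bytes]]:
    """Return the first block (kept in the DB) and the remaining blocks (kept on the storage server)."""
    blocks = split_into_blocks(content, block_size)
    if not blocks:
        return b"", []
    return blocks[0], blocks[1:]


# Usage: a 150 KB paste yields a 64 KB preview plus two more blocks.
paste = b"x" * (150 * 1024)
preview, rest = preview_and_rest(paste)
```

Because the split is done on bytes, a block boundary can fall in the middle of a multi-byte UTF-8 character, so a real preview would need to trim to a character boundary before being displayed.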

The Paste table in the DB will look like:

PasteID | Content | Storage Server Link | CreatedOn | ExpiredOn


The Users table in the DB will look like:

UserID | UserName | CreatedOn | Metadata (optional)
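As a rough sketch, the two tables could be declared as below. SQLite is used only because it ships with Python; the snake_case column names, the column types, and the ISO-8601 text timestamps are assumptions, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pastes (
    paste_id            TEXT PRIMARY KEY,   -- PasteID: key from the key generator
    content             TEXT NOT NULL,      -- Content: first block (preview) only
    storage_server_link TEXT,               -- Storage Server Link: NULL for small pastes
    created_on          TEXT NOT NULL,      -- CreatedOn, ISO-8601 timestamp
    expired_on          TEXT NOT NULL       -- ExpiredOn, ISO-8601 timestamp
);
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,         -- UserID
    user_name  TEXT NOT NULL UNIQUE,        -- UserName
    created_on TEXT NOT NULL,               -- CreatedOn
    metadata   TEXT                         -- Metadata (optional)
);
""")

# Usage: store a small paste and read its preview back by key.
conn.execute(
    "INSERT INTO pastes VALUES (?, ?, ?, ?, ?)",
    ("abc123", "hello", None, "2024-01-01T00:00:00Z", "2025-01-01T00:00:00Z"),
)
row = conn.execute(
    "SELECT content FROM pastes WHERE paste_id = ?", ("abc123",)
).fetchone()
```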


System Design



Let's talk about the individual components briefly.

1. API Gateway: The API gateway is the entry point for every API request. From the gateway, a request is routed to one of the read servers if it hits a read endpoint; otherwise it is routed to one of the write servers, because the request wants to write a new paste.
2. Read Paste: A set of servers/containers that serve read requests. They can use a cache to speed up responses for frequently requested pastes.
3. Write Paste: A set of servers/containers that serve write requests. There are fewer write servers than read servers because reads are more frequent than writes. A write server interacts with the key generator to obtain a unique fixed-size key for the paste being written to the DB. It also interacts with the storage server to store large pastes. Only the preview of the paste is written to the DB.
4. DB: The database of the system. It stores the metadata of each paste and the preview of its content.
5. Key Generator: For every write request we need to generate a fixed-size, non-predictable key. The key generator is responsible for that. It can use the same method as the TinyURL system design discussed here.
6. Cache: Read paste uses the cache to return the content of frequently requested pastes in less time.
7. Storage Server: As discussed, the DB stores only the preview of the paste content. The full content is stored on a storage server, and the DB keeps a pointer to the corresponding file on that server.
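The interaction between these components can be sketched in a single process, with dictionaries standing in for the DB and the storage server. Everything here is illustrative: the function names, the 1 KB preview size, the 8-character key, the LRU eviction policy, and the random-key-with-collision-check generator are assumptions, and the latter is not necessarily the method used in the TinyURL design.

```python
import secrets
from collections import OrderedDict
from typing import Dict, Optional

PREVIEW_BYTES = 1024   # assumed preview size kept in the DB
KEY_ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

db: Dict[str, dict] = {}         # stand-in for the DB (metadata + preview)
storage: Dict[str, bytes] = {}   # stand-in for the storage server


def generate_key(length: int = 8) -> str:
    """Key Generator: fixed-size, non-predictable key, retried on collision."""
    while True:
        key = "".join(secrets.choice(KEY_ALPHABET) for _ in range(length))
        if key not in db:
            return key


class LRUCache:
    """Cache: keeps the most recently read pastes, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key: str, value: bytes) -> None:
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry


cache = LRUCache(capacity=1000)


def write_paste(content: bytes) -> str:
    """Write Paste: preview goes to the DB, large pastes also go to storage."""
    key = generate_key()
    link = None
    if len(content) > PREVIEW_BYTES:
        link = f"blob/{key}"
        storage[link] = content              # full content on the storage server
    db[key] = {"content": content[:PREVIEW_BYTES], "storage_link": link}
    return key


def read_paste(key: str) -> Optional[bytes]:
    """Read Paste: try the cache first, then the DB and the storage server."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    row = db.get(key)
    if row is None:
        return None
    link = row["storage_link"]
    content = storage[link] if link else row["content"]
    cache.put(key, content)
    return content


def handle(method: str, key: str = "", body: bytes = b"") -> Optional[bytes]:
    """API Gateway: route reads to the read path and writes to the write path."""
    if method == "GET":
        return read_paste(key)
    if method == "POST":
        return write_paste(body).encode()
    raise ValueError(f"unsupported method: {method}")
```

For example, handle("POST", body=b"some text") returns the new key as bytes, and handle("GET", key=...) with that key returns the stored content, served from the cache on repeated reads. Expiry checks and user ownership are left out to keep the sketch short.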


Hopefully, it's now clear how a Pastebin service works, what its components are, how they interact, and how data flows between them.

HOPE YOU LIKE THIS TUTORIAL. FEEL FREE TO COMMENT BELOW IF YOU HAVE ANY DOUBTS. AND STAY TUNED FOR MORE TUTORIALS :)

