Imagine you've been given a major project. To grasp its scale: the platform already has 100 million users, and X new users join every day. Your task is to check whether a given username already exists in the system when new users sign up.
-- Returns 1 if the username 'john' exists, otherwise returns 0.
SELECT EXISTS(SELECT 1 FROM Users WHERE username = 'john');
If your first instinct is to write a query like this, hold on! This post will help you optimize the lookup process from O(n) to O(1). So, let’s dive in! ✨
Understanding the Problem
Without an index on the username column, SQL checks the data row by row, following an iterative approach: a full table scan. Before performing the check, the database also needs to load the relevant data pages into RAM. Even though databases manage memory efficiently, loading large datasets still affects performance.
Now, with 100 million users, that means up to 100 million rows to scan, and the number keeps growing. For N rows, the lookup complexity is O(n). Checking all rows every time a user picks a username is inefficient. We need a better approach—let’s start by exploring caching.
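Conceptually, the full scan boils down to a linear search like this minimal Python sketch (not what the database engine literally executes):

def username_exists(usernames, target):
    # Compare the target against every stored name, one by one:
    # up to n comparisons, hence O(n).
    for name in usernames:
        if name == target:
            return True
    return False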
Caching: Is It a Solution?
Caching is great for rarely changing and frequently accessed data. For example, imagine building a stock market platform like Bloomberg, where stock prices change every second. If you cache a page, users will see outdated prices instead of real-time updates.
Since cached data is stored in fast memory, servers can return results quickly. In our case, however, we have millions of records, and the set of usernames grows with every sign-up, so cached results would constantly go stale. This makes caching an unreliable solution here. We need a faster lookup method.
Binary Search: A Step Forward?
Since scanning every row is too slow, we consider Binary Search, one of the fastest search algorithms. With 100 million records, Binary Search needs at most ⌈log₂(100,000,000)⌉ = 27 comparisons to find a result, which is incredibly fast.
However, Binary Search requires the data to be sorted. Every time a new user registers, we would have to keep the data in order: re-sorting from scratch takes O(n log n) even with QuickSort, and even inserting into an already-sorted array means shifting elements, which costs O(n) per insert.
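For intuition, here is a minimal Python sketch using the standard-library bisect module (the sample usernames list is purely illustrative):

import bisect

def username_taken(sorted_usernames, target):
    # O(log n) lookup, valid only while the list stays sorted.
    i = bisect.bisect_left(sorted_usernames, target)
    return i < len(sorted_usernames) and sorted_usernames[i] == target

usernames = sorted(["alice", "bob", "john"])  # initial sort: O(n log n)
print(username_taken(usernames, "john"))  # True
print(username_taken(usernames, "jane"))  # False

The search itself is fast; the cost hides in keeping 100 million ever-growing usernames sorted. So, is there an even better solution?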
Bloom Filter: The Optimal Choice
A Bloom Filter is a space-efficient probabilistic data structure. Unlike traditional data structures, it doesn't store actual values but estimates the presence of an element with a small probability of error. If your system can tolerate occasional false positives, Bloom Filters can significantly improve performance.
We know that HashMaps and Sets can find elements in O(1) average time because they use hashing. Since Bloom Filters also rely on hashing, they achieve O(1) lookups. However, they come with a trade-off: false positives. This means the filter might mistakenly say a username exists when it actually doesn’t. On the bright side, Bloom Filters never produce false negatives: if the filter says a username doesn’t exist, that answer is 100% correct.
How Does It Work?
A Bloom Filter uses a bit array of size N, where all bits are initially set to 0. When inserting a username, a hash function calculates an index, and we mark that position as 1.
index = hash(username) % N
Later, when checking if a username exists, we check if the bit at that index is 1. If it is, the username might exist. If it's 0, the username definitely doesn't exist.
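Here is a minimal single-hash sketch in Python matching the description above; the 1,024-bit array size and the choice of SHA-256 are illustrative assumptions:

import hashlib

N = 1024          # bit array size (illustrative)
bits = [0] * N    # all bits start at 0

def index_of(username):
    # index = hash(username) % N, using SHA-256 as a stable hash
    return int(hashlib.sha256(username.encode()).hexdigest(), 16) % N

def add(username):
    bits[index_of(username)] = 1

def might_exist(username):
    # 1 means "might exist" (possibly a false positive);
    # 0 means "definitely doesn't exist".
    return bits[index_of(username)] == 1

add("john")
print(might_exist("john"))  # True: "john" might exist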
False Positives: The Trade-off
If two different usernames generate the same hash index, the Bloom Filter may incorrectly indicate that the second username is already taken. This is known as a false positive.
To reduce false positives, we use multiple hash functions. Each username then sets bits at several different positions in the array, and a lookup reports "might exist" only if all of those bits are 1, which makes a collision at every position far less likely (see the sketch below).
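Extending the earlier sketch with K salted hash functions might look like this (K = 3 and the counter-based salting are illustrative choices, not what production libraries necessarily do):

import hashlib

N, K = 1024, 3    # bit array size and number of hash functions (illustrative)
bits = [0] * N

def indexes(username):
    # Derive K different indexes by salting the same hash with a counter.
    for k in range(K):
        digest = hashlib.sha256(f"{k}:{username}".encode()).hexdigest()
        yield int(digest, 16) % N

def add(username):
    for i in indexes(username):
        bits[i] = 1

def might_exist(username):
    # The username can exist only if ALL K bits are set.
    return all(bits[i] == 1 for i in indexes(username))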
While false positives can never be eliminated entirely, well-known formulas give the optimal parameters: for n items and a target false-positive rate p, the bit array should have m = -n·ln(p)/(ln 2)^2 bits, and the optimal number of hash functions is k = (m/n)·ln 2.
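As a quick sketch, we can compute these directly; the figures below assume 100 million usernames and a 1% false-positive target:

import math

def bloom_parameters(n, p):
    # Optimal bit-array size m and hash count k for n items and
    # target false-positive rate p (standard Bloom filter formulas).
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = bloom_parameters(100_000_000, 0.01)
print(m, k)  # roughly 958.5 million bits (about 120 MB) and 7 hash functions

For our 100-million-user system, that is about 120 MB of memory for a 1% error rate, a small fraction of what the raw usernames themselves would occupy.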
Conclusion: Choosing the Right Approach
An O(n) approach isn't always bad, and an O(1) solution isn't always perfect. The key is understanding your system’s needs and choosing the most efficient method accordingly.
This post aimed to explain when and why Bloom Filters are useful. In a future post, I’ll discuss cases where an iterative approach can sometimes outperform a constant-time solution.
Share this post and help others grow! (And don't be like the guy in the meme below 😆)