Thoughts

mental health break ,./'"**^^$_---
I just spent literally 30 minutes looking for this HN comment because it is so phenomenally dumb.
Discussing Discord, HN commenter says:

> Their search makes me want to pull my hair out.
> Why can't I just search the history with grep? That's a feature I would pay for (if anyone at Discord is reading this and wants my $10/month)

Reply from a member of the Discord engineering team:

> Because it would be obscenely expensive (in terms of computational resources) to search history via grep. Hence why we use an inverted index.

This Discord engineer understands indexed search. This HN commenter is proposing O(n) time search.

Random HN user:

> This is megabytes (at most) of text we're talking about, not gigabytes. And ripgrep is absurdly fast. Grepping a 5MB text file should be pretty much instantaneous.

Discord engineer:

> Do you know of any service that is running at scale out there that stores their data as 5mb plain text files?

Random HN user:

> The “text files” don’t have to be the primary backing store - merely a derived cache file that can be periodically refreshed with new content. It doesn’t have to be text either - SQLite is pretty structured if you need more features in a minimal container too.

At this point the Discord engineer stops replying.

For context: in 2017, when Discord added search, they did an engineering blog post about it:

=> https://discord.com/blog/how-discord-indexes-billions-of-messages

At the time, their search infrastructure was:

> 14 [ElasticSearch] nodes across 2 clusters, using the n1-standard-8 instance type on GCP with 1TB of Provisioned SSD each. The total document volume is almost 26 billion.

Discord's userbase has increased drastically in the 6 years between that article and this HN thread.

This user was seriously explaining. To a Discord engineer. How they could use PLAIN TEXT FILES. and O(N) SEARCHING. on literally TRILLIONS of messages.

=> https://discord.com/blog/how-discord-stores-trillions-of-messages

So when they say 5MB text file obviously they're assuming some pre-division/indexing on server, maybe. So how many files is that?
From the link above, Discord's primary (not-search) DB has 72 ScyllaDB nodes each configured with 9TB of disk space, but I suspect that's more than the index. Somewhere between 14TB and around 300TB. Maybe 200TB. So you end up with 40 million 5MB text files.

Sure. Why not.

Well, because the biggest servers are going to have WAY more messages than smaller servers. So you would have to spend years of engineering effort to build some sort of index over these text files, just to know which text file you're searching and keep your run time from becoming O(n).

Which is what kills me. It's not literally impossible to store messages in plain text files and search them with grep. This commenter is clearly an engineer trying to problem-solve. They're just also an idiot.
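For the record, here is a rough Python sketch of the two points above: the file-count arithmetic, and the difference between a grep-style scan and an inverted index lookup. The 200TB figure is my own guess from above (in decimal units), and the messages and function names are made up for illustration. This is a toy, not how Discord or ElasticSearch actually implement anything.

```python
from collections import defaultdict

# Back-of-the-envelope file count, decimal units (1 TB = 10**12 bytes).
# 200 TB is a guess, not a published number.
TOTAL_BYTES = 200 * 10**12
FILE_BYTES = 5 * 10**6
num_files = TOTAL_BYTES // FILE_BYTES  # 40 million files

# Toy corpus (hypothetical messages, for illustration only).
messages = [
    "search is broken again",
    "just use grep",
    "an inverted index maps each term to message ids",
    "grep scans every byte",
]


def linear_search(term: str) -> list[int]:
    """grep-style: inspect every message, so cost grows with corpus size."""
    return [i for i, text in enumerate(messages) if term in text.split()]


def build_inverted_index(docs: list[str]) -> dict[str, set[int]]:
    """Map each whitespace-separated term to the ids of messages containing it."""
    index = defaultdict(set)
    for i, text in enumerate(docs):
        for word in text.split():
            index[word].add(i)
    return index


index = build_inverted_index(messages)


def indexed_search(term: str) -> list[int]:
    """One dictionary lookup; cost depends on the matches, not the corpus."""
    return sorted(index.get(term, set()))
```

Both `linear_search("grep")` and `indexed_search("grep")` return the same message ids here. The difference is that the first one touches every message on every query, and the second one pays that cost once, at index time.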
11:21 p.m. Jul 05, 2023 UTC-7