2 October 2024

Decentralised Open Indexes for Discovery (DOID)

TL;DR: A conceptual and technical framework for resource discovery on the WWW using decentralised, open, machine-readable indexes as the building block, free of the eroding quality and gatekeeping of BigSearch™ and BigPlatform™, whose goal is not quality, but revenue.


I have been using Kagi[1] for web searches ever since Google’s search quality degraded[2] to a tipping point for me, where I just gave up trying to coax it. This is despite having used an ad blocker since AdBlock Plus became mainstream on Firefox around 2006. The megabytes of JavaScript bloat that webpages download to render a few paragraphs of text and images—which jump around before settling between ads—is an insult to the user’s intelligence. How does one even find decent quality resources on the WWW (World Wide Web) when the biggest windows of discovery are all targeted advertising companies themselves?

Honestly though, browsing the web in the past had already become painful due to overtly spammy sites, poor search results, flashing banner ads, popups, and malware and exploits spread through non-sandboxed embeds. Today, the obviously spammy pages have evolved into more professional-looking websites that expand two lines of information into lengthy paragraphs of filler content, carefully crafted to cater to SEO (Search Engine Optimisation).

With search specifically, there was a period when Google did a great job providing access to high-quality search results that linked to valuable resources and penalised junk. There was decent separation of signal from noise. That Goldilocks era is long gone. And now, with the emergence of LLM (Large Language Model)-powered instant question-and-answer engines, the very need for “searching” the web, in the conventional sense, seems to be rapidly diminishing. Ironically, it is the same LLMs that are being used to generate vast amounts of SEO spam and pollute the WWW at an exponential rate. Perhaps now is the time, once again, to explore alternative approaches to search and discovery on the WWW.

Searching, browsing, answer-seeking

We have been relying on search engines that cover the entirety of the WWW to browse through resources they return, looking to piece together answers to questions we have. For instance, “How many districts does India have?”, “How many A4 sheets would I need to cover the surface of the Earth?”, or “What is the minimum time required for brewing Darjeeling tea?”. Many questions have concise, factual, or computable answers, while others have multiple subjective answers that vary based on parameters such as cultural and regional contexts. Either way, why does the entirety of the web, of which a significant portion is SEO and marketing spam, have to be searched to retrieve answers to such questions anymore?
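
The A4 question, for instance, is pure arithmetic; a back-of-the-envelope calculation using the roughly 510 million km² surface area of the Earth and the standard 210 × 297 mm A4 sheet gives about 8 × 10^15 sheets:

    # Back-of-the-envelope: how many A4 sheets to cover the Earth's surface?
    earth_surface_m2 = 510.1e6 * 1e6   # ~510.1 million km^2, in m^2
    a4_area_m2 = 0.210 * 0.297         # 210 mm x 297 mm
    print(f"{earth_surface_m2 / a4_area_m2:.1e}")  # ~8.2e+15 sheets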

If the quest for developing “expert systems”[3] and question-and-answer agents had yielded generally useful tools a long time ago, the knowledge-seeking behaviour of billions of people on the internet would look very different today. This finally seems to be on the cusp of changing with the advent of technologies like LLMs, enabling natural language expert systems that can seek and synthesise concise answers (setting aside the obvious questions of veracity and validity for now).

Over the past year, as a programmer and tinkerer, the vast majority of my technical queries have ceased being conventional “searches”. They have become direct questions posed to LLM-based systems that yield instant answers, which I then validate using personal faculty or other means, for obvious reasons. Even then, what would generally take tens of minutes, sometimes hours, of clicking through webpages (mostly technical documentation, forums, StackOverflow, GitHub issues etc.) looking for a solution to an obscure technical hiccup, now takes seconds or minutes. The amount of time I have been saving on a daily basis makes me realise the ridiculous tedium of conventional web search as the go-to means for answer-seeking.

Apart from the question-and-answer queries with LLMs, technical and otherwise, my web searches now mostly involve queries for discovery and general research on topics I am curious about. To be honest, the significant majority of those end up on Wikipedia pages, a resource trusted to a good extent by hundreds of millions of people, including me. Other good resources lie buried deep within piles of SEO fluff and spam, never to be discovered via BigPlatform. How can one, without significant effort, even evaluate search results from platforms whose incentives are purely financial rather than optimal resource discovery? If recency of a resource is a key metric search engines use when surfacing results, which now does seem to be a big factor indeed, then quality as a parameter is already out of the question.

In summary, for the bulk of my daily queries, I need a question-and-answer system. For the remaining queries, I need a search index that points to resources that have content on topics that I am interested in. There is a clear distinction between the two.

Speaking of question-and-answer systems powered by LLMs, two separate debates have been set aside in the context of this post—the validity of LLM-produced answers (which are actually pretty good for technical queries, for me, at least) and the ethics and philosophy of these systems absorbing human-produced knowledge wholesale.

Quality

With each passing year, my conviction in the inevitability of the rather dramatic-sounding Dead Internet Theory[4] grows stronger. The WWW has become a cesspool of bot-generated spam and lowbrow, eye-catching, clickbait “content” churned out non-stop in bulk, driven by perverse incentives. There are significant amounts of good stuff of course, which lie buried deep, never to be discovered. With the advent of generative AI systems like LLMs that enable mass production of content in various forms, the Pandora’s box has been busted wide open, accelerating the decay. While it should worry us that a large population consumes and enjoys such content, an even more alarming thought is that this version of the web is becoming the baseline for young people growing up with it. The spam, ad, bot, and clickbait-filled, social media-centric internet, curated by BigPlatform blackboxes, is becoming the de facto mental model of the WWW for many. This trend is so pervasive that the very foundation of the internet—a decentralised network of networks—may eventually be forgotten.

The crux of the matter here is not a judgment of the “quality” of content people like, but the fact that it is all enabled, hosted, curated, and pushed, very conveniently, by a handful of large entities, whose goals are not the dissemination of knowledge or resources. The pathways to discovery of resources on the WWW are highly homogenised, gatekept, and controlled to the point that a large number of people may not even realise that there can be many pathways. The fact that realisation and resistance exist only in niche communities is not ideal, overall, for the future of the WWW and humanity in general.

All that said, there is always a tipping point in large, complex, chaotic human systems. When this “content” race reaches rock bottom, a critical mass of people will get sick of it and look for alternatives, at which point, viable alternatives must be available. On the flip side, there has always been and will always be a small but significant population that creates and maintains stuff, here, digital artifacts on the WWW, for reasons other than perverse incentives. The question is, in an environment with an exponentially poor signal-to-noise ratio, how can any of it be made discoverable without BigPlatform being the arbiters of what is worth discovering?

Assumptions and leaps

The following is a set of assumptions and observations about the current state of the WWW and search. While I draw these from a blur of stewing thoughts and annoyances over the years, many now seem to have quantifiable evidence, or at least consensus from an increasing number of places.[5]

  • We now seem to be in the late-stage, wholesale enshittification[6] of tech and the internet in general.[7] Then there are its cousins, surveillance capitalism[8] and the attention economy, the primary driving forces of the degradation.[9]
  • The WWW, especially with content-generating AI technologies, is reaching a saturation point of low-quality “content” and SEO manipulation, accelerating the Dead Internet phenomenon,[10][11] making it exceedingly difficult to find genuine, quality resources.
  • Despite that, there always has been, and always will be, a small but significant number of actors who create quality resources for the right (or at least less-wrong) reasons and publish them on the WWW.
  • There is no point in scraping and indexing the whole of the WWW anymore, if most of it is garbage. Such a search system has fast-diminishing value. How many of us ever even go to the second page of search results?
  • Natural language question-answering systems will likely absorb many queries that would otherwise have been web searches, further reducing the value of global WWW search.
  • It is practically impossible to build better competition to BigPlatform without massive resources. But, based on the above assumptions, it may no longer even be necessary to do so.
  • BigPlatform are becoming less trustworthy as the gatekeepers and purveyors of resources on the WWW. Their algorithms, driven by perverse incentives, no longer prioritise quality as a metric for surfacing resources. The legal and geopolitical conversations about digital monopolies, large platforms, matters of anti-trust etc. that have become mainstream in the last five years lend further weight to this.
  • A good amount of public trust in them and their blackbox algorithms will diminish, leading to a reversion to trust in peer and community recommendations for knowledge-seeking and discovery. The decline in the efficacy of reviews on large online platforms[12] is yet another indicator of this trend.
  • A significant number of people will seek alternative pathways to access the WWW. Now feels like an opportune time to experiment with alternative approaches.
  • Unless the UX (User Experience) for such approaches is simple and accessible, they will struggle to find adoption. UX absolutely matters.

Deconstructing discovery

A web resource[13] is the most fundamental constituent of the WWW—a webpage or any other resource identified by a URL. A search system needs a list of URLs to crawl, index, and make discoverable in the form of search results. The fundamental issue is that search engines, in their quest to cover the whole WWW, pick up URLs from all over, automatically, and then rely on algorithms (like PageRank[14]) to determine quality and relevance. This may have been a reasonable approach in the pre-dead-internet and pre-enshittification era, but seems largely futile now. Perhaps even a Borgesian library,[15] which the WWW has started to resemble in an eerie manner, would be useful to mortals if it had curated indexes!

What if the URLs to be indexed, classified under arbitrary areas of interest or categories, were curated by people and communities, based on whatever criteria suit them? For instance, what if an individual or a small group of individuals with an interest in tea, who are likely to know of and come across resources on tea, curated them and put them up somewhere? Or a community of protein researchers started publicly jotting down the list of resources they access for their research? Whether anybody cares about such lists or not is immaterial. They will go through the dynamics of decentralised adoption and acceptance like any other resource on the WWW.

Is such a premise possible? Of course. DMOZ[16] was a big deal back in the day—I, like many, was an avid user, and it was one of my primary entry points to the WWW. Today, again, I, along with many millions of people, almost blindly trust EasyList[17] to block ads and make browsing the web bearable. GitHub is full of curated lists of quality technical resources, generally referred to as “awesome lists”.[18] There are lists that are trusted, there are lists that have gone bad and caused loss of trust, and there are lists that have been abandoned—standard dynamics of human curation, recommendation, and trust that play out in random decentralised ways.

Nothing new here, really. However, lists of URLs with no UX on top are of little value to the average user. People have always collected and curated lists of URLs in various ways since the dawn of the WWW—everything from webrings[19] to directories to Delicious[20] to StumbleUpon.[21] Many specific tools and platforms have died out or have been degraded, slowly turning into social networks. As someone once said, “A piece of software either dies a utility that falls out of favour, or lives long enough to be forcefully converted into a social network.”

On the other hand, a vision of search and resource discovery on the WWW, fundamentally based on freely growing and exchanging lists with some element of trust, where every building block and actor in the whole scheme is decentralised and adheres to certain standards or specifications, does not seem to have gained traction.

Structure

There are cool experiments like the Gemini protocol,[22] but attempting to drive a new protocol with strict prescriptions for how content and information should be laid out is a tough, uphill battle if wide adoption is the goal. This was one of the factors that ultimately led to the demise of the Gopher protocol.[23] Then there are systems like YaCy,[24] specific pieces of software that offer different kinds of search and discovery experiences but require technical expertise to operate. An approach where technical prescription is as minimal as possible as far as the end user is concerned, and which is compatible with the current scheme of things, is worth attempting.

Thus, Decentralised Open Indexes for Discovery (DOID) is an abstract set of ideas accompanied by minimal technical specifications. It is meant to enable shareable, standardised, machine-readable indexes of web resources as building blocks for higher-order search and discovery experiences, banking on the ideas of decentralised curation and trust.

  • A structured machine-readable index: This would be a file, distributed under an open license, in RDF/XML, JSON, CSV, SQLite, Arrow, Parquet—whatever turns out to be the optimal pick, taking into consideration not just semantics, but practicality of storage and distribution as well. A 20 GB index in one format vs. 2 GB in another will make a huge difference to adoption, so recommendations for preferred formats would be ideal. An entry in the index is a URL to a resource (a webpage, a PDF file, a YouTube video, or any valid URL) with some basic metadata that is useful to humans—say title, description, timestamps, tags, and some kind of text tokens that enable basic search UX on top of it (a sketch of what an entry and a naive search over it might look like follows this list). Any individual, group, community, or organisation can make and distribute indexes. They can cover the broad WWW, or they can be niche and topic-based. Perhaps an index of curated music-related resources with tens of thousands of entries, or one on history with millions of entries. Whether it is done by hand or with automation, or both, does not matter. Adoption and usage will depend on the dynamics of public accountability and trust, like how it works with FOSS (Free and Open Source Software) or a publicly maintained resource like Wikipedia. Some efforts will be smooth, some will be tumultuous, some will fail. It does not matter. There is no centralised platform.

  • Tokens: Now this is the tricky bit. An application that reads such an index, if it cannot offer immediate, semi-decent search and discovery UX on top of it, would be no good for the average user. The challenge here lies in figuring out what types of tokens a standardised index can contain that are reasonably universal—fulltext, vectors, or provisions for pluggable algorithms. The goal is to enable discovery of the topics or concepts a resource represents, not to index all its contents. Indexes that ship with tokens enabling good discovery UX would gain adoption, while others may struggle. Thus, the creation of a rich index would require some technical chops, which is different from link curation.

  • Distribution: If someone imports the Wikipedia dump and creates an index out of it, the file would weigh several GBs with millions of entries. The infrastructure requirements, technical effort, and cost of distribution thus become prohibitive from the get-go. Well, not if the spec provides for something like BitTorrent as the preferred distribution method, in addition to other modes. A mechanism for delivering regular, differential updates of indexes is also necessary, which is a reasonably difficult engineering problem to solve and standardise (a hypothetical manifest covering both is sketched after this list).

  • Clients: This area offers the most potential for innovation. Clients consume standardised index files and provide basic search and topic discovery features. The specific UX and features are up to the client implementation, and clients can compete on providing better experiences and value additions. Clients can be mobile apps, desktop apps, serverside apps, or web apps. Clients can be installed locally, or someone could install a web-based client and offer a search and discovery service to many. Clients can be generic, or there could be ones that focus on niche topics or aggregate multiple indexes—a pharma research discovery engine, for instance. Some clients might even crawl the resources in an index to build full-fledged content search engines, potentially leading to the emergence of more indie search engines like Marginalia.[25] While many clients and offerings may come and go, and many might fail or succeed, diverse, non-homogenous choices beyond those dictated by BigPlatform can emerge, which is the goal.
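
To make the index and client pieces above a little more concrete, here is a minimal sketch, in Python purely for illustration, of what a couple of index entries and a naive client-side token search might look like. The field names, tokens, and scoring are entirely hypothetical placeholders, not a proposed schema.

    # A minimal, hypothetical sketch of DOID index entries and a naive
    # client-side token search. Field names and scoring are illustrative only.
    index = [
        {
            "url": "https://en.wikipedia.org/wiki/Darjeeling_tea",
            "title": "Darjeeling tea",
            "description": "Black tea grown in the Darjeeling district of West Bengal, India.",
            "updated_at": "2024-09-01T00:00:00Z",
            "tags": ["tea", "darjeeling", "india"],
            "tokens": ["tea", "darjeeling", "black", "india", "brewing", "estate"],
        },
        {
            "url": "https://example.org/tea-brewing-guide",  # hypothetical URL
            "title": "A practical tea brewing guide",
            "description": "Steeping times and temperatures for common tea varieties.",
            "updated_at": "2024-08-15T00:00:00Z",
            "tags": ["tea", "brewing", "howto"],
            "tokens": ["tea", "brewing", "steeping", "temperature", "darjeeling", "oolong"],
        },
    ]

    def search(query: str, entries: list[dict]) -> list[dict]:
        """Rank entries by naive overlap between query terms and entry tokens/tags."""
        terms = set(query.lower().split())
        scored = []
        for entry in entries:
            overlap = terms & set(entry["tokens"] + entry["tags"])
            if overlap:
                scored.append((len(overlap), entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)  # sort by score only
        return [entry for _, entry in scored]

    for hit in search("darjeeling brewing", index):
        print(hit["title"], "->", hit["url"])

A real client would of course do much better (stemming, ranking, fulltext or vector search), but even this crude overlap surfaces both entries for a query like “darjeeling brewing”.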
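
Similarly, for the distribution piece, a small manifest file accompanying each index could describe where to fetch snapshots and differential updates over whatever transports the spec settles on, BitTorrent among them. Again, this is a hypothetical sketch with made-up field names; pinning down the actual schema is exactly what a spec would have to do.

    # Hypothetical manifest describing an index snapshot, its distribution
    # channels, and differential updates. All field names are placeholders.
    manifest = {
        "name": "tea-resources",
        "version": "2024.10.01",
        "license": "CC BY-SA 4.0",
        "format": "sqlite",
        "size_bytes": 104857600,
        "sha256": "<checksum of the snapshot>",
        "distribution": {
            "http": ["https://example.org/indexes/tea-resources-2024.10.01.sqlite"],
            "bittorrent": "<magnet link to the snapshot>",
        },
        # Clients holding an older version could fetch small patch files
        # instead of re-downloading the whole snapshot.
        "deltas": [
            {
                "from": "2024.09.01",
                "to": "2024.10.01",
                "http": ["https://example.org/indexes/tea-resources-2024.09.01-to-2024.10.01.patch"],
            },
        ],
    }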

Ah, to dream! … but within the realms of practicality and viability ;)

What’s new?

All these pieces and facets already exist in isolation. Nothing prevents someone from curating a list of resources, turning it into an index, and distributing it, nor does anything prevent someone else from building a nice, accessible UX on top of it. It does not happen at a meaningful scale because the different pieces of the puzzle have not been articulated coherently, with uniform technical standards laid out, to deliver good UX.

The articulation in this post is an alternate vision based on concepts that already thrive and engineering problems that have been solved elsewhere. At the heart of it is the idea of freely flowing, open, digital artifacts—self-contained indexes of WWW resources that are created, maintained, grown, distributed, disputed, discarded, forked, and improved, by non-big-platforms. Artifacts whose lifecycles are governed by human dynamics of trust, recommendations, and adoption, with the implicit accountability that emerges from them, like in the FOSS ecosystem. Dynamics that play out however, wherever, in a highly decentralised fashion, again, free from BigPlatform.

It stems from the belief (and mounting evidence) that on a rotting WWW, BigPlatform and their algorithms cannot be trusted to be the gatekeepers of discovery of good resources buried in the noise; that conventional web search is dying and we need to prepare alternate forms of discovery for the inevitable tipping point; that we need to start thinking of ways to separate out the 0.0001% signal from the 99.9999% noise to make them accessible.

What next?

When I manage to find a bit of time, I plan to attempt the following. In the meantime, there is a mailing list, in case there is any interest out there.

  • Draft a v0.0.1 technical spec describing all of the different pieces—the schema for the index, optimal machine-readable file format(s), distribution and update mechanisms, and the schema for a manifest file that accompanies and describes index files. The effort here is nailing a reasonable spec and file formats with a reasonably low surface area. There is no need to dictate what clients do.
  • Create a prototype index out of the Wikipedia dump with links to every single article, each with a set of tokens that covers the important topics in it. This could use a bit of cleverness, perhaps something as simple as tf-idf along with Wikipedia’s own semantic metadata, or even getting an LLM to synthesise tags based on perceived importance for a given article (a rough sketch follows this list). For instance, the entry for Queen (band) could have several dozen tags describing important topics in it, such as music, rock, pop, 1970s, Freddie Mercury ...
  • Create a prototype web app, with good UX, that loads this index and offers an instant topic search and discovery experience on top of Wikipedia.
  • Iterate and make more prototypes and wait and watch.
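
For the tf-idf idea in the second item above, a rough sketch, assuming the plain text of articles has already been extracted from the dump and using scikit-learn purely as one convenient option, could look like this:

    # Rough sketch: derive candidate tags for Wikipedia articles via tf-idf.
    # `articles` maps article titles to their extracted plain text.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def candidate_tags(articles: dict[str, str], per_article: int = 20) -> dict[str, list[str]]:
        titles = list(articles)
        vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
        matrix = vectorizer.fit_transform(articles[t] for t in titles)
        vocab = vectorizer.get_feature_names_out()
        tags = {}
        for row, title in enumerate(titles):
            scores = matrix.getrow(row).toarray().ravel()
            top = scores.argsort()[::-1][:per_article]
            tags[title] = [vocab[i] for i in top if scores[i] > 0]
        return tags

    # e.g. candidate_tags({"Queen (band)": queen_text, "Darjeeling": darjeeling_text})

Raw tf-idf alone would be crude; combining it with Wikipedia’s own categories and metadata, or an LLM pass, as mentioned above, would likely yield far better tags.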
Diagram of the constituents of DOID

FAQs

… that I have posed to myself.

Meh, techno-utopian mumbo-jumbo.

Maybe. Sure. At least, we are only talking about sharing open files created by whoever wants to create them, not building some mega, global, connected platform. Even from an engineering perspective, the pieces seem viable. The goal is not to create something like a browser or a global network from scratch.

Why would anyone bother creating an index?

People have always done so and will continue to do so. From DMOZ of yesteryear to the numerous “awesome lists” on GitHub to EasyList to the entire FOSS ecosystem, there will always be individuals, groups, communities, and institutions that invest effort into creating and maintaining quality, valuable digital artifacts and resources. Someone could just curate links and resources. Someone with tech chops could create an index on top of them and distribute it. Anything is possible.

So, is the goal to build a full-blown search engine for the whole WWW?

That is not the goal. The argument is that there is no point in even indexing or searching the vast majority of the WWW anymore. The goal is to pave alternate paths for content and topic discovery, not to perform full-text or content search of entire resources. Engines and tools that do that could emerge on top of the indexes.

What stops someone from going rogue or selling out and polluting an index?

Nothing, really. But the same dynamics that play out in the FOSS world—loss of trust and reputation, forking, etc.—can apply here too. The ideas of choice and exit paths are inherent in the nature of openness and decentralisation. When someone abruptly closes a widely used FOSS project, someone else forks it, and if there is enough demand, it lives on.

Fine, how does one discover indexes in the first place?

Many mechanisms could emerge naturally. Word of mouth, index search engines and aggregators, directories, trackers, registries like npm,[26] PyPI[26] etc.

This will die a hacker’s pipe dream.

That’s okay. Maybe this won’t even be born to die. The goal is to attempt. To attempt alternate visions for discovery of resources on the WWW that are outside of BigPlatform. If not this, then something else. From the Gemini protocol to machine-readable indexes that optimistically bank on societal dynamics of trust, there have to be experiments.

Bonus

In a 4D-meta-recursive-ironic plot twist, I was playing around with Google’s NotebookLM AI. Here (MP3, 8.5MB) is a two-person podcast-style conversation it generated about this blog post. It is quite something, but then again, it is yet another “content” floodgate about to open, whose utility will drown in noise.

* This post is licensed under CC BY-SA 4.0