Sat, May 25, 2013

Realtime Javascript autosuggestions for 60,000 words


An autosuggest solution for delivering realtime search suggestions in the browser from a database of about 60,000 words.


  • Make about 60,000 English words in a MySQL database available as suggestions in a text box as a user keys in a query
  • Reduce the latency to achieve realtime suggestions, akin to Google search suggestions
  • Minimize the server load and redundant requests to the database as much as possible
  • Scale to millions of requests a month

Direct queries to the MySQL database — bad

To begin with, it doesn’t make sense to query the database for a word match on every single keypress. It’ll quickly start taxing the database with unnecessary load. When a user keys in the query apple, for instance, the database is queried on every keypress with the following patterns: a*, ap*, app*, appl*, apple*. That is five requests for five keystrokes (assuming this is on top of a simple input thresholding timer). Far from efficient.
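The “input thresholding timer” mentioned in passing is a plain debounce: the query handler only fires once the user pauses typing. A minimal sketch (the function name and delay are illustrative, not from the original implementation):

```javascript
// Wrap a handler so it only runs after `delay` ms of inactivity.
// Each new keystroke cancels the previously scheduled call.
function debounce(fn, delay) {
	let timer = null;
	return function (...args) {
		clearTimeout(timer);
		timer = setTimeout(() => fn.apply(this, args), delay);
	};
}

// e.g. input.addEventListener("keyup", debounce(runQuery, 150));
```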

Even if the database server could handle the queries without any hiccups, it’s still impossible to achieve realtime suggestions. There is no easy way to overcome network latency (unless you are Google). If it takes 150ms to get back with suggestions for a keystroke, it doesn’t make for a great user experience.

Send everything to the client

To get realtime suggestions, really, the suggestion data has to be at the user’s end, stored in the browser. The network connection and the database querying have to be taken out of the equation. Now, sending 60,000 words (amounting to multiple MBs uncompressed) to a user’s browser is nonsensical. What can be done, however, is sending subsets of it.

When the user types the first letter of a search query in the text box, for instance “a”, let’s say there are 4,000 words in the database beginning with a*. Instead of querying the database, if the entire set, all 4,000 words, is sent to the browser, and thereafter, the querying and population is done by Javascript locally in the browser, we have achieved realtime suggestions.

Fetching 4,000 words via Ajax may sound heavy, but it’s actually not all that bad.

  • Remove unwanted cruft from the words to reduce the number of bytes
  • Use a single character (\n or \t) separated list instead of JSON to further save precious bytes
  • It’s a good idea to sort the words to aid something like a binary search at the clientside
  • Use the magic wand—Gzip the data

This way, in my particular case, 4,000 words that amounted to 90 KB raw were cut down to a mere 16 KB. It would only take a few milliseconds for an Ajax request to finish receiving this data. The only request to the server, and the only minor delay that happens is when the user presses the first key.
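On the client, that Gzipped, newline-separated payload needs nothing more than a split to become a usable word list (the browser decompresses the response transparently). A sketch, where the endpoint path and parameter name are assumptions for illustration:

```javascript
// Turn the server's newline-separated, pre-sorted response into an array,
// dropping any empty trailing entry left by a final "\n".
function parseWordList(text) {
	return text.split("\n").filter((w) => w.length > 0);
}

// Hypothetical one-time fetch for all words beginning with `letter`.
async function fetchWords(letter) {
	const res = await fetch("/suggestions?word=" + encodeURIComponent(letter));
	return parseWordList(await res.text());
}
```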

Cache it for good

We’ve just received 4,000 words beginning with “a”, which means we don’t need to talk to the server anymore to get suggestions for any query that begins with “a” for a long time.

HTML5 localStorage is your friend. Cache the received subset for as long as possible. If the nature of the data is such that it doesn’t change in a month, make the clientside logic only request suggestions from the server after a month (a simple timestamp flag, again in localStorage). All autosuggestion goodness happens in the browser and the server is hardly ever contacted again.
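The timestamp-flag idea can be sketched as a pair of small helpers. The key names and the one-month TTL here are illustrative; `storage` is passed in so the same logic works with `localStorage` in the browser or a stand-in elsewhere:

```javascript
const MONTH_MS = 30 * 24 * 60 * 60 * 1000;

// Store a letter's word subset along with the time it was fetched.
function cacheWords(letter, words, storage, now) {
	storage.setItem("words_" + letter, words.join("\n"));
	storage.setItem("ts_" + letter, String(now));
}

// Return the cached subset if it is still fresh, otherwise null
// (signalling the caller to fetch from the server again).
function getCachedWords(letter, storage, now) {
	const stamp = storage.getItem("ts_" + letter);
	if (stamp !== null && now - Number(stamp) < MONTH_MS) {
		return storage.getItem("words_" + letter).split("\n");
	}
	return null;
}

// In the browser: getCachedWords("a", localStorage, Date.now())
```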

Real search happens at the clientside

Once we have the entire possible dataset for words beginning with a* in the browser, it’s easy to implement something like a binary search, or even a serial indexOf() substring match, to get the first N word matches. Pipe the results to any favorite Javascript autosuggestion library.
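Since the list arrives pre-sorted, the binary-search variant is straightforward: find the first word not less than the prefix, then walk forward collecting matches. A minimal sketch (names are illustrative):

```javascript
// Return up to `limit` words from the sorted array `words` that start
// with `prefix`, using a binary search to find the first candidate.
function prefixMatches(words, prefix, limit) {
	// Lower bound: first index whose word compares >= prefix.
	let lo = 0, hi = words.length;
	while (lo < hi) {
		const mid = (lo + hi) >> 1;
		if (words[mid] < prefix) lo = mid + 1;
		else hi = mid;
	}
	// Collect consecutive words that actually carry the prefix.
	const out = [];
	for (let i = lo; i < words.length && out.length < limit; i++) {
		if (!words[i].startsWith(prefix)) break;
		out.push(words[i]);
	}
	return out;
}
```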

Only 26 requests to the server at worst

This way, no matter how many queries a user types in, it can only ever result in a maximum of 26 requests to the server, one for each letter a-z. It’s also reasonable to assume that a user is not going to search for words beginning with every letter from a to z in one sitting. In a typical case, a user will only generate a handful of requests, and those will be sufficient for all autosuggestions for the foreseeable future, as long as the cache is good. Hundreds of thousands of expensive requests to the server have been cut down by orders of magnitude.

The server

In my case, I went a step further and took the MySQL server out of the equation completely. In its place sits a standalone server that holds all the words (taken from the MySQL database) in memory in a list and responds to the Javascript queries directly.

Using something like Apache for this purpose would of course be like employing an elephant to transport a matchbox. I ended up using CherryPy, a fast pure Python HTTP server (Tornado was my first preference, but it turned out to develop mysterious memory leak issues. After about 3 days of serving a few hundred thousand queries, it’d consume half a GB of RAM and become unresponsive, needing a restart).

When the server is run for the first time, it gathers all the words from the MySQL database and puts them in a dict with 26 keys, a-z. Each key, for example, a, holds a list of all the words that begin with it.

When the server receives a request from the browser, it simply outputs all the words in the list, Gzipped. Here’s the pseudocode for the CherryPy word server.

import cherrypy

class Server():

	WORDS = {}

	def __init__(self):
		"""Initialize by gathering all the words from the database and storing
		them in a dict in memory"""

		self.WORDS = self._get_indices( self._get_db_words() )

	def get(self, letter=None):
		"""Return all the words beginning with the given letter"""

		if letter and letter in self.WORDS:
			return "\n".join(self.WORDS[letter])

		return ""

	def _get_db_words(self):
		"""Get the words from the database"""

		# do the database logic here and return a list of words, sorted a-z

		words = []
		return words

	def _get_indices(self, words):
		"""Create a dict with keys (a-z), where each key holds the list of
		words beginning with that letter"""

		indices = {}
		for w in words:
			letter = w[0].lower()
			if letter not in indices:
				indices[letter] = []
			indices[letter].append(w)

		return indices

# ===

server = Server()

# cherrypy config
config = {
	'global': {
		'log.screen': False,
		'tools.gzip.on': True,
		'server.socket_port': 8080,
		'server.socket_host': "localhost"
	}
}

class App(object):
	@cherrypy.expose
	def suggestions(self, word=None, *args):
		cherrypy.response.headers['Access-Control-Allow-Origin'] = '*'
		cherrypy.response.headers['Content-Type'] = 'text/plain'
		return server.get(word)

if __name__ == "__main__":
	cherrypy.quickstart(App(), config=config)


Of course, this is a very particular use case and may not apply to other autosuggestion scenarios. However, the underlying concept still stands. If it’s word matching, it’s a good idea to send chunks of data to the client, cache it, and run the search logic there (provided the data is non-sensitive and you don’t mind it residing with the client).

Users of the Olam dictionary now enjoy uninterrupted, realtime word suggestions as they type, for the entire dictionary. The CherryPy server handles millions of word requests every month with no hiccups whatsoever.
