This is a quick English translation of a draft I wrote about the Threshold Algorithm of Fagin, Lotem, and Naor, which received the Gödel Prize at ICALP 2014. I’ve run this through Google translate and ironed out the worst grammatical quirks by hand, but the result is far from idiomatic English. My apologies. A slightly abridged version of the Danish text was published in the Danish weekly broadsheet Weekendavisen 2014 #27 from July 4th 2014, in the Science section Idéer, p. 12.
Algorithms. On July 9, the Gödel prize in theoretical computer science at a conference in Copenhagen. This year, the award goes to an algorithm facilitating personalized information retrieval. The existence of this kind of efficient algorithms determines our perception of digital reality.
Information technology has led to two quite radical changes in the way we access information. In the “old days” of the mid-nineties, information was organized like a library. The web service Yahoo! had categorized the web’s pages by topic. If you wanted to find anything about Paris, you had to navigate through a hierarchy—Geography, Europe, France, cities, Paris—just like finding the right shelf in the public library. Google’s page ranking algorithm PageRank from 1998 was a fundamental rethinking of this centuries-old tradition. Now you could write “Paris” into a search field, and immediately got a list of all web pages about Paris, ranked by quality—a little like walking up to the librarian, muttering a single word, and let her do all the lookup for you. Categories and hierarchies were gone! Google’s algorithm was so fundamental that today it seems inconceivable to search the web in any other way than by ranked text search.
The next change came in the late noughties: personalized search results. The term “Paris Hilton” means something completely different to Jürgen, a German tourist in a European city than to Ashley, an American teenager. Therefore, search engines like Google try to adjust the ranked list of answers to the individual user, based on the digital traces we have left. The user’s age and geographical residence is enough to recommend the right hotel for Jürgen and the right celebrity Ashley. Google’s vision is to answer the question before the user has made it. To keep with the library metaphor, this is equivalent to walking up to the librarian and get stuck a book in your hand before you open your mouth.
Personalization is the information reality we live in right now. Google personalises web searches, Amazon personalises Amazon book recommendations, Netflix personalises movies, and Spotify personalises music. Dating services have always been personalised and the large web-based dating sites have obviously jumped the personalisation bandwagon as well. Personalization has many interesting potential consequences, such as the creation of so-called potential filter bubbles in information dissemination, which WA has covered earlier [It’s all about you, WA 20, 2011].
The algorithms involved in making personalisation work have a direct and significant impact on how we perceive the world—for better or worse. One of them, the Threshold Algorithm for top-k queries by Ron Fagin, Amnon Lotem, and Moni Naor will receive theoretical Computer Science’s most distinguished honour, the Gödel Prize, at a conference in Copenhagen.
While algorithms take a lot of choices for us and know more and more about us, the public knows precious little about algorithms. However, an algorithm is neither artificial intelligence, black magic, or “what a programmer says when he does not want to explain what happens.” Algorithms are problem-solving methods, expressed in terms so precise that it can be performed by a computer—which you can think of as an infinitely patient, obedient, and annoyingly literal-minded twelve-year-old.
Before the personalisation wave from 2010, a search service could content itself with sorting information by just one parameter. For example, a bestseller list ranks books by how many books were sold. But personalisation opens the possibility to rank and categorize books by price, age, number of pages, language, genre, etc. If a book was written by a Danish author may be very relevant to Mrs Hansen is totally irrelevant to Mrs Jensen. And Mrs Hansen may hate crime novels while Mrs Jensen loves them. Personalisation is therefore implemented as a weighting of the product’s many parameters, with the weighting tailored to Mrs Jensen.
But how do we now keep our books when each user has his or her own view of what is relevant? In principle, this is very easy to solve: Run through all books, calculate Mrs Jensen’s score for each of them, sort the results by score, and present her with the top 20 results.
Unfortunately, this takes too long.
This bears repeating, because it is contrary to our intuition: Sorting takes too long.
How can this be? Aren’t computers becoming faster and faster? Yes, they are. Technological developments increased processor speeds dramatically since Princeton mathematician John von Neumann’s ENIAC from the end of World War II, which weighed 30 tons and consisted of 17468 vacuum tubes. But at the same time, data volume has also grown bigger and bigger. In fact, computer memory and problem size have grown even faster than the processor speed. You could say that over time, computers have gotten slower and slower compared to the computational problems we expose them to. As a rule of thumb, “it takes a computer a second to touch all data” has been true then and now. Thus, the algorithmic challenge is an aspect big data: we barely have time to answer even the simplest questions on very large data volumes.
Of course, one might say that a second may not be such a long time anyway. Mrs Jensen would reasonably be expected to wait a second until we have reorganised the entire library according to her preferences. But the problem is that our hypothetical web service hopes to have more customers than just Mrs Jensen; we would like to serve more than one user per second. In fact, we hope for a few million per second. And we would not want to invest in a few million computers to serve these customers.
On the other hand, we need to perform the calculation immediately, because there are too many different customers for us to prepare a personalised copy of the entire library for each of them. That approach would require a copy of the entire data set for each user, and we don’t have that much storage space.
In summary, the algorithmic problem occurs because the data set is too large to sort it straight away, and there are too many customers to sort everything in advance.
Without such an algorithmic solution to this problem there would be no personalisation. In fact there are many computational problems, where exactly this occurs: algorithms research has failed to find a clever solution. Certain desirable algorithms simply do not exist. Sometimes, this fact of nonexistence can also be proved.
The Austrian logician Kurt Gödel’s own research in the 1930s, long before Computer Science was its own discipline, focused precisely on stringent mathematical arguments for the absence of algorithms for specific problems.
But it turns out that four our problem at hand, an efficient algorithm does, called the Threshold Algorithm. It is short and elegant, and you try to come up with it yourself of google it. The elegant solution requires an data structure consisting of the entire database sorted for each parameter individually, as well as some additional bookkeeping and cross-references between the lists. In these sorted lists a few steps suffice to find the answer. The algorithm is easy to follow and easy to implement as program code. The analysis of correctness and complexity requires a little more computer science background.
Theoretical computer science is a mathematical discipline, so the algorithm is published along with a rigorous analysis of efficiency and a formal proof of optimality. The Threshold Algorithm was discovered and published in 2001 by three Israeli and American researchers.
The Gödel Prize is an award in theoretical computer science, so it has probably been decisive that the mathematical aspects of the outcome meets the aesthetic requirements of Mathematics for simplicity. The Threshold Algorithm can be described in 10 lines, and the authors were originally worried about getting the manuscript rejected because it looked too easy. This did not happen. The article was immediately admitted to the ACM Symposium on Principles of Database Systems in 2001, won the best paper award, won the symposium’s Test-of-Time Award 10 years later and now the Gödel Prize.
Many algorithms in the computer science literature are pure basic research and never find their way to a product, but the Threshold Algorithm is a good example of one of the now many pieces of theoretical computer science that places a direct influence on our reality, as reality becomes more and more algorithmic. They choose our next book, our next news and perhaps our next relationship.
Gödelprisen given annually since 1993 for outstanding achievements in theoretical computer science. The award ceremony will take place alternately between North America and Europe.
Fagin, Naor, Lotem, Optimal aggregation algorithms for middleware, J. Comput. Syst. Sc. 66 (2003) 614-656. [PDF]