Category Archives: Exposition

More Algorithms in Svenska Dagbladet

“Coded cultural choice” on the cover of SvD’s Arts section, illustrated by badly indented Java code.

My quixotic attempts to insert computer science into the memetic space of the literati continue to bear fruit. Today’s score is 3 pages in Sweden’s major newspaper Svenska Dagbladet. Of course, this only works on a day where the entire cultural elite is still hung over from actually having a life at New Years’s Eve, so there is a narrow window of opportunity where no other culture is being produced.

The article is about the effect of algorithms on the consumption of culture. One of the arguments that I’ve been trying to make for a while is that (a) the human condition is increasingly determined by access to information (b) the main force affecting information is algorithms. Thus algorithms are a worth understanding for other reasons than (a) intellectual curiosity or (b) technological progress.

The web version of the article is online at

Such an article are the result of an interview that takes over an hour, and where I say a lot of things about a lot of things, and some subsequent emails. What ends up in the final article is a bit of hit and miss. This time, I’m pretty happy with the outcome.

Some highlights and comments:

  • My physics envy is in full display: “We teach 1960’s science like physics and chemistry in school, but you don’t learn more about algorithms now than 20 years ago.” This statement is of course not universally true; many countries do have programming and other computer science as part of the high school curriculum. But even with that proviso I think it’s safe to say that “what science should be thought in school” has changed relatively little since Sputnik. But the point of the article is of course something else: algorithms have an effect on culture (and democracy and whatnot), so the bold claim is that algorithmic thinking should be part of the Social Science curriculum. (It goes without saying that recursion ought to stand proudly next to Pascal’s law in the Science and Maths curriculum as well.)
  • The article contains a nice-looking attempt at actually Explaining What’s Going On. You can see some pseudocode in the web version, where nice-looking brand icons serve as objects that are polled for various signals about previous user behaviour. There is even a collaborative filtering aspect (since the hypothetical Facebook object is queried about what your friends like). Given the constraints under which such a page is produced, I think this came out well. The real point is of course that this code is not written by a human, but by another algorithm, so “the algorithm” performs at a meta-level. I did not find a good way to communicate that, so I didn’t try.
  • The code on the section’s front page is from the very nice data mining research done at ITU: Andrea Campagna and Rasmus Pagh, Finding Associations and Computing Similarity via Biased Pair Sampling, The Ninth IEEE International Conference on Data Mining, Miami, Florida, USA, 6-9 December 2009 pp 61-70. Proper indentation or choice of a non-proportional font is not something the newspaper graphics guy cares about. (Hell, my own students don’t.) I proudly showed this to ITU’s vice chancellor Mads Tofte, but he “would have been more impressed if it had been in ML.”
  • Lest somebody misattributes pithy quotations to quotable me, the phrase “program or be programmed” is from Douglas Rushkoff.

The Most Important Movies Of All Time

Does this sound familiar to you? You are watching a Simpsons episode with friends. Bart, waking up from an out-of-body experience where he visited Hell, says,

“I was miles and miles and miles away, writhing in agony in the pits of Hell! And you were there, and you and you and you.”

Everybody snickers knowingly and lets out a satisfied sigh. Ah, the subtext!

But you? You just missed another movie reference! Social ostracism awaits.

But fret not! There are not that many movies, and in fact most of them tend to cite the same classics over and over again. So, as a special service, her is the list of the most important movies ever made, ordered by a carefully constructed principle I call FilmRank. The ranking is the result of an algorithm, similar to what Google uses to rank web pages, and based on the Internet Movie Database’s list of movie references. I haven’t tinkered with the outcome. If you want to read more about the methodology, look at my related post

So, without further ado, here’s the list:

  1. The Wizard of Oz (1939)
  2. Shadow of a Doubt (1943)
  3. King Kong (1933)
  4. The Merry Widow (1934)
  5. Psycho (1960)
  6. Cabiria (1914)
  7. Star Wars (1977)
  8. Metropolis (1927)
  9. Snow White and the Seven Dwarfs (1937)
  10. Casablanca (1942)
  11. 2001: A Space Odyssey (1968)
  12. The Birth of a Nation (1915)
  13. Snow White (1916/I)
  14. Chang: A Drama of the Wilderness (1927)
  15. Frankenstein (1931)
  16. Vertigo (1958)
  17. Jaws (1975)
  18. The Godfather (1972)
  19. Gone with the Wind (1939)
  20. The Lost World (1925)
  21. Dr. No (1962)
  22. The Searchers (1956)
  23. A Clockwork Orange (1971)
  24. Citizen Kane (1941)
  25. Nosferatu, eine Symphonie des Grauens (1922)
  26. Rosemary’s Baby (1968)
  27. Night of the Living Dead (1968)
  28. Bronenosets Potyomkin (1925)
  29. The Exorcist (1973)
  30. The Terminator (1984)
  31. La dolce vita (1960)
  32. The Scarecrow (1920)
  33. Un chien andalou (1929)
  34. Tarzan the Ape Man (1932)
  35. Taxi Driver (1976)
  36. The Wizard of Oz (1925)
  37. The Poor Little Rich Girl (1917)
  38. The Battle at Elderbush Gulch (1913)
  39. Dracula (1931)
  40. Il buono, il brutto, il cattivo. (1966)
  41. Shichinin no samurai (1954)
  42. College (1927)
  43. Apocalypse Now (1979)
  44. His Majesty, the Scarecrow of Oz (1914)
  45. La grande illusion (1937)
  46. The General (1926)
  47. The Lost Patrol (1934)
  48. Singin’ in the Rain (1952)
  49. E.T.: The Extra-Terrestrial (1982)
  50. The Great Train Robbery (1903)

Amazingly, it’s pretty good! There are some really old movies in there because the methodology will tend to favour them in an ever-decreasing pool of available references. But, if I use the same damping factor that Google uses for ranking web pages, this is what you get.

For fun, we can decrease this mysterious factor from .85 to .45. This will decrease the value of long chains of references, giving films like Star Wars more of a chance. Star Wars has tons of references (half a thousand other movies reference it, many more than Oz), but most of them are recent, and haven’t yet amassed the authority that comes from having generations of other movies refer to you. So here’s the second, more modern list:

  1. The Wizard of Oz (1939)
  2. Star Wars (1977)
  3. Psycho (1960)
  4. King Kong (1933)
  5. Shadow of a Doubt (1943)
  6. Casablanca (1942)
  7. 2001: A Space Odyssey (1968)
  8. The Godfather (1972)
  9. Snow White and the Seven Dwarfs (1937)
  10. Metropolis (1927)
  11. Jaws (1975)
  12. Gone with the Wind (1939)
  13. Frankenstein (1931)
  14. Cabiria (1914)
  15. The Birth of a Nation (1915)
  16. The Merry Widow (1934)
  17. Vertigo (1958)
  18. A Clockwork Orange (1971)
  19. Night of the Living Dead (1968)
  20. Citizen Kane (1941)
  21. The Exorcist (1973)
  22. Taxi Driver (1976)
  23. The Terminator (1984)
  24. Rosemary’s Baby (1968)
  25. Apocalypse Now (1979)
  26. E.T.: The Extra-Terrestrial (1982)
  27. Il buono, il brutto, il cattivo. (1966)
  28. Dr. No (1962)
  29. Snow White (1916/I)
  30. Nosferatu, eine Symphonie des Grauens (1922)
  31. Bronenosets Potyomkin (1925)
  32. The Searchers (1956)
  33. Dracula (1931)
  34. Chang: A Drama of the Wilderness (1927)
  35. Tarzan the Ape Man (1932)
  36. Scarface (1983)
  37. Un chien andalou (1929)
  38. Shichinin no samurai (1954)
  39. La dolce vita (1960)
  40. The Shining (1980)
  41. Singin’ in the Rain (1952)
  42. The Texas Chain Saw Massacre (1974)
  43. Star Wars: Episode V – The Empire Strikes Back (1980)
  44. The Lost World (1925)
  45. Pulp Fiction (1994)
  46. Rocky (1976)
  47. Halloween (1978)
  48. It’s a Wonderful Life (1946)
  49. The Wizard of Oz (1925)
  50. Deliverance (1972)

As you see, little Dorothy still wins, but Darth Vader breathes down her neck! I like this list a bit better than the previous, so .45 seems to be the right damping factor for this particular network. Of course, there are many other PageRank-like algorithms in the literature, each with obscure constants to fiddle with. (Google used a long time fiddling with these things before they went public.) But I assume you’ll end up with pretty much the same movies as on the FilmRank lists.

I find the FilmRank lists just as inspiring and authoritative as many similar lists compiled by movie critics or fan votes. But this one is produced completely by machine, no human heart has acted as a mediator. Maybe that’s why Metropolis scores so high?

How FilmRank Works

Ranking—in parcticular, the ranking of information such as web pages—is one of the most visible effects of algorithms to people’s lives. Imagine navigating the internet like we did in 1994, before Google’s PageRank algorithm!

I enjoy explaining how Google works; in fact I have a popular science presentation,

that I use quite a lot.

I think I’ve found a new way of explaining the essence of the PageRank idea, using movies. The outcome, a ranking of all films ever made, is interesting in itself.

Here’s how it works.

Movies can refer to other movies. When Mary McFly in Back to the Future III finds himself alone with a gun and a mirror, he quotes De Niro’s character in Taxi Driver,

“You talkin’ to me? You talkin’ to me? You talkin’ to me? Then who the hell else are you talkin’ to? You talkin’ to me? Well I’m the only one here. Who the fuck do you think you’re talking to?”

In fact, the line has an entire Wikipedia article.
And De Niro’s Army jacket has a patch reading “King Kong Company 1968-1970”, referencing a classic 1933 movie about another monster on the rampage in New York. (This reference is quite deliberate, the patch is described on page 1 in the script, instead of being a random detail inserted by a costume designer.)

Like the basic idea of PageRank, we can use these references as a measure of quality. A film must be important if other filmmakers want to reference it. Like all measures of quality, it is open to ridicule and skepticism. But it is different from other measures, such as box office success or even critical reception. With FilmRank, I put my trust in the good tastes of other filmmakers, just as PageRank puts its trust into other web pages: PageRank is based on incoming links, not on number of visitors.

If you just want to see the results and don’t care about the methodology, read my companion post

Weighted References

What I like about FilmRank is that I can explain the weighting. Assume first that I just count the number of references to a movie. Which film wins? What is the most referenced film of all time? Well, according to the Internet Movie Database, here are the top three:

  1. Star Wars (1977), 570 references
  2. The Wizard of Oz (1939), 554 references
  3. Psycho (1960), 360 references

The oft-cited shower scene from Alfred Hitchcock’s Psycho

In this list, every reference counts the same. E.T.’s reference to Star Wars counts just as much as Eragon’s, even though the former is a much more important movie than the latter.

This is where the weighting comes in. FilmRank works in rounds, as follows. Initially, all films (there are well over 60,000) receive the same rank. Then, each film inherits rank from the films referencing it. For example, if A references B and C, it confers rank in equal measures to B and C, increasing their rank. If these films in turn refer to other films, the rank gets passed on and on and on.

An equivalent way of seeing this is the random moviegoer model. Start viewing some random movie. After it’s done, randomly select on of the movies it referenced and go see that one. Keep doing that hundreds of thousands of times. (If there were no references in the last movie, start over.) In the end, you end up watching certain movies again and again. This frequency is the FilmRank. There is a certain probability (Google uses 85% for PageRank), called the damping factor that you grow tired of your current project and start all over anyway.

Quite formally, the behaviour is a Markov chain modelling a random walk on the directed graph whose vertices are movies, and whose edges are references. The FilmRank is the stable distribution of that random process.

Comparison with PageRank

Dorothy in stunning colour, but not in Kansas

Remember that this is only for fun, and to explain how PageRank works. (Though I like the results.) The huge difference between PageRank and FilmRank is that PageRank uses no human manipulation. The links between web pages are already there, and not subject to interpretation. In contrast, it is a matter of opinion whether a reference to another movie is intentional or not. “You’re not in Kansas anymore” is certainly deliberate, but are all quotes of “I am your father” really meant to remind you of Darth Vader?

Thus, FilmRank loses some of its purity, as well as its efficiency. Moreover, it can be gambled by maliciously editing the underlying database. Moreover, FilmRank is gloriously unnecessary: there are not very many films (only tens of thousands, with few being added every day), so human intervention suffices to rank them and make authoritative lists of top films based on individual movie critics’s taste. Not so with the web: its size made it impossible to rank by the middle of the 1990s, so a fully automatic system like PageRank is essential if we are to be able to navigate it.

Finally, there’s a huge topological difference between the network of linked web pages and the network of film references: The web is very cyclic, whereas films cannot refer to films “in the future”. You can refer to The Time Machine, but you can’t build one. This means that FilmRank will tend to favour older movies, simply by the fact that they are old enough to have assembled lots of references. However, I find this quite refreshing; many other formal measures of quality (such as box office success) will artificially favour newer movies because of inflation.


So here’s what I did. The Internet Movie Database maintains exactly the list of movie references we need, and is freely available and can be downloaded. I decided to throw out TV shows and video games, and to honour the connections “references”, “features”, and “remake of”. Then it was a simple matter of coding a mindless implementation of the algorithm in Python:

alpha = .85 # damping factor

q= {}
for i in range(20):
    for v in G: q[v]= 0
    r = 0 # will be sum p[v] over all dangling vertices
    for v in G:
        q[v] += p[v] * (1-alpha)
        if dout[v] == 0:                     # dangling v? then all w get some..
            r += p[v]
        else:                              # ..else v's successors share
            delta=alpha* p[v]/dout[v]
            for w in out[v]: q[w] += delta
    for v in G: q[v] += r/n     # add contribution of all dangling vertices
    p= q.copy()

Probably not very good Python, comments are welcome. The reason to accumulate the contributions of the dangling vertices (instead of spreading them out over the rest of the graph) is of course to avoid quadratic running time. Also, PageRank seems to stabilise to its final distribution after a handful of iterations, so 20 iterations must be enough for our relatively small graph. So I think there’s no need for fancy-shmancy sparse matrix algebra.