Ranking—in parcticular, the ranking of information such as web pages—is one of the most visible effects of algorithms to people’s lives. Imagine navigating the internet like we did in 1994, before Google’s PageRank algorithm!
I enjoy explaining how Google works; in fact I have a popular science presentation,
that I use quite a lot.
I think I’ve found a new way of explaining the essence of the PageRank idea, using movies. The outcome, a ranking of all films ever made, is interesting in itself.
Here’s how it works.
Movies can refer to other movies. When Mary McFly in Back to the Future III finds himself alone with a gun and a mirror, he quotes De Niro’s character in Taxi Driver,
“You talkin’ to me? You talkin’ to me? You talkin’ to me? Then who the hell else are you talkin’ to? You talkin’ to me? Well I’m the only one here. Who the fuck do you think you’re talking to?”
In fact, the line has an entire Wikipedia article.
And De Niro’s Army jacket has a patch reading “King Kong Company 1968-1970”, referencing a classic 1933 movie about another monster on the rampage in New York. (This reference is quite deliberate, the patch is described on page 1 in the script, instead of being a random detail inserted by a costume designer.)
Like the basic idea of PageRank, we can use these references as a measure of quality. A film must be important if other filmmakers want to reference it. Like all measures of quality, it is open to ridicule and skepticism. But it is different from other measures, such as box office success or even critical reception. With FilmRank, I put my trust in the good tastes of other filmmakers, just as PageRank puts its trust into other web pages: PageRank is based on incoming links, not on number of visitors.
If you just want to see the results and don’t care about the methodology, read my companion post
What I like about FilmRank is that I can explain the weighting. Assume first that I just count the number of references to a movie. Which film wins? What is the most referenced film of all time? Well, according to the Internet Movie Database, here are the top three:
- Star Wars (1977), 570 references
- The Wizard of Oz (1939), 554 references
- Psycho (1960), 360 references
The oft-cited shower scene from Alfred Hitchcock’s Psycho
In this list, every reference counts the same. E.T.
’s reference to Star Wars
counts just as much as Eragon
’s, even though the former is a much more important movie than the latter.
This is where the weighting comes in. FilmRank works in rounds, as follows. Initially, all films (there are well over 60,000) receive the same rank. Then, each film inherits rank from the films referencing it. For example, if A references B and C, it confers rank in equal measures to B and C, increasing their rank. If these films in turn refer to other films, the rank gets passed on and on and on.
An equivalent way of seeing this is the random moviegoer model. Start viewing some random movie. After it’s done, randomly select on of the movies it referenced and go see that one. Keep doing that hundreds of thousands of times. (If there were no references in the last movie, start over.) In the end, you end up watching certain movies again and again. This frequency is the FilmRank. There is a certain probability (Google uses 85% for PageRank), called the damping factor that you grow tired of your current project and start all over anyway.
Quite formally, the behaviour is a Markov chain modelling a random walk on the directed graph whose vertices are movies, and whose edges are references. The FilmRank is the stable distribution of that random process.
Comparison with PageRank
Dorothy in stunning colour, but not in Kansas
Remember that this is only for fun, and to explain how PageRank works. (Though I like the results.) The huge difference between PageRank and FilmRank is that PageRank uses no human manipulation. The links between web pages are already there, and not subject to interpretation. In contrast, it is a matter of opinion whether a reference to another movie is intentional or not. “You’re not in Kansas anymore” is certainly deliberate, but are all quotes of “I am your father” really meant to remind you of Darth Vader?
Thus, FilmRank loses some of its purity, as well as its efficiency. Moreover, it can be gambled by maliciously editing the underlying database. Moreover, FilmRank is gloriously unnecessary: there are not very many films (only tens of thousands, with few being added every day), so human intervention suffices to rank them and make authoritative lists of top films based on individual movie critics’s taste. Not so with the web: its size made it impossible to rank by the middle of the 1990s, so a fully automatic system like PageRank is essential if we are to be able to navigate it.
Finally, there’s a huge topological difference between the network of linked web pages and the network of film references: The web is very cyclic, whereas films cannot refer to films “in the future”. You can refer to The Time Machine, but you can’t build one. This means that FilmRank will tend to favour older movies, simply by the fact that they are old enough to have assembled lots of references. However, I find this quite refreshing; many other formal measures of quality (such as box office success) will artificially favour newer movies because of inflation.
So here’s what I did. The Internet Movie Database maintains exactly the list of movie references we need, and is freely available and can be downloaded. I decided to throw out TV shows and video games, and to honour the connections “references”, “features”, and “remake of”. Then it was a simple matter of coding a mindless implementation of the algorithm in Python:
alpha = .85 # damping factor
for i in range(20):
for v in G: q[v]= 0
r = 0 # will be sum p[v] over all dangling vertices
for v in G:
q[v] += p[v] * (1-alpha)
if dout[v] == 0: # dangling v? then all w get some..
r += p[v]
else: # ..else v's successors share
for w in out[v]: q[w] += delta
for v in G: q[v] += r/n # add contribution of all dangling vertices
Probably not very good Python, comments are welcome. The reason to accumulate the contributions of the dangling vertices (instead of spreading them out over the rest of the graph) is of course to avoid quadratic running time. Also, PageRank seems to stabilise to its final distribution after a handful of iterations, so 20 iterations must be enough for our relatively small graph. So I think there’s no need for fancy-shmancy sparse matrix algebra.