Category Archives: Exposition

In the 1950s, we were told that we needed the Bomb to protect us from totalitarianism. Today, we’re told that we need totalitarianism to protect us from the Bomb.

Edit to add: I’m told that this quote isn’t pithy enough. It doesn’t tweet. Here’s a shortened version for the superficial times we live in. Barbaric, I now…

In the 50s we were told the Bomb protected us from totalitarianism. Now we’re told totalitarianism protects us from the Bomb.

Oscar Wilde never had that problem.

E-rösta är en dålig idé

(Ursprungligt publicerad 28 april 2013 på Sydsvenskan)

Det är en enorm skillnad mellan att shoppa, deklarera och sköta sina bankaffärer via nätet och att rösta i ett demokratiskt val. Det skriver Thore Husfeldt, professor i datalogi, och tar avstånd från vallagskommitténs förslag om e-röstning.
En majoritet i vallagskommittén har föreslagit att elektronisk röstning ska ske på försök i ett antal svenska kommuner under valet 2018. Allt annat på nätet sägs ju fungera utmärkt och smidigt, så varför ska vi inte e-rösta? Vi kan ju e-deklarera, e-handla och e-banka.

Och visst låter det bra.

Men det är en enorm skillnad mellan att shoppa, deklarera och sköta sina bankaffärer via nätet och att rösta i ett demokratiskt val.

Det är sällan vi dataloger avfärdar informationsteknologi. Men på den här punkten är vi överens:

Öppna och hemliga val är olämpliga föremål för digitalisering, oavsett om det skulle ske via internet eller med hjälp av maskiner i vallokalen.

Varför då? Det finns många anledningar, men låt mig redogöra för två:

Hemlighet och transparens.

Ta banktransaktioner – de går att sköta digitalt eftersom de inte är hemliga. Visst uppstår fel och problem, men både kund och bank får kvitton som de kan jämföra med kontoutdragen. Ingen behöver blint lita på att transaktionen lyckades, eftersom resultatet kan kontrolleras i efterhand. När ett fel uppstår, blir det upptäckt och kan åtgärdas.

Hemliga val är den diametrala motsatsen till banktransaktioner:

Väljaren får inget kvitto (annars vore valet inte fritt) och vallokalen får inget kvitto (annars vore valet inte anonymt).

Kritiken mot e-röstning är alltså domänspecifik och har ingenting med samhällets generella digitala kompetens att göra. Det finns gott om icke-hemliga val, till exempel när riksdagen voterar. Sådana val kan man digitalisera: små lampor i riksdagen visar varje ledamots röst och rösträkningen sker elektroniskt.

Den andra anledningen till att e-röstande är oacceptabelt i en demokrati är bristen på transparens.

Vi litar på nuvarande valsystem eftersom alla kan övertyga sig om att röstningen går rätt till. Hela processen är öppen och observeras av en rad individer: dig, mig, fru Svensson eller en valobservatör. Bland annat visar man upp den tomma valurnan på valmorgonen för att skydda processen mot i förväg påfyllda valurnor.

I många länder används rentav genomskinliga röstlådor – där kan man tala om bokstavlig transparens.

Men hur ser då den digitala motsvarigheten till denna kontroll ut?

Den är betydligt mer komplicerad, i själva verket omöjlig.

Förutsatt att e-röstsystemet använder öppen källkod kan i alla fall den teknologiska eliten, det vill säga ett mycket litet fåtal, läsa koden och se att det står set votecount = 0 på rad 541 548.

Men programvara är ett komplext system. Om jag skulle granska vore jag tvungen att läsa hela programmet för att försäkra mig om att rad 541 548 faktiskt utförs vid rätt tidpunkt, och att ingen annan rad ändrar votecount på något klurigt sätt.

Det är i stort omöjligt att granska så mycket kod. Men även om det vore möjligt, så beror e-röstningssystemets funktion på en massa annat: kompilatorer, operativsystem, drivrutiner, ända ner till hårdvaran.

Det handlar om miljontals rader kod, skriven av tiotusen människor i ett dussin programspråk. Några rader är 40 år gamla, andra ändrades igår. Många innehåller fel på grund av slarv eller illvilja. Varje rad kan ändra votecount.

En armé av dataloger skulle behöva decennier för att granska allt. Men även om vi hade lyckats, skulle vi inte ha någon garanti för att det är det granskade systemet som faktiskt används på valdagen. Det kan ha bytts ut natten innan.

Det är alltså inte så att bara experter kan observera och kontrollera ett elektroniskt val – inte ens experter kan det.

Inga tekniska framsteg kan ändra på detta. Tvärtom blir informationsteknologin mer och mer komplex och den digitala valurnan mer och mer ogenomskinlig.

Så e-röstning är som allt annat på nätet. Det är varken hemligt eller genomskådligt.

Views of Impagliazzo’s Five Worlds

I gave a popular science talk this morning about which computational world we live in. The main conceit was to couple the question about the existence of efficient algorithms for NP-hard problems to utopias in science fiction and the technological singularity. I think this worked out pretty well.

The talk was recorded by Swedish TV and will be broadcast somewhen during the Spring of 2012 on their education and facts channel Kunskapskanalen. Because of the potentially high impact I decided to apply a bit more spit and polish to the visual side of the talk. In particular, I made 5 posters to represent Impagliazzo’s five worlds of computation. These worlds provide a metaphor that I find more intellectually stimulating than the relatively stuffy formulation, “Is the complexity class P equal to the complexity class NP?”.

I think some of them came out well, even typographically. (Five points for guessing which typeface was used for Heuristica.) I hereby release them under the CC BY-SA 3.0 license; maybe somebody else can use them in a similar talk. The talk was in Swedish (or at least in whatever language I use to approximate Swedish); if an anglophone wants Kryptomanien changed to Cryptomania, or wants the PDFs, I’m happy to oblige.

Next up: mugs, mouse mats, lingerie, and a perfume series of five fragrances.

References

Five worlds of computation are from Russell Impagliazzo: A Personal View of Average-Case Complexity. Structure in Complexity Theory Conference 1995: 134-147. [online PS]

Image sources: The dice are by ed_g2s at Wikimedia Commons. HAL 9000 is by Cryteria at Wikimedia Commons.

More Algorithms in Svenska Dagbladet

“Coded cultural choice” on the cover of SvD’s Arts section, illustrated by badly indented Java code.

My quixotic attempts to insert computer science into the memetic space of the literati continue to bear fruit. Today’s score is 3 pages in Sweden’s major newspaper Svenska Dagbladet. Of course, this only works on a day where the entire cultural elite is still hung over from actually having a life at New Years’s Eve, so there is a narrow window of opportunity where no other culture is being produced.

The article is about the effect of algorithms on the consumption of culture. One of the arguments that I’ve been trying to make for a while is that (a) the human condition is increasingly determined by access to information (b) the main force affecting information is algorithms. Thus algorithms are a worth understanding for other reasons than (a) intellectual curiosity or (b) technological progress.

The web version of the article is online at svd.se.

Such an article are the result of an interview that takes over an hour, and where I say a lot of things about a lot of things, and some subsequent emails. What ends up in the final article is a bit of hit and miss. This time, I’m pretty happy with the outcome.

Some highlights and comments:

  • My physics envy is in full display: “We teach 1960’s science like physics and chemistry in school, but you don’t learn more about algorithms now than 20 years ago.” This statement is of course not universally true; many countries do have programming and other computer science as part of the high school curriculum. But even with that proviso I think it’s safe to say that “what science should be thought in school” has changed relatively little since Sputnik. But the point of the article is of course something else: algorithms have an effect on culture (and democracy and whatnot), so the bold claim is that algorithmic thinking should be part of the Social Science curriculum. (It goes without saying that recursion ought to stand proudly next to Pascal’s law in the Science and Maths curriculum as well.)
  • The article contains a nice-looking attempt at actually Explaining What’s Going On. You can see some pseudocode in the web version, where nice-looking brand icons serve as objects that are polled for various signals about previous user behaviour. There is even a collaborative filtering aspect (since the hypothetical Facebook object is queried about what your friends like). Given the constraints under which such a page is produced, I think this came out well. The real point is of course that this code is not written by a human, but by another algorithm, so “the algorithm” performs at a meta-level. I did not find a good way to communicate that, so I didn’t try.
  • The code on the section’s front page is from the very nice data mining research done at ITU: Andrea Campagna and Rasmus Pagh, Finding Associations and Computing Similarity via Biased Pair Sampling, The Ninth IEEE International Conference on Data Mining, Miami, Florida, USA, 6-9 December 2009 pp 61-70. Proper indentation or choice of a non-proportional font is not something the newspaper graphics guy cares about. (Hell, my own students don’t.) I proudly showed this to ITU’s vice chancellor Mads Tofte, but he “would have been more impressed if it had been in ML.”
  • Lest somebody misattributes pithy quotations to quotable me, the phrase “program or be programmed” is from Douglas Rushkoff.

The Most Important Movies Of All Time

Does this sound familiar to you? You are watching a Simpsons episode with friends. Bart, waking up from an out-of-body experience where he visited Hell, says,

“I was miles and miles and miles away, writhing in agony in the pits of Hell! And you were there, and you and you and you.”

Everybody snickers knowingly and lets out a satisfied sigh. Ah, the subtext!

But you? You just missed another movie reference! Social ostracism awaits.

But fret not! There are not that many movies, and in fact most of them tend to cite the same classics over and over again. So, as a special service, her is the list of the most important movies ever made, ordered by a carefully constructed principle I call FilmRank. The ranking is the result of an algorithm, similar to what Google uses to rank web pages, and based on the Internet Movie Database’s list of movie references. I haven’t tinkered with the outcome. If you want to read more about the methodology, look at my related post

So, without further ado, here’s the list:

  1. The Wizard of Oz (1939)
  2. Shadow of a Doubt (1943)
  3. King Kong (1933)
  4. The Merry Widow (1934)
  5. Psycho (1960)
  6. Cabiria (1914)
  7. Star Wars (1977)
  8. Metropolis (1927)
  9. Snow White and the Seven Dwarfs (1937)
  10. Casablanca (1942)
  11. 2001: A Space Odyssey (1968)
  12. The Birth of a Nation (1915)
  13. Snow White (1916/I)
  14. Chang: A Drama of the Wilderness (1927)
  15. Frankenstein (1931)
  16. Vertigo (1958)
  17. Jaws (1975)
  18. The Godfather (1972)
  19. Gone with the Wind (1939)
  20. The Lost World (1925)
  21. Dr. No (1962)
  22. The Searchers (1956)
  23. A Clockwork Orange (1971)
  24. Citizen Kane (1941)
  25. Nosferatu, eine Symphonie des Grauens (1922)
  26. Rosemary’s Baby (1968)
  27. Night of the Living Dead (1968)
  28. Bronenosets Potyomkin (1925)
  29. The Exorcist (1973)
  30. The Terminator (1984)
  31. La dolce vita (1960)
  32. The Scarecrow (1920)
  33. Un chien andalou (1929)
  34. Tarzan the Ape Man (1932)
  35. Taxi Driver (1976)
  36. The Wizard of Oz (1925)
  37. The Poor Little Rich Girl (1917)
  38. The Battle at Elderbush Gulch (1913)
  39. Dracula (1931)
  40. Il buono, il brutto, il cattivo. (1966)
  41. Shichinin no samurai (1954)
  42. College (1927)
  43. Apocalypse Now (1979)
  44. His Majesty, the Scarecrow of Oz (1914)
  45. La grande illusion (1937)
  46. The General (1926)
  47. The Lost Patrol (1934)
  48. Singin’ in the Rain (1952)
  49. E.T.: The Extra-Terrestrial (1982)
  50. The Great Train Robbery (1903)

Amazingly, it’s pretty good! There are some really old movies in there because the methodology will tend to favour them in an ever-decreasing pool of available references. But, if I use the same damping factor that Google uses for ranking web pages, this is what you get.

For fun, we can decrease this mysterious factor from .85 to .45. This will decrease the value of long chains of references, giving films like Star Wars more of a chance. Star Wars has tons of references (half a thousand other movies reference it, many more than Oz), but most of them are recent, and haven’t yet amassed the authority that comes from having generations of other movies refer to you. So here’s the second, more modern list:

  1. The Wizard of Oz (1939)
  2. Star Wars (1977)
  3. Psycho (1960)
  4. King Kong (1933)
  5. Shadow of a Doubt (1943)
  6. Casablanca (1942)
  7. 2001: A Space Odyssey (1968)
  8. The Godfather (1972)
  9. Snow White and the Seven Dwarfs (1937)
  10. Metropolis (1927)
  11. Jaws (1975)
  12. Gone with the Wind (1939)
  13. Frankenstein (1931)
  14. Cabiria (1914)
  15. The Birth of a Nation (1915)
  16. The Merry Widow (1934)
  17. Vertigo (1958)
  18. A Clockwork Orange (1971)
  19. Night of the Living Dead (1968)
  20. Citizen Kane (1941)
  21. The Exorcist (1973)
  22. Taxi Driver (1976)
  23. The Terminator (1984)
  24. Rosemary’s Baby (1968)
  25. Apocalypse Now (1979)
  26. E.T.: The Extra-Terrestrial (1982)
  27. Il buono, il brutto, il cattivo. (1966)
  28. Dr. No (1962)
  29. Snow White (1916/I)
  30. Nosferatu, eine Symphonie des Grauens (1922)
  31. Bronenosets Potyomkin (1925)
  32. The Searchers (1956)
  33. Dracula (1931)
  34. Chang: A Drama of the Wilderness (1927)
  35. Tarzan the Ape Man (1932)
  36. Scarface (1983)
  37. Un chien andalou (1929)
  38. Shichinin no samurai (1954)
  39. La dolce vita (1960)
  40. The Shining (1980)
  41. Singin’ in the Rain (1952)
  42. The Texas Chain Saw Massacre (1974)
  43. Star Wars: Episode V – The Empire Strikes Back (1980)
  44. The Lost World (1925)
  45. Pulp Fiction (1994)
  46. Rocky (1976)
  47. Halloween (1978)
  48. It’s a Wonderful Life (1946)
  49. The Wizard of Oz (1925)
  50. Deliverance (1972)

As you see, little Dorothy still wins, but Darth Vader breathes down her neck! I like this list a bit better than the previous, so .45 seems to be the right damping factor for this particular network. Of course, there are many other PageRank-like algorithms in the literature, each with obscure constants to fiddle with. (Google used a long time fiddling with these things before they went public.) But I assume you’ll end up with pretty much the same movies as on the FilmRank lists.

I find the FilmRank lists just as inspiring and authoritative as many similar lists compiled by movie critics or fan votes. But this one is produced completely by machine, no human heart has acted as a mediator. Maybe that’s why Metropolis scores so high?

How FilmRank Works

Ranking—in parcticular, the ranking of information such as web pages—is one of the most visible effects of algorithms to people’s lives. Imagine navigating the internet like we did in 1994, before Google’s PageRank algorithm!

I enjoy explaining how Google works; in fact I have a popular science presentation,

that I use quite a lot.

I think I’ve found a new way of explaining the essence of the PageRank idea, using movies. The outcome, a ranking of all films ever made, is interesting in itself.

Here’s how it works.

Movies can refer to other movies. When Mary McFly in Back to the Future III finds himself alone with a gun and a mirror, he quotes De Niro’s character in Taxi Driver,

“You talkin’ to me? You talkin’ to me? You talkin’ to me? Then who the hell else are you talkin’ to? You talkin’ to me? Well I’m the only one here. Who the fuck do you think you’re talking to?”

In fact, the line has an entire Wikipedia article.
And De Niro’s Army jacket has a patch reading “King Kong Company 1968-1970”, referencing a classic 1933 movie about another monster on the rampage in New York. (This reference is quite deliberate, the patch is described on page 1 in the script, instead of being a random detail inserted by a costume designer.)

Like the basic idea of PageRank, we can use these references as a measure of quality. A film must be important if other filmmakers want to reference it. Like all measures of quality, it is open to ridicule and skepticism. But it is different from other measures, such as box office success or even critical reception. With FilmRank, I put my trust in the good tastes of other filmmakers, just as PageRank puts its trust into other web pages: PageRank is based on incoming links, not on number of visitors.

If you just want to see the results and don’t care about the methodology, read my companion post

Weighted References

What I like about FilmRank is that I can explain the weighting. Assume first that I just count the number of references to a movie. Which film wins? What is the most referenced film of all time? Well, according to the Internet Movie Database, here are the top three:

  1. Star Wars (1977), 570 references
  2. The Wizard of Oz (1939), 554 references
  3. Psycho (1960), 360 references

The oft-cited shower scene from Alfred Hitchcock’s Psycho


In this list, every reference counts the same. E.T.’s reference to Star Wars counts just as much as Eragon’s, even though the former is a much more important movie than the latter.

This is where the weighting comes in. FilmRank works in rounds, as follows. Initially, all films (there are well over 60,000) receive the same rank. Then, each film inherits rank from the films referencing it. For example, if A references B and C, it confers rank in equal measures to B and C, increasing their rank. If these films in turn refer to other films, the rank gets passed on and on and on.

An equivalent way of seeing this is the random moviegoer model. Start viewing some random movie. After it’s done, randomly select on of the movies it referenced and go see that one. Keep doing that hundreds of thousands of times. (If there were no references in the last movie, start over.) In the end, you end up watching certain movies again and again. This frequency is the FilmRank. There is a certain probability (Google uses 85% for PageRank), called the damping factor that you grow tired of your current project and start all over anyway.

Quite formally, the behaviour is a Markov chain modelling a random walk on the directed graph whose vertices are movies, and whose edges are references. The FilmRank is the stable distribution of that random process.

Comparison with PageRank

Dorothy in stunning colour, but not in Kansas

Remember that this is only for fun, and to explain how PageRank works. (Though I like the results.) The huge difference between PageRank and FilmRank is that PageRank uses no human manipulation. The links between web pages are already there, and not subject to interpretation. In contrast, it is a matter of opinion whether a reference to another movie is intentional or not. “You’re not in Kansas anymore” is certainly deliberate, but are all quotes of “I am your father” really meant to remind you of Darth Vader?

Thus, FilmRank loses some of its purity, as well as its efficiency. Moreover, it can be gambled by maliciously editing the underlying database. Moreover, FilmRank is gloriously unnecessary: there are not very many films (only tens of thousands, with few being added every day), so human intervention suffices to rank them and make authoritative lists of top films based on individual movie critics’s taste. Not so with the web: its size made it impossible to rank by the middle of the 1990s, so a fully automatic system like PageRank is essential if we are to be able to navigate it.

Finally, there’s a huge topological difference between the network of linked web pages and the network of film references: The web is very cyclic, whereas films cannot refer to films “in the future”. You can refer to The Time Machine, but you can’t build one. This means that FilmRank will tend to favour older movies, simply by the fact that they are old enough to have assembled lots of references. However, I find this quite refreshing; many other formal measures of quality (such as box office success) will artificially favour newer movies because of inflation.

Implementation

So here’s what I did. The Internet Movie Database maintains exactly the list of movie references we need, and is freely available and can be downloaded. I decided to throw out TV shows and video games, and to honour the connections “references”, “features”, and “remake of”. Then it was a simple matter of coding a mindless implementation of the algorithm in Python:


alpha = .85 # damping factor

q= {}
for i in range(20):
    for v in G: q[v]= 0
    r = 0 # will be sum p[v] over all dangling vertices
    for v in G:
        q[v] += p[v] * (1-alpha)
        if dout[v] == 0:                     # dangling v? then all w get some..
            r += p[v]
        else:                              # ..else v's successors share
            delta=alpha* p[v]/dout[v]
            for w in out[v]: q[w] += delta
    for v in G: q[v] += r/n     # add contribution of all dangling vertices
    p= q.copy()

Probably not very good Python, comments are welcome. The reason to accumulate the contributions of the dangling vertices (instead of spreading them out over the rest of the graph) is of course to avoid quadratic running time. Also, PageRank seems to stabilise to its final distribution after a handful of iterations, so 20 iterations must be enough for our relatively small graph. So I think there’s no need for fancy-shmancy sparse matrix algebra.