Zhuang Zhou. The butterfly effect.
 

Below are the 10 most recent journal entries recorded in zhouzhuang's LiveJournal:

    Saturday, November 10th, 2007
    8:05 am
    How to kick The Return of the King out of the IMDB Top 10

    The problem: some popular sites (e.g. www.imdb.com, www.rateyourmusic.com) let many users rate different items (movies, albums). The rating is absolute: e.g. you have to give a movie 1 to 10 points. Then the system computes the average (with a more or less sophisticated formula) of all the ratings. But this method gives an advantage to famous movies and albums. Why? Let's see an example [I assume, from here onwards, that every person rates items in good faith].

    Let's assume that 100 people rate some movies. Everyone uses the same rating scale: 10 for the best movie they have ever seen, 9 for very good movies, 8 for good movies. Now, we could have a situation like this:
    - 10 people give these ratings: Citizen Kane 10, Rear Window 9, The Return of the King (the third episode of The Lord of the Rings) 8
    - 10 people give these ratings: Rear Window 10, Citizen Kane 9, The Return of the King 8
    - 80 people give this rating: The Return of the King 10 (they have never seen any movie by Welles or Hitchcock)

    The result, considering the average rating (simple average, in this case), is:
    1. The Return of the King: 9.6
    2. Citizen Kane, Rear Window: 9.5

    But this is simply wrong, because every person who has seen Citizen Kane or Rear Window rates them higher than The Return of the King!
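The arithmetic of the example can be checked with a minimal sketch. The names BALLOTS and average_ratings are made up for this illustration, and the data is just the three groups of voters listed above.

```python
from collections import defaultdict

# Ballots from the example: (number of voters, {movie: rating}).
BALLOTS = [
    (10, {"Citizen Kane": 10, "Rear Window": 9, "The Return of the King": 8}),
    (10, {"Rear Window": 10, "Citizen Kane": 9, "The Return of the King": 8}),
    (80, {"The Return of the King": 10}),
]


def average_ratings(ballots):
    """Simple average of each movie's ratings, over the voters who rated it."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for voters, ratings in ballots:
        for movie, rating in ratings.items():
            totals[movie] += voters * rating
            counts[movie] += voters
    return {movie: totals[movie] / counts[movie] for movie in totals}


averages = average_ratings(BALLOTS)
for movie, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{movie}: {avg:.1f}")
```

The Return of the King comes out first only because 80 of its 100 ratings come from people who rated nothing else.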

    The solution: consider relative ratings, not absolute ones. Ask people to compare the item they are rating with some other items. So, this is like a match between movies, or albums. Now, let me quote my source of inspiration ( http://www.maasranking.nl/description.htm ):
    "The result will be a set of linear equations that is based on the number of "wins" against various Opponents. All those equations together constitute a Matrix. Futhermore a solution Vector is created that consists of all the teams' scores, calculated as: Number of Wins minus Number of Losses. The Rating scores are then obtained by inverting the Matrix and multiplying that inverted Matrix by the solution Vector."

    This way, Citizen Kane and Rear Window will be rated higher than The Return of the King. Less famous movies will no longer be disadvantaged.
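Here is a rough sketch of the first step only: turning each person's ratings into "matches" and computing wins minus losses for every movie (the "solution Vector" of the quote). The matrix inversion of the quoted description is not reproduced here. The names BALLOTS and net_wins are invented for this illustration, and treating each pair of movies rated by the same person as one match is my assumption.

```python
from collections import Counter
from itertools import combinations

# Ballots from the example: (number of voters, {movie: rating}).
BALLOTS = [
    (10, {"Citizen Kane": 10, "Rear Window": 9, "The Return of the King": 8}),
    (10, {"Rear Window": 10, "Citizen Kane": 9, "The Return of the King": 8}),
    (80, {"The Return of the King": 10}),
]


def net_wins(ballots):
    """Wins minus losses per movie, one match per voter per pair of rated movies."""
    score = Counter()
    for voters, ratings in ballots:
        # A voter who rated a single movie produces no match at all.
        for a, b in combinations(ratings, 2):
            if ratings[a] == ratings[b]:
                continue  # a tie gives no win to either movie
            winner, loser = (a, b) if ratings[a] > ratings[b] else (b, a)
            score[winner] += voters
            score[loser] -= voters
    return dict(score)


scores = net_wins(BALLOTS)
for movie, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{movie}: {score:+d}")
```

With the example data, the 80 people who rated only The Return of the King decide no match, so their votes no longer push it above the other two movies.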

    ---

    Bottom lines: I'm sorry for my poor use of the English language. If you know of anyone who has already had this idea, please tell me.

    [Edit] Further bottom line: the movies I cite are only examples. I haven't seen The Return of the King, so I have nothing against this movie!

    ---

    Update

    a. Thanks to eroticroger (IMDb forum), stole-this-acct and vomolbo (comments here: see below) for their criticisms and suggestions.

    b. Problems and (sometimes) solutions

    1.

    Two problems: novelty; and the most popular films get stuck in the middle of the rankings, because they attract more negative votes than less famous ones.

    Solution: 1. Let users rate movies or albums in the usual manner. 2. Then, deduce each user's preferences from those ratings. 3. Then, compute the global ranking, considering the preferences of all the users.

    2.

    Problem: required computing power.

    Solution: Elo system. I.e. every time a "match" between movies is decided by a user, the winning movie gains some points, and the losing one loses the same number of points; the number of points is proportional to the difference between the actual result and the expected result.

    Consequent problem: you can't change the result of a "match", i.e. you can't change your mind about which of two movies you prefer. So the use of Elo is not consistent with the solution given above, in point 1.
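A minimal sketch of one Elo "match" between two movies. The logistic expected-score formula, the scale of 400 and K = 32 are the usual chess conventions and are only assumptions here. The function names are invented for this illustration.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected result (between 0 and 1) of A against B under the usual Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(winner, loser, k=32.0):
    """Return the new (winner, loser) ratings after one decided match.

    The winner gains k * (actual - expected), with actual = 1 for a win,
    and the loser loses exactly the same number of points.
    """
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Two movies starting level: the preferred one takes 16 points from the other.
print(elo_update(1500.0, 1500.0))
# An upset (the lower-rated movie is preferred) moves more points.
print(elo_update(1400.0, 1600.0))
```

Each update needs only the two current ratings, which is why the required computing power stays low; it is also why a later change of mind cannot simply be undone.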

    3.

    Problem: abuse.

    Solution: give more weight to the users who in the past showed opinions similar to those of the majority. You can think of this as something similar to the HITS algorithm: users are hubs, movies are authorities. You can use an Elo system here too.

    Consequent problem: "advantage to the votes that were cast first and initially established the rankings" (eroticroger).

    c. Similar ideas: MoviePig, in some way.
    Thursday, November 8th, 2007
    8:11 am
    In Defense of the Troll

    A troll is just someone who doesn't think like you. OK, nobody thinks exactly like you, but the troll differs from you in a radical way: he doesn't agree with some of your basic principles, e.g. your ideas of (n)etiquette, dialectic, decorum. But remember: symmetrically, you don't agree with him, so you are a troll for him. Who said that your principles are better than other people's?

    But the real points in defense of the troll are:

    1. He is alone, you are many. If the trolls were the majority in a discussion, they wouldn't be trolls. The troll is always a minority, and still he keeps defending his position: he is a political hero.

    2. He lives for the discussion. Common wisdom says "don't feed the troll": because, without discussion, he's nothing. He is open to the world.

    ---

    Bottom lines: I'm sorry for my poor use of the English language. If you know of anyone who has already had this idea, please tell me.
    Monday, July 9th, 2007
    8:18 am
    Alternative title for this blog
    Sun Tzu boulevard.
    8:15 am
    How the casting for Harry Potter was done
    Tuesday, March 6th, 2007
    4:35 pm
    100 years, 100 celebrities, 2 super-celebrities.

    I've checked, for each of the 100 most important people of the 20th century (as determined by Time), which of the other 99 is most relevant to him/her. I used the simple algorithm I described in my previous post (and a Ruby program kindly written by Giuseppe "Oblomov" Bilotta, thanks!). The results are 1MB of data (e-mail me and I'll send it to you). But maybe an image is worth 1000 numbers:

    While some links are trivial, others are weird: maybe the latter are due to bugs in Google's page counts or, more probably, in my algorithm. But the result is: the 20th century was ruled by Winston Churchill and Marilyn Monroe.

    Or, from another point of view, the 20th century was the century of WWII and of the Society of the Spectacle.




    Update: you get better images if you consider the proximity of two persons, using this simple formula:
    pagesXY^2 / (pagesX * pagesY)
    The results are similar to Cilibrasi and Vitanyi's Normalized Google Distance (and to the other similar formulas), but the formula is simpler. The formula is also similar to "mutual information", but it gives different results.
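A small sketch of the proximity formula. The function name and the argument names are invented for this illustration. The page counts are the ones quoted for "Sergey Brin" and "Google" in the "Google knows it. All." entry, reused here only as sample numbers.

```python
def proximity(pages_xy, pages_x, pages_y):
    """pagesXY^2 / (pagesX * pagesY): symmetric in X and Y.

    If the joint count is never larger than either single count,
    the value stays between 0 and 1.
    """
    if pages_x <= 0 or pages_y <= 0:
        raise ValueError("page counts must be positive")
    return pages_xy ** 2 / (pages_x * pages_y)


# Sample counts: pages with both queries, pages with X, pages with Y.
brin_google = proximity(1_020_000, 1_090_000, 877_000_000)
google_brin = proximity(1_020_000, 877_000_000, 1_090_000)
print(brin_google, google_brin)
```

Unlike the relevance formula of the earlier post, this measure gives the same value in both directions, which is why every pair of nodes in the image is linked both ways.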

    See the image, showing the 100 pairs with the highest proximity. This image was automatically created (like the first one) by GoVisual Diagram Editor, with the "hierarchical" layout (even if in this case there is no hierarchy, because every pair of nodes is linked both ways):

    There are two large groups: above, spectacle; below, "serious stuff" (politics and culture). In the spectacle section, the left is for cinema, the right for music: note that Frank Sinatra is in the middle! In the "serious" section, you have all the intellectuals together. Politics is divided into two parts: the left part hosts people from the first half of the century; the right part is for the second half. But there are also other taxonomies, because you can see: national leaders (Reagan, Thatcher, Gorbachev, Mandela); the civil rights movement (Mandela, M.L. King, Jackie Robinson); Christians (M. Teresa, John Paul II, B. Graham); activist women (M. Teresa, Keller, E. Roosevelt). Guevara is near the civil rights movement, but he's also linked to Dylan and to two pop icons (Monroe and Lee).




    Update 2: another way to visualize the data in GoVisual Diagram Editor is the circular layout. I used it to visualize the top 100 links among Wikipedia core articles:

    Let's read it counterclockwise: life - human - education - society - health - government (law) - science (medicine) - history - community - information (number) - media - computer - personal - business - technology - energy - engineering (architecture) - communication - internet - time - day - life.

    There are two other little groups: one is "continents", because there are Europe, Africa, Asia, Australia; Europe is related to technology in the bigger group. The other group is "natural sciences": mathematics, physics, chemistry, biology, nuclear; mathematics and physics are related to engineering.

    Saturday, February 3rd, 2007
    4:48 pm
    Google knows it. All.

    The idea

    Formula to compute the relevance of a topic (expressed as a query for a search engine) in a context (also expressed as a query):

    ((pct/pc)^2)/(pt/pi)
    pct = (number of) pages with the two queries together
    pc = pages with the context-query
    pt = pages with the topic-query
    pi = pages indexed by the search engine

    Example
    context = "Sergey Brin"
    topic = "Google"
    pct = 1,020,000
    pc = 1,090,000
    pt = 877,000,000
    pi = 25,700,000,000
    relevance of Google for Sergey Brin = ((pct/pc)^2)/(pt/pi) ≈ 25.7
    relevance of Sergey Brin for Google ≈ 0.03 (i.e. speaking of Brin, you should cite Google, but the contrary is not true)
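The example can be recomputed with a few lines. The function name is invented for this illustration; the argument names follow the definitions above, and the numbers are the page counts of the example.

```python
def relevance(pct, pc, pt, pi):
    """((pct/pc)^2) / (pt/pi): relevance of a topic in a context.

    pct = pages with both queries, pc = pages with the context-query,
    pt = pages with the topic-query, pi = pages indexed by the search engine.
    """
    return (pct / pc) ** 2 / (pt / pi)


PAGES_BOTH = 1_020_000
PAGES_BRIN = 1_090_000
PAGES_GOOGLE = 877_000_000
PAGES_INDEXED = 25_700_000_000

# Context "Sergey Brin", topic "Google".
google_for_brin = relevance(PAGES_BOTH, PAGES_BRIN, PAGES_GOOGLE, PAGES_INDEXED)
# Context "Google", topic "Sergey Brin": the roles of pc and pt are swapped.
brin_for_google = relevance(PAGES_BOTH, PAGES_GOOGLE, PAGES_BRIN, PAGES_INDEXED)

print(f"{google_for_brin:.2f}")
print(f"{brin_for_google:.3f}")
```

The same joint count gives two very different values, because the fraction pct/pc depends on which query is taken as the context.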

    What you do is multiply the percentage of the topic in the context by the ratio of this percentage to the percentage of the topic in the entire web, i.e. (pct/pc) * ((pct/pc)/(pt/pi)). This way, you don't give an advantage...

    ...neither to those topics that you find in many web pages (because they have a low fraction context/web)

    Example
    context = "Sergey Brin"
    topic = "the"
    relevance = 5

    ...nor to those topics that are found only in context, but in few pages (because they have a low percentage in context)

    Example
    context = "Sergey Brin"
    topic = "studied computer science and mathematics before co-founding Google"
    relevance = 1

    The best topics are those that you find often in context, and not often out of context.

    consider the logarithm. topics with 4 or more are practically always associated with that context; topics with 2 or more should be cited in an encyclopedia entry about that context.

    to write a web page is to take part in the definition of all the terms used. beyond wikipedia. it's the real lower-case semantic web.

    it's about the explicit data. but how much implicit is the common knowledge?

    problem: synonyms? but with this formula you could define as synonyms two words which, taken as context, produce similar rankings. much the same for translations.

    Applications

    Networks

    this formula, used by users or by a bot, could produce an encyclopedic network, a synaptopedia: every name, taken as context, is linked with an arrow to the name that is most relevant in that context. only names, i.e. expressions put in quotation marks. which structure will be produced?

    users-driven synaptopedia: every user proposes, for a given name, another name. the user is given the points of that association. the rank of a user is given by the points obtained divided by the number of associations proposed. specialists are ranked better, because with this formula you obtain more points for a less-cited context. to avoid spam: associations below x points are deleted; possibly, you can add a name only if it's the best topic for a given context.

    note for every users-driven application: you can contribute, but it's better than Wikipedia, because your contribution cannot be deleted by anyone, and it's ranked for its objective value.

    bot-driven synaptopedia: a bot takes names from a list and computes the association among them. problem: scalability.

    note for every bot-driven application: you could decide at any moment how much data you want, and how much the data should be linked. these could be two scrollbars.

    network of topics in a given context: it's a way to classify the topics. thus, the different meanings of a name are divided. this method could be used to specify a web search.

    Others

    definitions (users-driven): for a given context, the users try to find the topic with the highest relevance. not only names (see networks), but every expression valid as a query in a search engine.

    wikipedia (users): an objective method to decide if a topic should be included in a given entry (context).

    search engine (bot): what happens if you feed the search engine not only with the context-query, but also with the best topic-query? could this bring us to an automatic encyclopedia (see Ionut Alex. Chitu's idea)?

    extraction of relevant terms from a text (bot). no more need for tags.

    A new idea?

    far better than Googleshare.

    it's more similar to the method by Rudi Cilibrasi and Paul Vitanyi, but their formula is better.

    the idea of multiplying the score by the ratio score/possibilities is taken from John Wesson - The Science of Soccer.

    if you know about anyone that already had this idea or something similar/better, please tell me.

    Bottom lines

    I'm sorry for the bad usage of the English language.

    I'm aware that, assuming that this idea could be appealing (and this is a strong assumption...), this way of presenting it makes it totally ugly.

    This post is in progress. But you are always invited to comment on it and to ask me questions (above all about what is not clear).

    Wednesday, June 14th, 2006
    2:10 am
    Why the Star Wars Prequel trilogy changed the sense of the Original trilogy

    Before viewing the Prequel trilogy: Luke turns Darth Vader back to the Light Side of the Force.

  • In fact, Darth Vader kills the Emperor to save his son.

    After viewing the Prequel trilogy: Luke doesn't turn Darth Vader back to the Light Side.

  • In fact, Darth Vader kills the Emperor because trying to kill his master is a typical Sith apprentice's behavior. [not correct: see comments. The new argument is:] In fact, killing someone to save one of your relatives (your wife, or your son) is a typical sign of you belonging to the Dark Side, not a sign of you turning back to the Light Side.

    Before: Luke becomes a Jedi.

    After: Luke never becomes a Jedi.

  • Darth Vader is destined to bring balance to the Force; but, since Darth Vader and the Emperor kill each other, there are no more Sith; so there are no more Jedi. In fact, Obi-Wan (killed by Darth Vader) and Yoda (natural death) are dead.
  • Furthermore, Yoda tells Luke that he needs only to kill Darth Vader to become a Jedi; Luke doesn't kill Darth Vader; so he never becomes a Jedi. [Not true: see comments.]
  • Furthermore, Jedi are taken into the Jedi fold during infancy/childhood and trained for a long time; but Luke is taken into the Jedi fold as a young adult, and his training is too short; so he is not a Jedi.

    Corollary

    Before: At the end, the Light Side wins.

    After: At the end, neither the Light Side nor the Dark Side wins: both lose, so there is a permanent balance in the Force.


    Corollary

    Before: Star Wars is the story of Luke, a guy who makes the Light Side win.

    After: Star Wars is the story of Darth Vader, a guy who brings balance to the Force.




    Open question: Who taught the "immortality trick" to Darth Vader? His spirit appears to Luke like those of Yoda and Obi-Wan, who were taught by Qui-Gon. [As Dug pointed out, he could have found the way by himself.]

    [Heavy use of Wikipedia articles]

    Thanks to Dug (or Duggy) for the criticisms.

    Standard bottom line: Probably someone thought these same things. Please let me know who.

    A bottom line that should not be standard: Excuse me for my bad English, in the post and in the comments.

    Sunday, May 7th, 2006
    1:29 pm
    The Apocalypses are near
    Ok. Are you for science or for psychedelia? Is your faith Mayan or Christian? It doesn't matter. For you, the End will arrive on the same date.

    The Technological Singularity will happen in 2012.

    But we can be more precise. The Mayan calendar will end on the same day as the Timewave Zero predicted by the psychedelic expert Terence McKenna (who, for his theory, uses the I Ching and Alfred North Whitehead): December the 21st, 2012.

    On the other side, the Antichrist will be born on June the 6th of this year, 2006 (6.6.6). (1) But when will he manifest himself? Maybe at the age of 6 years and 6 months. I.e. in December 2012...



    (1) Why 2006, and not 1006, or 2106? Because Benedict XVI is the penultimate pope in history.
    Tuesday, December 6th, 2005
    1:53 am
    The Great Soccer World Cup Conspiracy
    Germany or Brazil will win the next World Cup. I don't say this because Germany is the host, or because Brazil is the winner of the last Cup and the most talented team. One of these teams will win because there's a trend. Which trend? You can see it for yourself:


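                  1930    1934-38 1950    1954    1958-62 1966    1970    1974    1978    1982    1986    1990    1994    1998    2002
    Italy         .       x       .       .       .       .       .       .       .       x       .       .       .       .       .
    South America x       .       x       .       .       .       .       .       x       .       x       .       .       .       .
    Germany       .       .       .       x       .       .       .       x       .       .       .       x       .       .       .
    Brazil        .       .       .       .       x       .       x       .       .       .       .       .       x       .       x
    Europe        .       .       .       .       .       x       .       .       .       .       .       .       .       x       .

    (x = winner of that World Cup, . = no win; South America = Uruguay, then Argentina; Europe = England, France)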
    1:51 am
    Why Italy will not win the World Cup before 2014
    The next World Cup will be won by Germany or Brazil. I don't say this because Germany is the host team, or because Brazil is the reigning champion and the team with the most talent. One of these two teams will win because there's a trend. Which trend? You can see it for yourselves:


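                  1930    1934-38 1950    1954    1958-62 1966    1970    1974    1978    1982    1986    1990    1994    1998    2002
    Italy         .       x       .       .       .       .       .       .       .       x       .       .       .       .       .
    South America x       .       x       .       .       .       .       .       x       .       x       .       .       .       .
    Germany       .       .       .       x       .       .       .       x       .       .       .       x       .       .       .
    Brazil        .       .       .       .       x       .       x       .       .       .       .       .       x       .       x
    Europe        .       .       .       .       .       x       .       .       .       .       .       .       .       x       .

    (x = winner of that World Cup, . = no win; South America = first Uruguay, then Argentina; Europe = England, France)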