zhouzhuang (zhouzhuang) wrote,

Google knows it. All.

The idea

A formula to compute the relevance of a topic (expressed as a search-engine query) in a context (also expressed as a query):

((pct/pc)^2)/(pt/pi)
pct = (number of) pages with the two queries together
pc = pages with the context-query
pt = pages with the topic-query
pi = pages indexed by the search engine
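
A minimal sketch of the formula in Python, assuming the four page counts have already been obtained from a search engine by some means. The function name `relevance` and the input checks are illustrative choices, not part of the original idea.

```python
def relevance(pct: float, pc: float, pt: float, pi: float) -> float:
    """Relevance of a topic in a context: ((pct/pc)^2) / (pt/pi).

    pct: pages matching both queries, pc: pages matching the context,
    pt: pages matching the topic, pi: pages indexed by the search engine.
    """
    if min(pc, pt, pi) <= 0:
        raise ValueError("pc, pt and pi must be positive")
    if pct < 0:
        raise ValueError("pct cannot be negative")
    share_in_context = pct / pc   # fraction of context pages that mention the topic
    share_in_web = pt / pi        # fraction of all indexed pages that mention the topic
    return share_in_context ** 2 / share_in_web
```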

Example
context = "Sergey Brin"
topic = "Google"
pct = 1,020,000
pc = 1,090,000
pt = 877,000,000
pi = 25,700,000,000
relevance of Google for Sergey Brin = ((pct/pc)^2)/(pt/pi) ≈ 25.7
relevance of Sergey Brin for Google ≈ 0.03 (i.e. when speaking of Brin you should cite Google, but the converse is not true)
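
For anyone who wants to check the arithmetic, here are the two computations written out with the counts listed above. The intermediate values are rounded, so the last digits are approximate.

```latex
\[
\frac{(p_{ct}/p_c)^2}{p_t/p_i}
= \frac{(1{,}020{,}000 / 1{,}090{,}000)^2}{877{,}000{,}000 / 25{,}700{,}000{,}000}
\approx \frac{0.9358^2}{0.03412}
\approx \frac{0.8757}{0.03412}
\approx 25.7
\]
\[
\frac{(1{,}020{,}000 / 877{,}000{,}000)^2}{1{,}090{,}000 / 25{,}700{,}000{,}000}
\approx \frac{(1.163 \times 10^{-3})^2}{4.241 \times 10^{-5}}
\approx \frac{1.353 \times 10^{-6}}{4.241 \times 10^{-5}}
\approx 0.032
\]
```

In the second line the roles are swapped: "Google" is the context (so pc = 877,000,000) and "Sergey Brin" is the topic (so pt = 1,090,000), while pct and pi stay the same.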

What you do is multiply the percentage of the topic in the context by the ratio of this percentage to the percentage of the topic in the entire web. This way, you don't give an advantage...

...neither to topics that you find on many web pages (because their context-to-web ratio is low)

Example
context = "Sergey Brin"
topic = "the"
relevance = 5

...nor to topics that are found only in context, but on few pages (because their percentage in context is low)

Example
context = "Sergey Brin"
topic = "studied computer science and mathematics before co-founding Google"
relevance = 1

The best topics are those that you find often in context, and not often out of context.
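
The two factors described above can be looked at separately. This is only a sketch: the function name `relevance_factors` and the word "lift" for the second factor are labels chosen here for illustration, and the counts are the ones from the "Sergey Brin" / "Google" example.

```python
def relevance_factors(pct, pc, pt, pi):
    """Split the relevance into its two factors.

    Returns (share_in_context, lift); their product is ((pct/pc)^2)/(pt/pi).
    """
    share_in_context = pct / pc               # how often the topic appears in context
    share_in_web = pt / pi                    # how often the topic appears anywhere
    lift = share_in_context / share_in_web    # in-context share relative to the whole web
    return share_in_context, lift


# Counts from the "Sergey Brin" / "Google" example.
share, lift = relevance_factors(1_020_000, 1_090_000, 877_000_000, 25_700_000_000)
print(round(share, 3), round(lift, 1), round(share * lift, 1))
# prints: 0.936 27.4 25.7
```

A topic found everywhere has a high share but a lift close to 1; a topic found on very few pages can have a high lift but a small share. Only a topic with both scores well.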

consider the logarithm of the relevance. topics scoring 4 or more are practically always associated with that context; topics scoring 2 or more should be cited in an encyclopedia entry about that context.
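
A rough sketch of these thresholds in Python. The base of the logarithm is not stated above, so base 10 is used here purely as an assumption. The names `log_relevance` and `tier`, and the wording of the labels, are made up for illustration.

```python
import math


def log_relevance(score: float, base: float = 10.0) -> float:
    """Logarithm of a relevance score (score must be positive).

    ASSUMPTION: base 10. The text does not say which logarithm is meant.
    """
    return math.log(score, base)


def tier(log_score: float) -> str:
    """Apply the two thresholds (4 and 2) to a log-relevance value."""
    if log_score >= 4:
        return "practically always associated with the context"
    if log_score >= 2:
        return "should be cited in an entry about the context"
    return "below both thresholds"
```

For example, `tier(log_relevance(score))` labels a raw relevance score. A different base only changes where the two cut-offs fall on the raw score.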

to write a web page is to take part in the definition of all the terms used. beyond wikipedia. it's the real lower-case semantic web.

this is about explicit data. but how much of common knowledge is implicit?

problem: synonyms? but with this formula you could define as synonyms two words which, taken as context, produce similar rankings. much the same goes for translations.
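
One possible way to make "similar rankings" concrete, as a sketch only: take the top-k topics for each of the two contexts and measure how much the two sets overlap (Jaccard overlap). This measure is a choice made here, not something the idea prescribes, and the function names are hypothetical. Each input is assumed to be a mapping from topic to relevance, already computed for one context.

```python
def top_topics(scores: dict[str, float], k: int) -> set[str]:
    """The k highest-scoring topics for one context."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def ranking_similarity(scores_a: dict[str, float],
                       scores_b: dict[str, float],
                       k: int = 10) -> float:
    """Jaccard overlap of the two top-k topic sets, between 0.0 and 1.0."""
    top_a = top_topics(scores_a, k)
    top_b = top_topics(scores_b, k)
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)
```

Two contexts whose best topics are nearly the same get a value close to 1 and become candidates for synonyms (or translations); where to put the cut-off is left open.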

Applications

Networks

this formula, used by users or by a bot, could produce an encyclopedic network, a synaptopedia: every name, taken as context, is linked with an arrow to the name that is most relevant in that context. only names, i.e. expressions put in quotation marks. what structure would emerge?
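
A small sketch of how such a network could be built from a list of names. It assumes a callable `score(context, topic)` that returns the relevance of a topic in a context; that callable, and the function name `best_topic_links`, are hypothetical and stand in for whatever actually fetches the page counts.

```python
from typing import Callable, Iterable


def best_topic_links(names: Iterable[str],
                     score: Callable[[str, str], float]) -> dict[str, str]:
    """Map each name, taken as context, to the other name most relevant in it.

    Each (context -> topic) entry is one arrow of the network.
    """
    names = list(names)
    links: dict[str, str] = {}
    for context in names:
        candidates = [n for n in names if n != context]
        if candidates:
            # The arrow goes to the candidate with the highest relevance.
            links[context] = max(candidates, key=lambda topic: score(context, topic))
    return links
```

The result is a directed graph in which every name has exactly one outgoing arrow, so the interesting structure is in which names collect many incoming arrows.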

user-driven synaptopedia: every user proposes, for a given name, another name. the user is awarded the points of that association. the rank of a user is the points obtained divided by the number of associations proposed. specialists are ranked better, because with this formula you obtain more points for a less-cited context. to avoid spam: associations below x points are deleted; optionally, you can add a name only if it's the best topic for a given context.
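
The ranking rule can be sketched like this. One detail is an assumption made here: deleted associations (those below the threshold) earn no points but still count as proposed, so spamming weak associations lowers the rank. The function name `user_rank` and the parameter `min_points` (the "x" above) are illustrative.

```python
def user_rank(points_per_association: list[float], min_points: float = 0.0) -> float:
    """Points obtained divided by the number of associations proposed.

    ASSUMPTION: associations scoring below min_points are deleted and earn
    nothing, but they still count in the number of proposals.
    """
    proposed = len(points_per_association)
    if proposed == 0:
        return 0.0
    kept = [p for p in points_per_association if p >= min_points]
    return sum(kept) / proposed
```

For example, a user with two associations worth 10 and 20 points has rank 15; a user with one 10-point association and one that falls below the threshold has rank 5.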

note for every user-driven application: you can contribute, as on Wikipedia, but it's better, because your contribution cannot be deleted by anyone, and it's ranked by its objective value.

bot-driven synaptopedia: a bot takes names from a list and computes the association among them. problem: scalability.
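
To get a feeling for the scalability problem, here is a back-of-the-envelope count of the search queries a bot would need. It assumes one query per name (for pc and pt), one query per unordered pair of names (for pct, taken to be the same in both directions), and that pi is known in advance. The function name `queries_needed` is illustrative.

```python
def queries_needed(n_names: int) -> int:
    """Search queries needed to score every pair among n_names names.

    One query per single name plus one per unordered pair of names.
    ASSUMPTION: pi is known already and pct is symmetric.
    """
    return n_names + n_names * (n_names - 1) // 2


print(queries_needed(1000))
# prints: 500500
```

The number of queries grows with the square of the number of names, so a list of a thousand names already needs about half a million queries.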

note for every bot-driven application: at any moment you could decide how much data you want, and how densely these data should be linked. these could be two scrollbars.

network of topics in a given context: it's a way to classify the topics. thus, the different meanings of a name are separated. this method could be used to refine a web search.

Others

definitions (user-driven): for a given context, the users try to find the topic with the highest relevance. not only names (see networks), but every expression valid as a query in a search engine.

wikipedia (users): an objective method to decide whether a topic should be included in a given entry (context).

search engine (bot): what happens if you feed the search engine not only with the context-query, but also with the best topic-query? could this lead us to an automatic encyclopedia (see Ionut Alex. Chitu's idea)?

extraction of relevant terms from a text (bot). no more need for tags.

A new idea?

far better than Googleshare.

it's more similar to the method by Rudi Cilibrasi and Paul Vitanyi, but their formula is better.

the idea of multiplying the score by the fraction score/possibilities is taken from John Wesson's The Science of Soccer.

if you know of anyone who has already had this idea, or something similar or better, please tell me.

Bottom lines

I'm sorry for my poor English.

I'm aware that, assuming this idea could be appealing (and that is a strong assumption...), this way of presenting it makes it totally ugly.

This post is a work in progress. But you are always invited to comment on it and to ask me questions (above all about whatever is not clear).
