Pythonism

code and the oracular

Google Sets, keywords & taxonomies

I am slightly fascinated by Google Sets because of its connection to ontologies, knowledge extraction and data mining. I came up with a script that automatically queries Sets to build families of keywords that all share, very approximately, a subject domain. I used this for some personal research on using page metadata keywords to assign pages to a subject domain. For each subject (“Physics”, “Chemistry”, “Biology” and so on) I had the subject name as a dictionary key, and the corresponding value was a list of keywords. A page whose meta keywords contained one of those words was likely to be about that subject.
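A minimal sketch of that dictionary idea, not the original script. The keyword lists, the name `classify` and the scoring rule (count exact word matches, pick the highest) are all assumptions made for illustration.

```python
# Hypothetical subject -> keyword lists; the words are illustrative only.
SUBJECT_KEYWORDS = {
    "Physics": ["quantum", "particle", "relativity", "momentum"],
    "Chemistry": ["molecule", "reaction", "catalyst", "polymer"],
    "Biology": ["cell", "protein", "genome", "species"],
}


def classify(meta_keywords, subject_keywords=SUBJECT_KEYWORDS):
    """Return the subject sharing the most words with the page's meta
    keywords, or None when nothing matches (assumed scoring rule)."""
    page_words = {word.strip().lower() for word in meta_keywords}
    scores = {
        subject: len(page_words & set(words))
        for subject, words in subject_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling `classify(["Quantum", "particle", "cell"])` picks "Physics" here because two of the three words sit in that list. Ties go to whichever subject comes first in the dictionary, which a real system would want to handle more carefully.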

In the small print Google says no use for commercial purposes, but they lock you out if you automate more than a handful of queries anyway, so I don’t think I was being that evil.

I wondered if the meta keywords could provide a sufficiently accurate way to classify documents, so I fed my system some dummy documents and it came out right about 80% of the time. In my code I started with a seed list of 10 or so keywords, which are then ‘grown’ in Sets to populate the list with more words from a similar domain. I believe the algorithm behind Sets has been used in some ontology work elsewhere, including NELL and Google Squared.
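The growing step might look roughly like the sketch below. It is an assumption about the shape of the loop, not the original code: `grow`, `rounds` and the `query` callable are hypothetical names, and `query` stands in for whatever function sends five words to Sets and returns the suggested words.

```python
import random


def grow(seeds, query, rounds=3, rng=None):
    """Grow a keyword list by repeatedly feeding samples of up to five
    known words to `query` and keeping any new words that come back."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    # Lower-case and de-duplicate the seeds while keeping their order.
    keywords = list(dict.fromkeys(word.lower() for word in seeds))
    for _ in range(rounds):
        sample = tuple(rng.sample(keywords, min(5, len(keywords))))
        for word in query(sample):
            word = word.strip().lower()
            if word and word not in keywords:
                keywords.append(word)
    return keywords
```

Passing the query function in as an argument keeps the loop testable without touching the network, which also helps given how quickly automated queries get locked out.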

I visualised my document collection as a bag of rice grains scattered onto the floor. My subject categories were like playing cards thrown onto the grains. In a taxonomic system you want maximum coverage by the cards and minimum overlap in the terminal categories of the bottom level. There are two kinds of overlap: true overlap, where parts of each card are not covered, and enclosure, where a bigger card completely covers smaller ones. For a taxonomy that covers every instance you want the terminals to have no overlap. Parent and ancestor categories can cover by inclusion, but overlap among the terminals causes ambiguity, which is bad for the system. If a user is looking for “biophysics” papers and some of those have been placed, unbeknownst to her, under “physical biology”, then the system has failed her.
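The cards-and-grains picture can be checked with plain Python sets. This is only a sketch under assumed names: `coverage` and `overlap_kinds` are hypothetical helpers, and each category is represented as a set of document ids.

```python
from itertools import combinations


def coverage(documents, categories):
    """Fraction of documents (grains) lying under at least one card."""
    documents = set(documents)
    if not documents:
        return 0.0
    placed = set().union(*categories.values())
    return len(documents & placed) / len(documents)


def overlap_kinds(categories):
    """Label each pair of categories that share documents as either
    'enclosure' (one card fully inside the other) or 'true overlap'."""
    kinds = {}
    for (a, docs_a), (b, docs_b) in combinations(categories.items(), 2):
        if not docs_a & docs_b:
            continue  # the two cards do not touch
        enclosed = docs_a <= docs_b or docs_b <= docs_a
        kinds[(a, b)] = "enclosure" if enclosed else "true overlap"
    return kinds
```

For terminal categories the goal described above would be a coverage of 1.0 and an empty result from `overlap_kinds`.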

It’s a nice creative task to develop a taxonomy which has complete coverage and also a category size that stays roughly constant. This is a large feature of human knowledge: our systems of description autogenerate in the brain and we use them all the time. Think of early anthropologists writing down South Sea islanders’ taxonomic knowledge of plants and animals…

I believe though that no single taxonomy can cover everything. We need multiple ones in parallel to have truly dynamic knowledge. So classifying a large collection of documents by the conventional academic subject tree alone is less powerful than also using alternative taxonomies alongside it, such as the geographic location of the authors or the number of citations. This enrichment of the ways people classify will allow more subtle patterns to be seen, and we see that in many of the pretty mash-ups and visualisations that have been done recently. If your data is to be open source then early design decisions favouring reusability are a very sound thing. If you love data and its representation it is fun anyway.

It’s part of good data-driven sites that they have multiple taxonomies. Think of a WordPress site where you can browse posts by category or by a tag cloud. If browsing is all it is about then you shouldn’t get hung up on making your tree perfectly symmetrical or consistent. If you put biophysics under both physics and biology in a browsable tree then that is not ambiguous; rather it speeds up the process of users getting to what they need. Again it is the terminal category structure that should not be ambiguous: think of an organism that is a living thing without being either eukaryotic or prokaryotic. There isn’t one, so the tree has coverage. You can’t have an instance in a non-terminal category in that kind of tree. The tree of subjects in Yahoo’s web directory does allow that feature though. Think what kind of application you are looking for. At first I made the mistake of getting hung up on a mathematically beautiful idea of categories instead of a usable one.
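The rule that instances only live in terminal categories is easy to check mechanically. A small sketch, assuming the tree is stored as a dictionary from category to child categories and the instances as a dictionary from category to items; both layouts and the function name are hypothetical.

```python
def non_terminal_instances(children, instances):
    """Return the categories that have child categories and also have
    instances attached directly to them."""
    return sorted(
        category
        for category, items in instances.items()
        if items and children.get(category)
    )
```

An empty list means the tree is the strict kind. A browsable directory that allows entries at any level would simply skip this check.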

We’ve come a long way since the Dewey Decimal System. Maybe one day you’ll be able to tell Amazon to look for garden furniture that is green, baroque in style, not polypropylene and weighs under 500 grams per piece. A smart system built by someone who has walked the ontology/taxonomy walk should be able to do this by hashing some separate trees together with a database search. It’s a big component of human knowledge that we make inferences like this, and machines are doing it too in smarter and smarter ways these days.
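One way to read “hashing separate trees together” is to keep one index per taxonomy and intersect the results. The sketch below assumes a tiny made-up catalogue: the item ids, index names and the `find` function are all hypothetical, and a real system would hold the weights in a database rather than a dictionary.

```python
# Hypothetical indexes: each taxonomy maps a value to a set of item ids.
BY_COLOUR = {"green": {1, 2, 4}, "white": {3}}
BY_STYLE = {"baroque": {1, 4}, "modern": {2, 3}}
BY_MATERIAL = {"polypropylene": {4}, "iron": {1}, "wood": {2, 3}}
WEIGHT_GRAMS = {1: 450, 2: 300, 3: 900, 4: 200}  # stands in for a database


def find(colour, style, not_material, max_grams):
    """Intersect the colour and style indexes, drop the unwanted
    material, then filter the survivors by weight."""
    ids = BY_COLOUR.get(colour, set()) & BY_STYLE.get(style, set())
    ids = ids - BY_MATERIAL.get(not_material, set())
    return sorted(i for i in ids if WEIGHT_GRAMS[i] < max_grams)
```

With this data, `find("green", "baroque", "polypropylene", 500)` keeps item 1 and rejects item 4, which is green and baroque but made of polypropylene.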

Here’s the Sets query function; just add some code to tidy the whitespace out of the markup.


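import urllib.parse
import urllib.request

def sety(wordtuple):
    """performs a google sets search
    takes a 5-tuple, returns list of lines from the result page"""
    url = 'http://labs.google.com/sets?hl=en'
    starting = '&q1=%s&q2=%s&q3=%s&q4=%s&q5=%s'
    # URL-quote each word so spaces and punctuation survive in the query
    quoted = tuple(urllib.parse.quote_plus(word) for word in wordtuple)
    query = url + starting % quoted
    # urlretrieve (urllib.request in Python 3) saves the page to a
    # temporary file and returns (filename, headers)
    filename, _headers = urllib.request.urlretrieve(query)
    with open(filename, encoding='utf-8', errors='replace') as page:
        return page.readlines()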
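The tidying step could be as small as the sketch below. The name `tidy` is hypothetical, and it only assumes that the query function hands back a list of text lines.

```python
def tidy(lines):
    """Collapse runs of whitespace inside each line, trim the ends,
    and drop lines that are empty afterwards."""
    cleaned = (" ".join(line.split()) for line in lines)
    return [line for line in cleaned if line]
```

This leaves the markup itself in place; pulling the suggested words out of the tags would be a separate parsing step.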

Written by Luke Dunn

November 24, 2010 at 6:00 pm
