If you’re not familiar with tag clouds, let me refer you to my friend steve. The steve project is all about collecting tags on works of art from a variety of partner institutions. One common way of visualizing the data collected from a tagging experience is to produce a tag cloud.
These are the steve tagger’s top 100 most contributed tags. The red ones happen to be a few of those that I’ve entered. You might notice that some of them look a little funny, like lightgray. This is because we use a normalization process to equate words that are essentially the same. For example, consider these tags on The Boat Builders by Winslow Homer:
A term like “New England” may have been entered with proper capitalization, or with all lowercase or uppercase characters. To treat these entries as the same basic concept, we normalize them to “newengland”. Similarly, some folks might type a term like “seashore” with a space, and other folks without a space. We recognize that sometimes these minor differences make a significant difference in the meaning of a term, so we keep all of the original tags as they were entered. This normalization handles most cases properly (further research into the occurrence rate of special cases continues in the Text Tags and Trust project), and it allows us to display objects that have been tagged with all variants of a term when it is clicked in the tag cloud.
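As a rough illustration (not the steve tagger’s actual code), this kind of normalization boils down to lowercasing a tag and stripping whitespace and punctuation so that surface variants collapse to one key, while the original entries are stored separately:

```python
import re

def normalize_tag(tag: str) -> str:
    """Collapse surface variants of a tag to a single normalized key.

    Lowercases the tag and strips whitespace and punctuation, so
    "New England", "new england", and "NEW ENGLAND" all map to
    "newengland". The original tag should be stored alongside this key.
    """
    return re.sub(r"[^a-z0-9]", "", tag.lower())

# All of these variants normalize to the same key:
print(normalize_tag("New England"))   # newengland
print(normalize_tag("sea shore"))     # seashore
print(normalize_tag("SEASHORE"))      # seashore
```

Clicking a term in the tag cloud can then look up objects by the normalized key, retrieving everything tagged with any variant.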
After handling simpler transformations like whitespace removal and lowercasing, the next step we’re interested in taking is lemmatization. You can see a few examples where this can be done in the tag cloud above: the term “rocks” can be lemmatized to “rock”, and “children” to “child”. By adding another normalization routine that takes this extra step using a lemmatizer from the Natural Language Toolkit (which makes use of the WordNet database’s built-in morphy function), we can generate the following tag cloud for this painting:
This may just be my perspective, but I find that the lemmatized version really gives a better sense of representation, uncluttered by redundancy. We’re still doing some fine-tuning, and looking into how to handle terms with multiple words. Our partners at the University of Maryland are doing some research to figure out how we might define heuristics to handle multi-word terms (e.g. a rule that says we should lemmatize the second word of an adjective-noun pair), based on the collection of tags that we currently have.
As we study the results, we will definitely be considering scenarios where this sort of normalization fails to recognize nuance and gives misleading results, and how to handle this both in our research and in user interfaces. It’s a tricky problem that is sure to lead to many interesting questions and findings about folksonomic linguistics to discuss in the future.
Filed under: Technology