Monday, December 9, 2013

Taxonomy Governance

Recently I was asked to speak on a panel on taxonomy governance, so this gave me an opportunity to reflect more on the subject. "Metadata Enhancement for Improved Content Management - Taxonomies and Governance" was the title of a panel I spoke on at the Gilbane Conference 2013: Content and the Digital Experience in Boston on December 3.

When I first heard of "governance" with respect to knowledge management and taxonomies, in 2005, it did not sound like a subject of interest to me. Perhaps I was thinking of it in terms of business process management in general, which is not my field. Over the years I have come to realize that governance is a very important part of any taxonomy, and while governance can be limited to governing the taxonomy itself, it can extend to other areas that are related to the taxonomy, such as indexing and content management. Most significantly, though, there is a synergy or dualism of taxonomies and governance: to be effective, taxonomies must be governed, yet the existence of a taxonomy itself is a form of governance. A taxonomy, after all, is a kind of controlled vocabulary, and “controlled” means governed. It's better to describe what taxonomy governance entails than to try to define it. Taxonomy governance comprises the policies, procedures, and documentation for the ongoing management and use of a taxonomy.

My main points in my brief presentation were:
  • Governance process begins when taxonomy development begins.
  • Each taxonomy is unique and has its own governance policy.
  • Governance includes both:
    • Documented editorial policies
    • Taxonomy management procedures and responsibilities
  • There are minimal guidelines to a taxonomy when it is started.
  • Decisions reached to questions as they come up in the process are documented and eventually become policy.
  • Taxonomy policy/guidelines includes both:
    • Taxonomy specifications, style and maintenance
    • Taxonomy usage and indexing/tagging/categorization policy (manual or automated)
Reflecting on the different taxonomy jobs I have had and projects I have worked on, taxonomy governance has taken many forms beyond the obvious one of documenting the taxonomy editorial policies. Even though I did not hear of taxonomy governance until I had been working for years with taxonomies, I actually had been involved with governance for many years prior, just not by that name. My first job working with taxonomies (then called controlled vocabularies) was with the title of Vocabulary and Quality Management Specialist. In addition to maintaining the controlled vocabularies according to prescribed procedures, my duties included writing guidelines for the indexers using the vocabularies, especially for new topics and current events, and checking the published content for possible vocabulary-related quality issues. At my next employer, a developer of search software with built-in taxonomies, documenting how to create the taxonomies in a consistent style was simply a part of documenting how to use the software. Later, on an assignment with a consulting firm, an ongoing contract involved making regular updates to an ecommerce client's product taxonomy, following a certain procedure and workflow that was tracked in SharePoint. Finally, in more recent years as an independent taxonomy consultant, I have made sure that taxonomy editorial policies and maintenance guidelines are always a part of my project plans.

When a taxonomy project is short on time or budget, there may be a temptation to skip the governance documentation and planning. But in the long term, that will cost more. Time will be wasted by the taxonomy editors going back through old emails to try to find out what was decided when individual questions came up. Taxonomy editors will also waste time having to redo some of their work, after realizing that they were not following a consistent style or policy. Finally, and most crucially, lack of governance will likely result in an inconsistently developed taxonomy, which in turn leads to inconsistent indexing/tagging, no matter the method used. Then the main purpose of the taxonomy is defeated.

Taxonomy governance might not be as hot a topic as it was a few years ago, but that's only because it has become standard, accepted practice. Yet there is still a lot that an organization owning a taxonomy can learn about governance in the form of best practices and case studies. While organizations may not want to share their taxonomies, as intellectual property, hopefully they will share their experiences and tips on taxonomy governance.

Saturday, November 9, 2013

Information Architecture and Taxonomies

While interest in “information architecture” by that name has declined in the past decade, interest in what information architecture involves continues to be strong, and perhaps there is some merging of the fields of taxonomy and information architecture.

At one point in my career I wanted to be an information architect, to organize the pages and menus of websites and intranets. The discipline’s leading professional association, the IA Institute, additionally describes the field as “The structural design of shared information environments.” But within a couple of years, I found that interest in my information architecture skills, at least for small websites (“little IA”), was getting squeezed out in favor of skills in either graphic design or technical web development. Over time it also seemed as if information architecture was being replaced by the growing field of user experience design (UXD). Indeed, Google search trends show a definite decline in interest in the phrase “information architecture” during the same period as a steady growth in interest in “user experience.”

I was therefore pleasantly surprised to find that information architecture was one of the themes at this year’s Taxonomy Boot Camp (Washington, DC, November 5-6, 2013), the leading conference dedicated to taxonomies.

Information architecture was a central part of the keynote “Taxonomy Is Power: Bringing It All Together,” presented by Bob Boiko. He started off explaining that information systems are a triad of people, information, and technology. He, too, had observed that information architecture (IA) has often been “captured” by user experience (UX), moving away from technology toward the user, and that the “information” piece of the triad sometimes gets lost along the way and needs more attention. Bob defined information architecture as “the art and science of designing information structures” and said that information architects live in the space between art (design) and science (technology). Information architecture is also about naming things, and taxonomies can help engineers and designers name things for both the front end and back end of an information system. Bob said that taxonomists should look at and “own” the concept of information architecture.

The conference also featured a session of three presentations under the heading “User Experience (UX) in Taxonomy Design.” Michael Rudy, of the consultancy Factor, spoke on the benefits of integrating user experience with information management, and Bram Wessel, also of Factor, presented on how different methods of user research, common in user experience design, such as card sorting, tree testing, personas, and prototyping, are also applicable to taxonomies. Taking a different angle to the issue, Ben Licciardi of PPC presented methods of designing the manual indexing/tagging interface for taxonomy use.

There are various perspectives and approaches to this field, whether stressing structure as in “architecture,” naming, as in “taxonomy,” or meaning, as in “semantics.” Different labels may resonate better with different audiences. The week of the conference I was also indexing a book on user experience design (a small project to do on the plane and to broaden my knowledge of the subject). While “taxonomy” was not mentioned in this light book, “semantic design” was the name of a section which mentioned information architecture, organizing information, and metadata.

Several years ago, perhaps 2007, when I introduced myself as a taxonomist to someone at a professional conference, I was asked what the difference was between taxonomists and information architects. My answer then is the same as it is now: there is definitely a significant area of overlap between the skills, tasks, and responsibilities in both professions, although there are some areas that concern information architects and not most taxonomists, and there are areas that concern taxonomists and not most information architects. So, it may only depend on what kind of information architect or kind of taxonomist you are. I hope one day to also attend the main information architecture conference, the IA Summit, and continue this discussion, as interest in taxonomies is remaining strong.

Sunday, October 6, 2013

Taxonomies and Text Analytics Compared

Last week (September 30 – October 1) I attended the Text Analytics World conference in Boston as an invited speaker. This is the second year I was fortunate to present at and attend this conference, which also meets in San Francisco in the spring. I posted a blog about the conference last fall, “Text Analytics and Taxonomies,” discussing the strong connections between taxonomies and text analytics in serving similar data/information retrieval goals. That connection between the two was again apparent at this year’s conference, with many speakers mentioning taxonomies, and I came away with additional analogies, beyond their shared purpose.

Problematic definition

Neither taxonomies nor text analytics is well defined, and each can have both a narrow definition and a broad definition. For taxonomies, the narrower meaning is a hierarchical tree of concepts arranged with broader and narrower relationships. The broad meaning of taxonomy is any controlled vocabulary, whether hierarchies, facets, thesauri, authority files, or simple term lists to fill metadata fields. For text analytics, the narrower meaning is “text mining,” the process of deriving high-quality information contained in natural language text. But the conference chair, Tom Reamy of the KAPS Group, explained that the conference takes a broader definition of text analytics to include not only text mining but also auto-categorization, sentiment analysis, predictive analytics, entity extraction, and machine learning.

There is also the issue of whether the name is appropriate. Some people don’t like the name taxonomies, and try to avoid it. Similarly, there are issues with the designation of “text analytics.” Discussion in the conference’s expert sessions and closing session brought up the issue that perhaps a better name is needed for the field. Both “text” and “analytics” have issues, as they both have assumed narrower meanings. Text analytics comes out of the field of knowledge management, but that field is too broad. A more accurate label that Tom Reamy suggested was “unified data insights,” but it will stay text analytics for now.

Technology and human effort

Both taxonomies and text analytics rely on technology/software, but neither is a 100% automated solution, nor can the software products be used as out-of-the-box solutions without significant trained and skilled usage. If we consider the software as “tools” rather than “solutions,” we have a more realistic understanding of what the software can do. The process of building a taxonomy is aided by taxonomy or thesaurus management software, which is a kind of tool that an experienced taxonomist uses to manage the terms, relationships, synonyms, notes/definitions, and other term attributes. Similarly, text analytics software, and auto-classification software in particular, requires expertise to leverage the tool for desired results. This was the theme of a presentation on selecting text analytics tools by Janine Johnson of Versik Analytics (who also used “tool” in her presentation title).

As I explained in my presentation, “Taxonomies for Auto-Tagging Unstructured Content,” both of the leading methods of auto-categorization, rules-based methods and machine-learning statistical methods, require considerable human input. In rules-based auto-categorization, experts need to write or edit rules for each taxonomy concept that leverage combinations of synonyms and proximity or other Boolean operators; and in machine-learning auto-categorization, experts need to identify and essentially pre-index a large set of sample documents for each taxonomy term, for the system to learn from the human-indexed examples.
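
To make the rules-based approach more concrete, here is a minimal sketch in Python. It is not how any particular auto-categorization product works: the rule format, the concept names, and the function names (RULES, tokenize, within, categorize) are all assumptions made up for illustration. Each hypothetical rule combines a list of synonyms with an optional proximity condition, joined by a Boolean OR.

```python
import re

# Hypothetical rule set, invented for illustration. Each taxonomy concept has
# synonyms (matched as whole phrases) and an optional proximity rule:
# (word_a, word_b, max_gap) means the two words occur within max_gap tokens.
RULES = {
    "Taxonomy Governance": {
        "synonyms": ["taxonomy governance", "vocabulary governance"],
        "near": ("taxonomy", "policy", 5),
    },
    "Card Sorting": {
        "synonyms": ["card sorting", "card sort"],
        "near": None,
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within(tokens, word_a, word_b, max_gap):
    """Return True if word_a and word_b occur at most max_gap tokens apart."""
    positions_a = [i for i, tok in enumerate(tokens) if tok == word_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == word_b]
    return any(abs(i - j) <= max_gap for i in positions_a for j in positions_b)


def categorize(text):
    """Return the concepts whose synonym OR proximity rule matches the text."""
    tokens = tokenize(text)
    padded = " " + " ".join(tokens) + " "
    matched = []
    for concept, rule in RULES.items():
        synonym_hit = any(f" {syn} " in padded for syn in rule["synonyms"])
        near_hit = rule["near"] is not None and within(tokens, *rule["near"])
        if synonym_hit or near_hit:
            matched.append(concept)
    return matched
```

Even in a toy like this, someone has to choose the synonyms and tune the proximity window for every concept, which is the human input that the rules-based method requires.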

Multidisciplinary background

Both taxonomies and text analytics are seen as fields of expertise, methods of knowledge management, and at least parts of a solution to an organization’s information management problem. However, they are not academic disciplines or majors. Rather, the educational backgrounds and skills of people who work in the fields of both taxonomies and text analytics are somewhat varied and multidisciplinary.

In taxonomies, library/information science is the most dominant background, but probably does not account for any more than half of practicing taxonomists. Information architecture/user experience design, database design, knowledge management, editorial, and subject matter (health, law, science, business, etc.) expertise are also common backgrounds.

In text analytics, computer science is the most common background. A show of hands of the conference participants indicated that the majority had computer science or engineering backgrounds. But linguistics is also important (although the small minority at this conference were more hesitant to reveal themselves). The keynote speaker, Dr. James Pennebaker, was a psychologist and explained why psychology is also important to text analytics. Participants in the closing expert panel answered my question on educational background with a similar answer of a combination of computer science/programming, linguistics, and cognitive sciences.

In addition to the interdisciplinary background of taxonomists and text analytics professionals, the applications of taxonomies and text analytics also span all disciplines and industries. Conference case studies included applications of text analytics in education, pharmaceuticals, healthcare, publishing, telecommunications, and federal agencies.

Tuesday, September 17, 2013

Taxonomy Terms with “And”

In considering best practices for developing taxonomy term labels or names, there is the question about the use of the word “and” within taxonomy terms. My previous two blog posts were called “Tags and Categories” and “Card Sorting and Taxonomies,” which demonstrate how common it is to have the word “and” in titles, headings, or other labels. By extension, does it work in taxonomy terms?

The standards for taxonomies, ANSI/NISO Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies and ISO 25964-1 Thesauri and Interoperability with Other Vocabularies, make no mention of terms with the word “and.” While it is not explicitly prohibited, neither is it mentioned as an acceptable form among the rather exhaustive list of term format types. Even the section on compound terms makes no mention of terms with the word “and.” So, one might conclude that terms should not have the word “and” within them. Yet it is not uncommon, especially in larger, more specialized taxonomies and thesauri.

The simple little word “and” can actually have two different meanings:
1)      the intersection of two concepts, to include only that which belongs to both, which is the Boolean operator AND
2)      the combination or union of two concepts, to include any of either, which is actually the Boolean operator OR.
When it comes to taxonomy terms, the word “and” could have either of the above two usages, and it’s very important to know which it is in which case.
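
The two meanings correspond to set intersection and set union. As a small illustration in Python (the document IDs and function names below are invented for the example, not taken from any real collection):

```python
# Hypothetical sets of document IDs tagged with each single concept.
children = {"doc1", "doc2", "doc3"}
television = {"doc2", "doc3", "doc4"}


def and_meaning(set_a, set_b):
    """'And' as Boolean AND: only documents that belong to both concepts."""
    return set_a & set_b


def or_meaning(set_a, set_b):
    """'And' as Boolean OR: documents that belong to either concept."""
    return set_a | set_b


both = and_meaning(children, television)    # the narrower, intersecting scope
either = or_meaning(children, television)   # the broader, combined scope
```

With the same two starting sets, the AND reading always yields a result no larger than either set, while the OR reading yields a result at least as large as either set.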

“And” meaning AND

My blog post title “Card Sorting and Taxonomies” involves the first meaning, the intersection of both concepts, which in this case is the use and suitability of card sorting specifically for taxonomies. “Card Sorting and Taxonomies” is more concise than saying “the suitability of card sorting for taxonomies,” and taxonomy terms need to be concise. Examples of the use of “and” in this (Boolean AND) meaning in taxonomy terms that I have run across include:
    Children and Television
    Gender and Poverty

The choice of using “and” is significant. It means any intersection/relation of these two concepts. “Children and Television” comprises all of the following: children’s television shows, the impact of television (not just children’s programming) on children, the depiction of children in television, etc. Similarly, “Gender and Poverty” covers various issues, such as data on poverty rates by gender, how poverty affects the genders differently, and reasons why more women are poor in developing countries.
It is easy to identify this meaning of the word “and” when the two concepts linked by the conjunction are quite distinct. In many taxonomies, the preferred policy is to avoid creating such terms, lest the taxonomy become too large and complex.

“And” meaning OR

My blog post title “Tags and Categories” involves the second meaning, the combination of both concepts. I described what tags were and what categories were and compared them. Examples of the use of “and” in this (Boolean OR) meaning in taxonomy terms that I have run across include:
   Measurement and Analysis
   Laws and Regulations
   Roads and Highways
   Maintenance and Repair
An additional example is the title of the online course I teach: “Taxonomies and Controlled Vocabularies.”

The main reason to create such terms is that, while some content deals with one or the other of the two linked words, a significant amount of content really has to do with both, and users probably don’t care to make the distinction either, so it’s better to have just a single concept in the taxonomy. But one word is not equivalent to the other, so a taxonomy term cannot be created from just one word and the other designated as its nonpreferred term/synonym. Another situation for these types of taxonomy terms is a small browsable taxonomy that does not utilize/support synonyms. An additional reason to create them is that they can boost SEO (search engine optimization) in website labels by giving more words prominence. Finally, the combined terms can also appease competing stakeholders who both want their preferred label as part of the term name.

The difference in a taxonomy

If you have taxonomy terms with the word “and” in them, it needs to be clear which of these two Boolean meanings it is, not only to ensure accurate content tagging, but also to ensure the proper relationship of the term to other terms in the taxonomy. Recently I was reviewing a taxonomy with the term “Investment and Trade,” and from the term by itself, I could not determine whether it meant the intersection or combination of these two words, so I did not know how it should be related to the terms “Investment” and “Trade.”

A term with the Boolean AND is a narrower term to the terms of both its component parts, which is known as polyhierarchy. “Children and Television” is narrower to both “Children” and “Television.” When there occurs a term with Boolean OR, such as “Measurement and Analysis,” it is expected that the component words do not exist as preferred terms in the taxonomy. Rather, each word, “Measurement” and “Analysis,” could be a nonpreferred term/synonym for “Measurement and Analysis.”
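
One way to picture the difference is as a small data structure. The sketch below is only an illustration in Python, not the data model of any taxonomy management tool; the Term class, its field names, and the helper functions are assumptions. An AND term carries two broader terms (polyhierarchy), while an OR term carries its component words as nonpreferred terms.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A preferred term with its broader terms and nonpreferred terms."""
    label: str
    broader: list = field(default_factory=list)
    nonpreferred: list = field(default_factory=list)


# Hypothetical mini-taxonomy keyed by preferred label.
taxonomy = {
    "Children": Term("Children"),
    "Television": Term("Television"),
    # Boolean AND: narrower to both component terms (polyhierarchy).
    "Children and Television": Term(
        "Children and Television", broader=["Children", "Television"]
    ),
    # Boolean OR: the component words are nonpreferred terms, not separate terms.
    "Measurement and Analysis": Term(
        "Measurement and Analysis", nonpreferred=["Measurement", "Analysis"]
    ),
}


def resolve(entry):
    """Map a preferred or nonpreferred entry to its preferred label, else None."""
    if entry in taxonomy:
        return entry
    for label, term in taxonomy.items():
        if entry in term.nonpreferred:
            return label
    return None


def or_term_conflicts():
    """List nonpreferred terms that also exist as preferred terms."""
    return [
        variant
        for term in taxonomy.values()
        for variant in term.nonpreferred
        if variant in taxonomy
    ]
```

A check like or_term_conflicts() is the kind of rule that could flag an ambiguous case such as “Investment and Trade” if “Investment” were both a nonpreferred term and a preferred term in the same taxonomy.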

Friday, August 30, 2013

Card Sorting and Taxonomies

Card sorting is a common technique in information architecture for developing the organization of menu labels or categories on websites. It would thus seem to be a well-suited methodology for developing all kinds of taxonomies, but in actual practice card sorting is not utilized for most taxonomy projects, at least not in my experience.

Card sorting gets its name from the paper-based approach of having numerous category or concept names each written down on a small index card, so that the cards can be sorted on a table into logical categories. Multiple stakeholders and/or test users are given the opportunity in turn to organize the cards as they deem appropriate, and the person administering the card sort takes note of the choices and considers them for the actual organization structure. Today, card-sorting software, especially that which is web-based to allow remote access, has largely replaced the physical cards.

There are two variants to card-sorting exercises, the open card sort and the closed card sort. In an open card sort, participants sort the labeled cards in any groupings they see fit and then they assign their category groups with any group name they want. In a closed card sort, the participants are already presented with a set of named top category groups that they cannot change, and are asked to sort the labeled cards into the pre-assigned categories. Each type of card sort has distinct objectives and is suited for different stages of the project.

Open card sorting is a good way to get a new taxonomy from scratch off the ground when you have some concepts (extracted from the content) and don’t know how to organize them. However, this is increasingly no longer the scenario. It’s rare to start creating a taxonomy from scratch with no other reference for top categories. There are so many taxonomies in existence now for all subjects, that it’s easy to find a starting point as a model. Furthermore, the owner of a taxonomy may have already designated the top categories for business reasons.

The aim of closed card sorting is to determine in what broader category narrower categories belong, especially if there is uncertainty. But if a narrower category could rightfully belong under more than one category, rather than force a choice between one or the other based on a card sort, the subcategory could belong under both. This is what taxonomists call “polyhierarchy,” and it is acceptable as long as the hierarchy is sound and valid in both locations. Thus, closed card sorting is only needed when you have decided you do not want polyhierarchy. Polyhierarchy is generally a good thing, because it provides more than one navigation path to the same results, and different people choose different paths. Sometimes, however, polyhierarchy is avoided near the top levels of a taxonomy in order to maintain a sense of tree structure.

Card sorting is most practical for just two levels of hierarchy: concepts and their immediate parent categories. It’s possible but unwieldy to suggest to users that they may create three levels, and some card sorting software does not even allow it. Often it is more reliable to just run a second series of card sort testing for another hierarchical level in the taxonomy. However, running multiple card sort exercises for different hierarchical branches of a taxonomy can be quite impractical, if not also costly and time-consuming.

Finally, card sorting works only for traditionally hierarchical taxonomies. It does not work for faceted taxonomies, where terms from different facets/attributes are selected in combination to limit or filter search results. Faceted taxonomies are becoming increasingly common.

Card sorting continues to be useful for information architecture, though. When designing the structure of a website and its main and submenus, it can be difficult to decide what the categories should be, because the content of  a site can be unique or nonstandard. Additionally, polyhierarchy is not expected in submenus and could be confusing. Finally, website navigation is often not deeper than two or three levels, unlike many taxonomies that are often four or five levels deep and thus impractical to thoroughly design or validate with card sorting.

Wednesday, July 31, 2013

Tags and Categories

What does a taxonomy comprise and how does it work? Professional taxonomists may speak of “terms,” “nodes,” or “labels,” whereas most other people with a basic understanding of taxonomy might refer to “tags” or “categories.” A category is a well understood concept, and social media sites have made the notion of “tag” well known.

In addition to the different professional level of such jargon, there is also a distinction in meaning.  Ironically, it’s the professional terminology that is vague and the layman terminology that is more specific. Taxonomy “terms,” “nodes,” or “labels,” are all pretty generic and can all have various applications for different kinds of taxonomies, both for broad categorization and for specific indexing. “Tags” and “categories,” on the other hand, each tend to have distinct meanings. It’s not so much what they are, or even how they are organized, but rather how they are used.

Tags are for tagging.
That seems obvious. As for what is meant by “tagging,” that implies you put a tag on something. In fact, you can put more than one tag on something, and that’s typically encouraged in tagging. “Something” is typically an electronic file of some form of content, a document, image, video, database record, blog post, etc. Tags tend to be a brief label indicating what something is about. Tags can be very specific or relatively broad. Information professionals might prefer to call them “index terms.” An organized, alphabetized list of tags could serve as an index.

Categories are for categorizing.
This can also be called grouping or classifying. It implies putting something into a category, often represented as a file folder, whether an actual electronic folder path, or just a depiction of a folder icon. While categories have different levels of specificity, the name category implies a collection of things, so there is an implicit understanding that categories don’t get too specific. An organized structure of categories typically constitutes a hierarchical taxonomy.

Can something go into more than one category? In physical folders, no (unless you make a photocopy of the document for each folder), but in the digital world, often the answer is yes, but not always (again requiring the copying of files). It depends on the system, and it may involve some workaround. Even when it is possible to put a content item into more than one category, unlike tags, it is still preferable to have most content items assigned to only one category and a smaller number of them that may belong in two categories. For example, there may be a breadcrumb trail for the hierarchy of categories, and the breadcrumb trail may only take a single path. The idea is that the categories retain distinct meaning and usage through mostly distinct content.

Tags and categories together
Because tags and categories are different, it is possible to have both at the same time, especially if the categories are deliberately kept broad and the tags are relatively specific. Content management systems and digital asset management systems increasingly offer features of both categories and tags for managing content. In these cases, the challenge is to decide to what degree of classification to use the categories and to what degree to use the tags. That's exactly what I have done as a taxonomist on two recent consulting projects.

For the amateur taxonomist and indexer, one of the most common exposures to tags and categories is through blogs. Blogging software may permit the blog author to assign a tag or category to a blog post. Whether the tags and categories are appropriately named and used is another issue, though. Blogger provides only one option, which it calls "Labels" and utilizes an icon for a tag in the blogging interface, but then displays them when published in the right margin under a heading called "Categories." No wonder my "categories" don't look good; I had created them as if they were tags. Furthermore, the very specific subject matter of "The Accidental Taxonomist" blog makes its posts more suited for tagging than for categorizing. WordPress, on the other hand, gives the blogger both tools: tags and categories. If “The Accidental Taxonomist” blog eventually moves, you’ll know why.

Thursday, June 6, 2013

How Many Facets

Faceted taxonomies (taxonomies with attributes, dimensions, filters, etc. to limit search results based on the combination of selected criteria) are becoming increasingly popular with the support of web database technology. Unlike traditional hierarchical taxonomies, designing a faceted taxonomy first requires a decision on how many facets to create. There are various factors to take into consideration. 

What the content supports

The nature of the content is always the most important factor. It may seem ironic, but content that is more limited in scope can support more facets than content that is broad in scope. For example, an ecommerce site selling just computers could have a relatively large number of facets by which to limit laptop computers: brand, price range, hard drive, screen size, operating system, processor brand, processor type, webcam inclusion, and online/in-store availability (9 facets). On the other hand, if a content repository comprises all kinds of articles, then there is not much else beyond “subject” and article type to classify them by (2 facets). (Other metadata fields, such as author, title, and date, may also be used to limit results, but these do not involve taxonomy terms.)
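
To illustrate how facets limit results in combination, here is a minimal Python sketch using three of the laptop facets named above. The product records and the function name filter_by_facets are invented for the example, and treating multiple values within one facet as alternatives (OR) while requiring every selected facet to match (AND) is an assumption about the interface, although a common one.

```python
# Hypothetical laptop records described by three facets.
LAPTOPS = [
    {"name": "A", "brand": "Acme", "screen_size": "13 inch", "operating_system": "Linux"},
    {"name": "B", "brand": "Acme", "screen_size": "15 inch", "operating_system": "Windows"},
    {"name": "C", "brand": "Bolt", "screen_size": "13 inch", "operating_system": "Windows"},
]


def filter_by_facets(items, selections):
    """Keep items that match every selected facet.

    selections maps a facet name to a set of accepted values. Values within
    one facet are alternatives; all selected facets must match.
    """
    return [
        item
        for item in items
        if all(item.get(facet) in values for facet, values in selections.items())
    ]


# Selecting one brand and one operating system narrows three laptops to one.
acme_windows = filter_by_facets(
    LAPTOPS, {"brand": {"Acme"}, "operating_system": {"Windows"}}
)
```

Each additional facet only helps if the content items actually carry a value for it, which is why narrowly scoped content such as laptops supports more facets than a general collection of articles.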

What the end-user user interface supports

More facets can be included if they are stacked one above the other vertically, such as in a left margin, than if they are displayed horizontally across the width of the screen. This is because horizontal scrolling is something users dislike and is avoided in content design, whereas limited vertical scrolling is acceptable.

Sometimes a website or intranet is created in a web content management system that does not give as much flexibility in taxonomy display. For example, SharePoint requires a horizontal list of facets, if the facets are to be used to filter content displayed in “columns,” where facet names are the column headers. Furthermore, SharePoint will by default create columns for document format type, content type, author, date created, and date modified. While you can hide these columns, if you want to use some of these defaults, that will limit the number of other descriptive facets for columns to about three or four.

Facets that limit search results are typically displayed in the left-margin, so more facets can be created. However, the number of facets should be limited so that all of the facet labels (although not necessarily all of their contents/facet values/terms) display by default without scrolling. The first 4-6 terms or values within a facet should be displayed to give the user a good understanding of what is in there, with a link or button to “show more.” Scrolling can be used when a facet category is expanded. So, what needs to be considered is the vertical space if all facets display at least some values, and if that does not fit, whether some facets can be collapsed by default. The example below of the facets for limiting people search results on LinkedIn shows the default display of two facets with the first 6 terms, one facet with all 5 terms, and 12 facets collapsed (an unusually high number of facets).
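
The default-display decision described above can be sketched as a small planning function. This is only an illustration in Python: the function name plan_facet_display, its parameters, and the sample facets are assumptions, and a real interface would base the number of expanded facets on the available vertical space rather than on a fixed count.

```python
def plan_facet_display(facets, preview_size=6, expanded_count=3):
    """Decide what each facet shows by default.

    facets maps a facet label to its ordered list of values. The first
    expanded_count facets show up to preview_size values, with a "show more"
    link if values remain; the rest show only their label (collapsed).
    """
    plan = []
    for position, (label, values) in enumerate(facets.items()):
        if position < expanded_count:
            shown = values[:preview_size]
            plan.append({
                "facet": label,
                "collapsed": False,
                "shown": shown,
                "show_more": len(values) > len(shown),
            })
        else:
            plan.append({
                "facet": label,
                "collapsed": True,
                "shown": [],
                "show_more": False,
            })
    return plan


# Hypothetical facets for a people search.
example_plan = plan_facet_display(
    {
        "Location": ["Boston", "Chicago", "Denver", "Houston", "Miami", "Seattle", "Tampa"],
        "Relationship": ["1st", "2nd", "3rd"],
        "Industry": ["Publishing", "Software"],
    },
    preview_size=6,
    expanded_count=2,
)
```

In this plan every facet label is still visible, so the user can see all the available facets without scrolling even when most of them start out collapsed.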

What the tagging process supports

For manual tagging, you have to consider who is doing the tagging, what their knowledge and experience is, what level of training is practical, how much time and effort can practically be devoted to tagging, and what the tagging user interface looks like. As with the end-user UI, the tagging interface also needs to display all facets and facet values in an easy-to-use manner. Usually, people who tag content for internal content management are not dedicated indexers. To simplify tagging and ensure that it is done correctly and done at all, there should not be too many facets for internal tagging (around 3, for example).

Organizations which tag/index content for subscription sale, on the other hand, where content indexing is core to their business, will invest in dedicated indexers who can be given thorough training in assigning terms from multiple facets and will also check their indexing for quality. Thus, for professional indexing, a greater number of facets can be supported.

In automated tagging, it’s not so much a matter of how many facets, but rather how distinct the facets are and how easy they are for automated tagging. There are different technologies out there, but, in general, named entities/proper nouns are easier to distinguish than topical subjects. So, facets for author, location, department, product name, etc., are easy to classify automatically. Language and a document type that is based on file format are also straightforward for auto-classification. Subject or Topic could be a catch-all for high-ranked keywords. If you want to create facets for different kinds of topics, though, such as Purpose, Activity, Significance, Origin, etc., the distinctions will likely be too challenging for an auto-classification tool.

Monday, May 6, 2013

Topics and Document Types in Taxonomies

It’s quite common in a faceted taxonomy to have a Document/Content Type facet (which I’ll call DocType here), whose terms define what a content item “is” (a report, a blogpost, a form, a contract, a letter, a policy, etc.), and also a Topic or Subject facet, whose terms describe what a content item is “about” (legal compliance, training, new business, insurance, company information, etc.). While usually it’s pretty clear-cut what belongs in the DocType facet and what belongs in the Topic facet, occasionally there are some ambiguous concepts, so asking the questions “what is it?” versus “what is it about?” helps in making the distinction.

Often the taxonomist can resolve ambiguity by editing the term so that a one-word generic document type is appended to a descriptive word. For example, “Marketing” by itself is a Topic, but “Marketing Material” is a DocType. This kind of decision is reached only after looking at the set of documents and determining whether there is a significant number of them that are really marketing materials versus a significant number of them that are really about marketing (and there could be both). You then have to decide how far to go with this. You could force otherwise topical concepts into DocTypes by adding the word “Document” to the end of many terms. For example, “Compliance” becomes “Compliance Document,” and “Client Management” becomes “Client Management Document.” Depending on your overall content set and taxonomy design, this may or may not be acceptable practice.

Another complicating issue that may come up in designing such a faceted taxonomy is what to do if certain Topics only occur in certain types of documents. This is not unusual. While DocTypes such as Report, Evaluation, Meeting Minutes, Memo, Article, Review, etc., are rather generic and could all be associated with any number of the same shared set of Topics, other DocTypes that are customized for a specific content set are more limited in their application. For example, Topics for different types of approval might be used only with a DocType of Approval Letter, or Topics for types of product information might be used only with a DocType of Product Information Sheet.

There are two ways to handle this issue: 

1. Create rules permitting certain Topics available as options only when certain DocTypes are assigned
This requires that DocType be assigned (tagged, indexed, matched, etc.) to a content item first, before the Topic is assigned. This can be seen as: the Topic is dependent on the DocType, or DocType terms drive the Topics, or the DocType takes precedence over the Topic. This is feasible with these facets, since a content item can be assigned only one DocType (in contrast with the possibility of getting assigned more than one Topic). What gets complicated, though, is if there are additional rules between other facets, with the terms in one facet driving the availability of terms in other facets, such as File Type, Source, Department, etc.

2. Merge the DocType and Topic facet into a single facet
This may seem extreme, but it could be practical, especially if it’s easier for the end-user. It works if there are not so many Topic terms (such as not many more than the total number of DocType terms), the majority of them are applicable to a single DocType term, and a user interface can be designed that supports an expandable/collapsible hierarchy, so a user clicks on a DocType and the applicable Topics underneath it display. Traditionally taxonomies are hierarchical, after all. If a Topic term is valid for more than one DocType, then a valid polyhierarchy results. There could still be a distinct facet for File Type/Format (such as HTML, text, image, PDF, etc.), for which there would be no ambiguity, in contrast to the occasional ambiguity between DocTypes and Topics.
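
As a rough illustration of the first option (rules in which the assigned DocType drives the available Topics), here is a minimal Python sketch. The rule table, the function names, and the specific approval and product Topics are all hypothetical, and it assumes that a specialized DocType may also use the shared Topics. A real content management system would have its own way of configuring such dependencies.

```python
# Hypothetical rule table: which Topics become available for each DocType.
# Term names echo the examples in this post; the rules are invented.
SHARED_TOPICS = {"Legal compliance", "Training", "New business"}

# Topics that may be used only with one specific DocType (assumed examples).
RESTRICTED_TOPICS = {
    "Approval Letter": {"Budget approval", "Hiring approval"},
    "Product Information Sheet": {"Product specifications", "Product pricing"},
}


def allowed_topics(doc_type):
    """Return the Topics a tagger may choose once a DocType is assigned."""
    return SHARED_TOPICS | RESTRICTED_TOPICS.get(doc_type, set())


def validate_tags(doc_type, topics):
    """Return the Topics that are not permitted for the assigned DocType."""
    return sorted(set(topics) - allowed_topics(doc_type))
```

For instance, validate_tags("Report", ["Training", "Budget approval"]) would flag "Budget approval," since in this made-up rule table that Topic is reserved for Approval Letters. Because the DocType has to be known before the Topics can be checked, the sketch also shows why the DocType must be assigned first.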

In either case—whether rules for the terms of one facet driving the availability of terms in another or whether a merged expandable hierarchical facet is created—collaboration is needed between the taxonomist and the technical experts who configure the implementation of taxonomy in the content/document management system.

Monday, April 22, 2013

Capitalization in Taxonomies

The question often comes up: what is the preferred style for the capitalization of taxonomy terms? Other than all proper nouns being capitalized, there is no strict rule for generic terms. In making the determination, it’s important to address the following questions. What kind of taxonomy is it? How will it be used? Who are the users, and what might they be accustomed to or expect?

The ANSI/NISO standard Z39.19-2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies states: “predominantly lowercase terms should be used for terms in controlled vocabularies” and continues: “capitals should be used only for the initial letter of proper names, trade names and those components of taxonomic names, such as genus, which are conventionally capitalized.” But remember that ANSI/NISO Z39.19 comprises guidelines and not strict requirements, so the stylistic matter of case does not have to follow ANSI/NISO Z39.19, if a house style dictates otherwise.

Note that there are three options, not just two for non-proper nouns/names, as these explanations themselves illustrate:
1.    all lower case (including the first letter of the first word)
2.    First letter of first word upper case
3.    Title Case (First Letter of the First and Main Words Capitalized)
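
If a house style is applied in bulk, the three options could be sketched in Python roughly as follows. This is only an illustration: the list of proper nouns and the list of minor words left lowercase in title case are assumptions of mine, and a real vocabulary would need its own way of flagging proper names.

```python
# Assumption: a small hand-maintained set of proper nouns that keep their
# capitals in every style; a real vocabulary would flag these per term.
PROPER_NOUNS = {"Boston", "SharePoint", "LinkedIn"}

# Assumption: minor words left lowercase in title case (house styles vary).
MINOR_WORDS = {"a", "an", "and", "as", "at", "by", "for", "in", "of", "on", "or", "the", "to"}


def _lower(word):
    """Lowercase a word unless it is a known proper noun."""
    return word if word in PROPER_NOUNS else word.lower()


def _cap(word):
    """Capitalize only the first letter, unless it is a known proper noun."""
    return word if word in PROPER_NOUNS else word[:1].upper() + word[1:].lower()


def all_lower(term):
    """Option 1: all lower case, apart from proper nouns."""
    return " ".join(_lower(w) for w in term.split())


def initial_cap(term):
    """Option 2: first letter of the first word upper case."""
    words = term.split()
    return " ".join(_cap(w) if i == 0 else _lower(w) for i, w in enumerate(words))


def title_case(term):
    """Option 3: first and main words capitalized, minor words lowercase."""
    words = term.split()
    return " ".join(
        _cap(w) if i == 0 or w.lower() not in MINOR_WORDS else _lower(w)
        for i, w in enumerate(words)
    )
```

Note that for a term such as “hotels in Boston,” options 2 and 3 give the same result, which is one way proper names and generic terms can become hard to tell apart.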

While the distinctions between “controlled vocabularies,” “thesauri,” and hierarchical or faceted “taxonomies” can be blurred, these different types do tend to have different practices for capitalization.

A “controlled vocabulary,” as the word “vocabulary” might suggest, is a list of terms (as single words or phrases), similar to what might be found in a glossary, with the possible added feature of synonyms/variants for each preferred term. Capitalization, therefore, could be expected to follow dictionary rules and thus not used except for proper names. A “synonym ring” type of controlled vocabulary, in which no terms are designated as “preferred” and none are even displayed to the user, has no need for any capitalization.

A “thesaurus” is a more complex type of controlled vocabulary with hierarchical and/or associative relationships relating various terms to each other. What are called thesauri tend to be more term-focused than hierarchically focused, and they tend to be large with many detailed terms. The terms can be quite specific, and proper nouns can be mixed in. Thesauri have traditionally been used by indexers to manually index multiple documents consistently over time. The resulting display of terms associated with content for the end-user to browse through is a type of index. Indexes (such as those at the backs of books) often follow the style of lower-case entries for non-proper names, too. If the terms are numerous and specific, they will appear to be, and will be used as, “index terms” rather than “categories.” Thus, if it’s called a thesaurus, it will more likely have terms in lower case. The choice of initial capitalization for a thesaurus, though, would not be incorrect, and is probably becoming more common, just as initial capitalization is becoming more common in main entries in back-of-the-book indexes.

A “taxonomy” implies a hierarchical classification or categorization of concepts. When we think of categories we think of labels or headings with subcategories. Headings in general tend to have initial capitalization or title capitalization. Thus, if it’s a strictly hierarchical taxonomy, where all terms are interconnected into a single hierarchy or a limited number of hierarchies, then it will more likely have initial capitalization or title capitalization. Such capitalization is particularly common on the relatively smaller/less detailed taxonomies that are proliferating on websites, intranets, and content management systems. It fits in with the web design style of capitalization on headings and categories.

In faceted taxonomies, which have become more popular in web/online taxonomies, proper names can be separated into their own facet(s), and confusion between proper names and generic terms is reduced. However, I would still recommend capitalizing only the first letter of the first word, rather than using title case, to minimize any confusion with proper names. The facet name itself, however, could be in title capitalization, since it represents a category heading and not a term for indexing. In fact, it might even be desirable to distinguish the facet labels from the values/terms within each facet by use of a different case style.

A mixed style of different capitalization at different levels is possible in hierarchical taxonomies, too. But I would recommend that only the top terms, if any, have a different capitalization style. It would not be a good idea to have only the bottom-level terms (“leaf nodes”) in a different case style, because they could change. If you decided that a leaf node should later have narrower terms added, you wouldn't want to have to worry about changing the case of the term. A good application of the mixed capitalization style is if the top-level terms were not actually to be used in indexing/tagging but are really just categories/groupings of the actual index terms, which in turn are arranged hierarchically underneath. (Other typographical methods of distinction could also be used for any non-indexable top-level categories.)

In sum, all-lower case is most appropriate for non-displayed controlled vocabularies, any controlled vocabularies or thesauri that integrate proper nouns into the same hierarchies as generic terms, and large thesauri used to support manual indexing. Initial capitalization is fine for end-user browsable hierarchical taxonomies on the web. Title capitalization is OK for facet labels or the top categories in a hierarchical taxonomy. Whichever style is chosen, however, should be applied consistently.

Tuesday, April 2, 2013

Taxonomies vs. Classification

A question had come up in one of my classes on how classification differs from taxonomies/thesauri. As part of an assignment to find thesauri on the web a student sought to find “how the Federal Government classifies its publications and was expecting to find a very elaborate Thesaurus … and instead found… the Superintendent of Documents classification system,” and so the student asked how that classification system fits into the scheme of definitions for taxonomies, controlled vocabularies, and thesauri. That I will attempt to explain here.

We are familiar with classification schemes used to catalog and locate books and other materials in libraries, such as the Dewey Decimal system or, for academic libraries, the Library of Congress Classification (letter-based call “numbers”). In addition to the U.S. federal government’s “Superintendent of Documents” classification system, many other national governments and international organizations also have their own document classification schemes, and states and provinces may have modified versions. There are also classification systems for industries, such as the NAICS (North American Industry Classification System) codes. Corporations with large volumes of documents may have their own internal document classification systems.

I sum up the differences between classification schemes and taxonomies/thesauri as follows:

Classification schemes:

  • used for books, monographs, documents, reports, contracts, or other media
  • developed for the classification of physical items for their location on shelves, drawers, or filing cabinets and physical file folders
  • based on alpha-numeric codes
  • involves assigning an item only one classification code
  • manually assigned to each item
  • classification codes may include additional information, such as date, title, author, or publishing department information within the same classification code
  • rarely gets changed (due to the pre-established numeric code hierarchy)
  • helps document managers and librarians organize documents and helps users locate pre-identified documents and materials

Taxonomy/Controlled Vocabulary/thesauri:

  • used for articles, images, electronic files, paragraphs or sections of text if separated out as digital content units
  • used primarily in online/digital space
  • based on descriptive words and phrases (terms). Codes, if any, are secondary.
  • involves assigning an item multiple taxonomy terms
  • manually or automatically (auto-tagging, auto-classification, etc.) assigned to content items
  • taxonomy terms restricted to subject information (not to include date, title, author, publishing department, etc.)
  • can easily be revised and updated
  • helps users identify which content items they want

Another way to think of the comparison:
Classification is for: where to put things/where does this document or item go.
Taxonomy is for: how to describe content/what is this text, image, or other media about.

So, while classification and taxonomy are related and are both within the realm of information science, they are really quite different. Since they serve different purposes, they can actually co-exist and both be applied to the same corpus of documents. Libraries utilize both at the same time: a classification system (the Dewey Decimal or Library of Congress Classification call numbers on books and media) and a form of a taxonomy in the catalog subject headings (usually Library of Congress Subject Headings, which are not to be confused with Library of Congress Classification).

Taxonomy and classification may each involve different people, too: catalogers for classification and taxonomists for taxonomies. While some information professionals may do both, you cannot assume that all catalogers know how to create taxonomies or that all taxonomists understand classification. There is, of course, a larger and growing need for taxonomies, in contrast to classification and cataloging systems, as more content migrates online. Furthermore, taxonomies are more adaptable to change and thus in need of continual maintenance, in comparison to the rather static classification systems. Many catalogers are taking an interest in learning about taxonomies these days.

Taxonomists who understand something about classification can also put that knowledge to use. There are many large corporations and agencies with documents organized by customized classification systems, which are now migrating over to dynamic online content/document management and taxonomies. The legacy classification systems then need to be re-formed into (or replaced by) taxonomies, and then the legacy codes need to be mapped to the new taxonomy terms to ensure the continued retrieval of legacy documents. I did this kind of work as a consulting project for a large financial institution not long ago. There were thousands of legacy alpha-numeric codes, most of which combined both a document type attribute and a subject matter attribute into a single code, a typical feature of classification codes when a document can get only one code. A taxonomy, on the other hand, may have one facet for document type and another facet for subject, and a document can be assigned multiple subject taxonomy terms in addition to the document type term.
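
To make the mapping step more concrete, here is a minimal Python sketch of a crosswalk from legacy codes to faceted terms. The code format (such as "RP-INS"), the mapping table, and the names TaggedDocument and migrate are all invented for illustration; they are not taken from the actual project, and a real crosswalk would normally be maintained in a spreadsheet or a taxonomy management tool.

```python
from dataclasses import dataclass, field

# Hypothetical legacy codes: the code format and term names are invented
# for illustration and do not come from any real classification system.
LEGACY_TO_TAXONOMY = {
    "RP-INS": ("Report", ["Insurance"]),
    "CT-CMP": ("Contract", ["Legal compliance"]),
    "FM-TRN": ("Form", ["Training", "Company information"]),
}


@dataclass
class TaggedDocument:
    legacy_code: str  # kept so that legacy documents stay retrievable
    doc_type: str     # exactly one term from the DocType facet
    topics: list = field(default_factory=list)  # any number of Topic terms


def migrate(legacy_code):
    """Translate one legacy classification code into faceted taxonomy terms."""
    if legacy_code not in LEGACY_TO_TAXONOMY:
        raise KeyError(f"No mapping defined for legacy code {legacy_code!r}")
    doc_type, topics = LEGACY_TO_TAXONOMY[legacy_code]
    return TaggedDocument(legacy_code, doc_type, list(topics))
```

The single legacy code splits into one DocType term plus a list of Topic terms, which mirrors the contrast between one code per document and multiple taxonomy terms per document. Unmapped codes raise an error here so that gaps in the crosswalk are noticed rather than silently dropped.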

As long as there are physical books, documents, and media, there is a need for classification, but if the entire content repository is digital, then taxonomies are the way to go.

Monday, March 11, 2013

Testing Taxonomies

As mentioned in my previous blogpost, “Evaluating Taxonomies,” taxonomy evaluation and taxonomy testing differ. While the evaluation of a taxonomy by a taxonomist is needed when a taxonomy is created by non-taxonomists (such as by subject-matter experts instead), testing of a taxonomy, on the other hand, is recommended in all cases, no matter who created the taxonomy. Following is an overview of the different kinds of testing that can or should be performed on a taxonomy prior to its implementation.


Card-Sorting

Card-sorting is probably the best known kind of testing, especially now that the prevalence of online card-sorting tools facilitates set-up and enables remote participation. It is not necessarily the best kind of testing for all situations, though. Card-sorting serves to test categorization schemes, so while it is suited for hierarchical taxonomies, it is not so appropriate for faceted taxonomies, especially with regard to how the facets are to interact with each other. It is possible, though, to card-sort test an individual facet, if that facet comprises an internal hierarchy of terms.

There are two kinds of card-sort tests, open and closed. In open card-sorts, the testers group concepts/topics together and then assign a broader category of their own; whereas in closed card sorts, the broad categories are already designated, and the testers merely categorize the specific concepts/topics within those pre-determined categories. Open card-sorting, if chosen, is therefore done earlier in the taxonomy design process, when broad categories are uncertain. A single taxonomy project may have either or both kinds of card-sorting depending on where the greatest need is for this additional input of information. Testers could be test end-users or they could be stakeholders, depending on the needs of the test.

Card-sorting is actually not really a kind of taxonomy testing but rather a form of taxonomy idea testing. Card-sorting is not performed on a completed taxonomy to test it but rather to test ideas of categories/hierarchies which later will be combined to create the taxonomy. Therefore, card-sorting is not an alternative to the other kinds of testing described below, which may subsequently be done.

Use Testing

Use-testing or use-case-testing is a necessary step after a draft taxonomy is built or nearly completed but before it is finally implemented, allowing for revisions to be made based on the test results. It is at this point that the taxonomy is put to the test to see if it will perform as hoped in search/retrieval and (if applicable) for manual tagging. This type of testing might also be called taxonomy validation.

A cross-section of different kinds of test users should be recruited to prepare several typical use cases and perhaps one especially challenging use case of content search scenarios. The user is then presented with the taxonomy (which can be in any format at this stage, whether on paper, as an Excel file, or as a test web page) and asked to browse the taxonomy to look for terms under which the content for the search scenario might be found. The user performs the test, browsing either in the test administrator’s physical presence or via screensharing, with verbal narration of what the user is doing and why. The test administrator takes notes regarding any problems in finding taxonomy terms for the use case. These findability problems should be considered as requirements for additional terms, additional nonpreferred (variant) terms to point to existing terms, or perhaps more polyhierarchy or associative relationships to help guide the user to find the desired concepts.

If the taxonomy is to be used for manual tagging or indexing, then a second, different set of use testing is needed, whereby users who perform this function should test the taxonomy for indexing of typical and challenging documents that they tend to deal with. Rather than coming up with use “cases”, the test-user-indexers merely need to come up with actual documents. The documents should represent a good cross-section of the various document types indexed. This exercise is even more straightforward than the user testing for finding content, so it could even be performed offline without the test administrator present, as long as the test-user-indexer takes good notes.

A-B Testing

In A-B Testing, the test-users are presented with two different possible scenarios and asked which they prefer. When comparing two different taxonomies or parts of taxonomies, only one or two variations should exist between the two that are compared to make the test clear-cut. You may set up a series of A-B test pairs to compare multiple variations. This kind of test is comparable to what an optometrist does for vision: “Which is better, A or B?”  Since only one or two differences should be compared and tested at a time, A-B testing is most suitable to compare proposed top-level categories, rather than getting into the depths of a taxonomy, where it is not practical to conduct a detailed term-by-term comparison. Thus, A-B testing focuses on high-level structural design, navigation and browsing, and not the effectiveness of finding and retrieving content.

A-B Testing can be done at any time in the taxonomy design and build process. It is also very useful when considering a taxonomy redesign for comparing the existing taxonomy (A) to a proposed change (B). A-B Testing is usually done by presenting the test users with graphical or interactive web page mock-ups. I’ve created a B image from an existing online A image by taking a screenshot of A and then editing it in Microsoft Paint. Although each individual A-B test is simple, what to compare and how many comparison tests to run need to be decided carefully, since each test takes time and resources.


Taxonomies should be tested, but it’s not true that any test is good. Different tests are for different purposes and fit into different stages of the taxonomy process. An inappropriate test or inappropriately timed test can be a waste of time and money.

Monday, February 25, 2013

Evaluating Taxonomies

In my last blog post, “Taxonomy Management Consulting,” I mentioned that more organizations now have taxonomies, so the need is shifting somewhat from designing and building new taxonomies to managing existing taxonomies. It might not be that simple, however, if the existing taxonomy was created and never used, created for a slightly different purpose or different content, or created by those not sufficiently knowledgeable in taxonomy design best practices.  I often find that an organization that has taxonomy consulting needs typically has some pre-existing taxonomies, but they are not adequate for one reason or another.

Any pre-existing taxonomies are important as part of a taxonomy development or redesign process and should be carefully considered. Whether pre-existing taxonomies will be only a source of terms for a new taxonomy or actually the basis of a new taxonomy with some editing depends on how structured, comprehensive, and sound these pre-existing taxonomies are.
  • Structure: Pre-existing taxonomies may be of the type that is a simple flat list of terms with no hierarchy.  These are good sources of taxonomy terms but are rarely the basis for the taxonomy.
  • Comprehensiveness: Often existing taxonomies cover only part of the scope of a desired full or enterprise-wide taxonomy,  in which case they will serve as part of the new taxonomy. 
  • Soundness: This concerns to what extent the taxonomy conforms to standards (such as ANSI/NISO Z39.19) and general best practices, so that it ought to work well with the content it is intended to reference. This is where taxonomy experts can come in and make such determinations.

Evaluation Criteria 

Evaluating a taxonomy for soundness typically involves checking off or rating the taxonomy against a set of pre-defined criteria regarding terms, inter-term relationships, and overall structure and design. Some of the most important criteria include the following:
  • Terms should be unambiguous and clear, yet not too wordy and long. If the taxonomy will be displayed for browsing, then terms should begin with key words and those that come under the same broader term should be in a somewhat consistent grammatical format. 
  • Hierarchical relationships should conform to the ANSI/NISO Z39.19 standard, with each relationship being only one of the three types (generic-specific, instance, or whole-part), with perhaps limited exceptions in a corporate taxonomy that are intuitively logical and justified. (See my blog post “Deviating from Taxonomy Standards”).
  • Overall structure and design issues include the number of narrower terms for a broader term being neither too few nor too many (such as 3-20), and the depth of the taxonomy being somewhat balanced and not too deep. For example, three levels deep in some places and four levels deep in others is OK, but two levels in some areas and five levels deep in others is not a well-balanced design.
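
The structural criterion in the last bullet lends itself to a simple automated check. The following Python sketch is only one possible approach: it assumes the taxonomy is available as a mapping of each broader term to its narrower terms, with no circular relationships. The function name, the default thresholds, and the allowed depth difference of one level are my assumptions based on the examples above, not fixed rules.

```python
def check_structure(children, top_terms, min_narrower=3, max_narrower=20, max_spread=1):
    """Return a list of warnings about breadth and depth balance.

    children maps each broader term to a list of its narrower terms;
    terms that do not appear as keys are treated as leaf terms.
    """
    warnings = []

    # Breadth: flag broader terms with too few or too many narrower terms.
    for term, narrower in children.items():
        if narrower and not (min_narrower <= len(narrower) <= max_narrower):
            warnings.append(f"{term}: {len(narrower)} narrower terms")

    # Depth: record the level of every leaf term, counting top terms as level 1.
    leaf_depths = []

    def walk(term, depth):
        narrower = children.get(term, [])
        if not narrower:
            leaf_depths.append(depth)
        for child in narrower:
            walk(child, depth + 1)

    for top in top_terms:
        walk(top, 1)

    # Balance: flag the taxonomy if leaf levels differ by more than max_spread.
    if leaf_depths and max(leaf_depths) - min(leaf_depths) > max_spread:
        warnings.append(
            f"depth ranges from {min(leaf_depths)} to {max(leaf_depths)} levels"
        )
    return warnings
```

A check like this only flags candidates for review. Whether a broader term with two narrower terms is actually a problem is still a judgment call for the evaluator.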

Evaluation vs. Testing

Evaluating a taxonomy is not the same as testing a taxonomy. Testing a taxonomy involves using sample content and sample users in a controlled manner and can take considerable time and effort, so it should not be done until after a taxonomy is determined to be generally sound. Evaluating a taxonomy, on the other hand, is to determine if it’s well constructed regardless of the content or users. Testing focuses on the specific application and use of the taxonomy and will be the topic of a future blogpost.

Taxonomy vs. Web Usability Heuristic Evaluation

Even if a numeric rating scale is used, the process is still more judgmental than scientific, and as such may be referred to as a “heuristic” analysis or evaluation. A “heuristic method” generally means evaluation, experimentation, or a trial-and-error method to find something out. The designation of heuristic evaluation has been used in website usability evaluation and from there has been carried over into taxonomy evaluation. User experience expert Jakob Nielsen first introduced the idea of heuristic evaluation to usability design back in 1990, described in his blogpost of 1995: “How to Conduct a Heuristic Evaluation.”

There are several differences, though, between taxonomy evaluation and web user interface evaluation. Although user testing of websites is not that much different from the testing of taxonomies, evaluation of taxonomies requires a more critical and analytical understanding and approach. Website usability evaluation does not require usability design experts, but taxonomy evaluation does require a level of expertise. Nielsen refers to “evaluators”, not experts, who are not much different from user testers. (Rather, the procedures in usability evaluating and testing differ.)

Another difference between website evaluation and taxonomy evaluation is that a website, even if a test dummy site, will have content, even if just mock-up pages with partial filler text, because navigation and content are integrally combined on websites. When a taxonomy, on the other hand, is at the evaluation stage, it is not implemented/linked to content, which makes it more difficult for the non-expert to evaluate. It might appear to look good on paper but not function well when implemented.

Nielsen wrote: “Heuristic evaluation involves having a small set of evaluators examine the interface and judge its compliance with recognized usability principles.” If the evaluators are not experts, then it’s easier and more affordable to have multiple evaluators. When a taxonomy requires evaluation, typically just one taxonomy expert is hired, but if you can afford two separate independent expert evaluations of your taxonomy, that’s all the better.

Tuesday, January 22, 2013

Taxonomy Management Consulting

I recently wrote an article on taxonomy management for the online magazine FreePint. By “taxonomy management” I mean taxonomy maintenance, governance, and long-term planning.  I’m not going to repeat that article here, because you can look it up. The short version is available without a subscription: “The Care and Feeding of Taxonomies: Taxonomy Management.” In summary, in the long version I discussed:
  • The reasons for managing a taxonomy
  • The distinction between taxonomy development and taxonomy management
  • The parts of an organization responsible for taxonomy management
  • Factors in selecting a taxonomy management software system
  • Components of taxonomy editorial policies
  • Components of taxonomy maintenance procedures and a governance plan

Writing this article got me thinking about the role of taxonomy consulting in taxonomy management. As more and more taxonomies get created, over time, the need for new taxonomy creation may diminish, while the need for better taxonomy management increases. This should be good news for those in the taxonomist profession, especially for those who serve as in-house staff taxonomists. As for those of us who are taxonomy consultants, there is still a role, just a slightly different one.

The design and creation of a new customized taxonomy is an appropriate task for an external consultant because it:
  • is a limited-term project that needs extra assistance while existing staff probably lacks the time
  • requires a specialized skill that perhaps no one on staff has
  • can benefit from an external point of view that is not biased, but can appreciate the perspectives of various users.
The ongoing maintenance of a taxonomy, on the other hand, is best suited for an internal staff taxonomist or information specialist, who:
  • can be immediately responsive to changing needs or circumstances
  • is familiar with the subject matter of the organization when it comes to additions or changes of highly specific taxonomy terms
  • can devote at least a little time each day or week as needed, but the time can be flexible.
Consultants still have a role in maintenance. They can study the issues and write the taxonomy editorial policies, indexing policies, maintenance plans, governance plans, etc. In fact, this is where taxonomy consultants really serve as consultants, and not merely as taxonomy designers and initial developers. Sometimes I think the designation of “consultant” is a bit of a stretch for someone spending most of their time actually building taxonomies. On all of my taxonomy projects, however, I also provide advice and suggestions, so I do some consulting all along. Taxonomy management, though, relies more heavily on actual consulting services.

Even though the needs of many organizations are shifting from taxonomy design and creation to taxonomy maintenance and revision, there exists a lot more information (books, articles, workshops, presentations, etc.) on taxonomy design than on taxonomy management. The relative lack of written sources on taxonomy management is another reason why a taxonomy consultant can be especially helpful. Finally, like taxonomy design, taxonomy management plans and procedures also need to be tailored to the circumstances of a specific organization.

When I create a taxonomy, I take a personal interest in it and hope that it will have a long useful life, so I want to create taxonomy guidelines and maintenance plans that are part of taxonomy management to ensure that my good work is kept up to date. If I don’t create an organization’s taxonomy, I am still just as interested in providing guidance for improving and maintaining what taxonomy exists, because my ultimate interest in taxonomies is seeing them get used and being useful.