Sunday, May 13, 2018

Creating Subject Terms for a Faceted Taxonomy

Faceted taxonomies—those that allow users to limit or filter search results by selecting terms or attributes from each of several types/aspects—are becoming increasingly common. They are easy and effective for end-users with various abilities in searching. When it comes to designing facets, some of the facets and their terms for a content collection may be obvious: Document or Content Type, Location, Audience, Purpose, etc. Creating a facet for Subject, for tagging topics the content is about, however, can be quite daunting.

Some faceted taxonomies do not have a Subject facet. Product taxonomies, such as for ecommerce, don’t have Subjects, but rather product categories. Enterprise taxonomies, such as those used in enterprise content or document management systems, also typically don’t have a Subject facet, but rather they have detailed terms in the Document Type facet and may have facets for Business Activity/Function, Department, Line of Business, or even something for Life Cycle/Phase/Stage.

The Subject facet is important, and can be quite large, in taxonomies for tagging and retrieving content in a collection, library, or repository of published articles, research studies or reports, manuals, presentations, speeches, educational/training materials, images, videos, etc. If a large number of terms are needed to adequately cover the breadth and depth of the content, the Subject facet may comprise its own internal hierarchical taxonomy or thesaurus.

Coming up with the numerous Subject terms is more work and may require a different approach than for the terms of the other facets, which may be based on user needs and expectations. The terms in the Subject facet need to be based primarily on the subject of the content items being tagged. Other techniques for developing taxonomy terms, such as stakeholder interviews and search query logs, are helpful for other facets, but not so much for Subjects.

A taxonomy is built in a combination of top-down (identifying facets and top terms) and bottom-up (identifying the individual terms needed for indexing) tasks. The Subjects are developed a little bit top-down but more bottom-up. The top-down approach for Subjects starts with identifying the subject domain and scope and then any primary divisions in that domain, based on familiarity with the subject area and the content collection. The bottom-up approach involves looking at numerous individual content items to determine the main topics they are about and developing terms for these topics.

Determining what content items are about and what terms describe them is the activity of descriptive indexing. I prefer to use the word indexing rather than tagging here, especially in the absence of a taxonomy/controlled vocabulary, which has yet to be created, because it is an analytical task. (See my earlier blog post “Tagging vs. Indexing.”) So, at this stage it may help to have someone who has experience as an indexer do test-indexing of a rather large, representative sample of the content. Guidelines should be established at the start. For example, each document is to be assigned index terms for the document as a whole, not for each section, and a document should be assigned no more than three Subject index terms.

The terms will need to be individually reviewed so that similar terms can be considered for merging into a single concept (and alternative labels/synonyms might be created). To keep the number of terms more manageable for review, it’s best to review and edit the terms periodically, before completing the test-indexing of the entire sample set of content. Thus, developing the taxonomy of Subjects by means of test-indexing is an iterative process. You will probably see trends, patterns, and possibly subcategories emerge from the terms as you collect them. The initial terms that come out of test-indexing can be quite specific and then made broader later. It’s easy to edit specific terms into broader terms, while it is not possible to go the other direction without reviewing the content again. In some cases, you can identify the key terms from the title of a document or the caption/description of an image, but often for text you need to read headings/subheadings and skim the text, and for an image you need to look at it.
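The periodic review described above can be supported with a small script. The following is only a minimal sketch in Python, assuming the test-indexing terms have been collected in a simple list. The sample terms, the merge_map of editorial decisions, and the function names are all hypothetical and serve only as illustration. It tallies how often each candidate concept was assigned and records the merged variants as alternative labels.

```python
from collections import Counter

# Hypothetical raw terms jotted down during test-indexing (illustrative data only).
raw_terms = [
    "Solar power", "solar energy", "Solar energy",
    "Wind power", "Photovoltaics", "wind  power",
]

# Hypothetical editorial decisions made during review:
# normalized variant -> preferred label of the merged concept.
merge_map = {"solar power": "Solar energy"}


def normalize(term):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def consolidate(terms, merge_map):
    """Return (usage counts per preferred label, alternative labels per preferred label)."""
    counts = Counter()
    alt_labels = {}
    for term in terms:
        key = normalize(term)
        if key in merge_map:
            preferred = merge_map[key]
            # Keep the merged variant as an alternative label (synonym).
            alt_labels.setdefault(preferred, set()).add(key.capitalize())
        else:
            preferred = key.capitalize()
        counts[preferred] += 1
    return counts, alt_labels


counts, alt_labels = consolidate(raw_terms, merge_map)
for label, n in counts.most_common():
    print(label, n, sorted(alt_labels.get(label, [])))
```

The script does not decide which terms to merge. That remains the indexer's judgment, recorded here in merge_map and extended at each review round.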

Ideally, this test-indexing can be saved, so when actual indexing is done with the final taxonomy, the indexing work does not have to be repeated. But often this is not possible. So, test-indexing should not be too thorough or laborious. Before I became a taxonomist, I was an indexer, so I am quite efficient at this task, and I enjoy it.

Keep in mind that a taxonomy should continue to get updated even after it is implemented. This is especially the case for the Subjects, as new content will introduce new topics not yet included within the Subject terms. Thus, the test-indexing need not be completely comprehensive. It is understood that more Subject terms will be added as needed later. What is important is that new terms are added only in accordance with established policy.

Monday, April 30, 2018

Related Terms in Taxonomies and Thesauri

One of the benefits of a taxonomy is that there are relationships between the terms to support navigating to find the most suitable term. This could be the multi-level hierarchical browsing from broader terms down to more specific narrower terms. In a strict definition of a taxonomy, the only required type of relationship between terms is hierarchical. But sometimes it may be desirable to have relationships other than hierarchical.

ISO and ANSI/NISO standards for controlled vocabularies are very explicit on the criteria for hierarchical relationships between terms. The narrower term must be a specific type of, an instance of, or an integral part of its broader term, and it must be so in all circumstances, not just sometimes. If a pair of terms seem as if they should be related, but their relationship does not meet the criteria for a hierarchical relationship, then they could have an associative relationship instead, which is also known as “related term” or abbreviated as RT. Examples include Schools as related to School buses, Computers related to Operating systems, and Business development related to Marketing.

Those familiar with controlled vocabularies know that the RT relationship is a standard feature of a thesaurus, whereas it is not common in taxonomies. However, the distinction between a taxonomy and a thesaurus can be blurred, and the kind of controlled vocabulary that is designed and implemented can have features of both, serving the needs of the users and the nature of the content and indexing. Therefore, taxonomists should have a good understanding of the associative relationship, in case the need arises and the system supports it.

The creation of an RT relationship is more subjective than the creation of hierarchical relationships. Even the standard ANSI/NISO Z39.19-2005 (R2010), in section 8.4 Associative Relationships, admits "The associative relationship is the most difficult to define." The standard states that the relationship is created "on the grounds that it may suggest additional terms for use in indexing or retrieval." The fact that the relationship may be created but is not required in many circumstances leaves it up to the best judgment of the taxonomist.

The only required circumstance for the RT relationship (in a thesaurus or taxonomy that has associative relationships) is between two terms that share the same broader term and have some overlapping meaning. Examples include Boats and Ships (sharing the broader term Vessels), or Tablets and eReaders (sharing the broader term Mobile devices), or Coats and Jackets (sharing the broader term Outerwear). But there would not be a related-term relationship between Coats and Mittens, even though they are both narrower terms of Outerwear, because they don’t have overlapping meaning.
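The sibling rule above can be expressed as a simple check. This is only a sketch in Python under stated assumptions: the broader-term map reuses some of the examples above, while the set of "overlapping" pairs stands in for the taxonomist's own judgment, which no script can supply. The variable and function names are hypothetical.

```python
from itertools import combinations

# Broader-term assignments taken from the examples above: term -> broader term.
broader = {
    "Boats": "Vessels",
    "Ships": "Vessels",
    "Coats": "Outerwear",
    "Jackets": "Outerwear",
    "Mittens": "Outerwear",
}

# Assumption: pairs the taxonomist has judged to overlap in meaning.
overlapping = {
    frozenset({"Boats", "Ships"}),
    frozenset({"Coats", "Jackets"}),
}


def required_rt_pairs(broader, overlapping):
    """List sibling pairs (same broader term) that were judged to overlap in meaning."""
    pairs = []
    for a, b in combinations(sorted(broader), 2):
        if broader[a] == broader[b] and frozenset({a, b}) in overlapping:
            pairs.append((a, b))
    return pairs


rt_pairs = required_rt_pairs(broader, overlapping)
for a, b in rt_pairs:
    print(f"{a} RT {b}")
```

Coats and Mittens share a broader term but are not in the overlapping set, so no RT is proposed for them.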

More often, however, the RT relationship is created between terms with different broader terms or even from different hierarchies entirely. The ANSI/NISO Z39.19 standard provides guidance on this only to the extent of when the relationship may be created. Whether creating the relationship is a good idea or not is up to the taxonomist’s discretion and ultimately depends on what the taxonomist thinks would be helpful to most users.

The following are examples of possible RT relationships between pairs of terms:
Baseball RT Baseball players
Ventilation RT Fans
Bacterial infections RT Antibiotics
Appliance repair RT Appliances
Plastics RT Elasticity
Timber RT Wood products
Psychology RT Psychologists
Literature RT Books

There are different users of a taxonomy/thesaurus with different goals. Depending on what they are looking for, some users would welcome the guidance of a certain RT relationship, pointing them to a related term they had not considered but find helpful (such as E-commerce RT Online shopping); whereas other users are not interested in the related term and may find extra RT relationships getting in their way. The goal is to provide RT relationships that are probably relevant and helpful to the majority of users or “the average user,” while realistically not expecting to serve all users.

The following are examples of RT relationships that are not likely to be helpful to most users and thus are better not created:
Newspapers RT Advertising
Germany RT World War II
Athletics RT Doping

A good way to become more familiar with best practices for creating RT relationships is to browse published thesauri with these relationships. Thesauri that are available publicly to browse on the web include the ERIC (Educational Resources Information Center) Thesaurus, MeSH (Medical Subject Headings), the UNESCO Thesaurus, and the NASA Thesaurus. These and others can be found on the American Society for Indexing’s web page Online Thesauri and Authority Files. Note that related terms might be indicated as RT, Related concepts, or See also links.

Creating associative relationships is part of the creative work of taxonomists, but taxonomists must remember that serving the user is the primary goal.

Thursday, March 15, 2018

Subject Searching: Why a Taxonomy, Thesaurus, or Controlled Vocabulary Still Helps in the Age of Search

Subjects, topics, index terms, keywords, controlled vocabulary, thesaurus, taxonomy. These all refer to an organized, precise way to find and retrieve desired information, where that information has been indexed to terms. Indexing content with subject terms can be manual or automated, but in either case the focus is on what the content is about, not what words appear in the text. The subject terms represent unambiguous concepts, which may have synonyms, but synonyms are often included as cross-references to redirect to the preferred term name and thus to the same set of content. Before the era of digital content, subject categories or index terms were the only method to find specific information, such as in a back-of-the-book index or business categories in the yellow pages.

Using subject terms to find desired content contrasts with using a search engine for full text search. Search is based on the occurrence of words, not concepts, so appropriate results can be missed if they use different wording for the same concept, and inappropriate results can be retrieved if a word has multiple meanings. The accuracy of search, without the additional support of index terms/subjects, is dependent instead on the sophistication of algorithms. The combinations of algorithms have improved only slightly in the past decade or two. What has made a bigger difference in retrieving good results through search (without subject indexing) is that in many cases the volume of content has grown, and when search results are arranged by relevancy, a larger number of initially displayed search results are satisfactory.

There are two issues with this kind of search. First, ordering results by relevancy is not always the preferred option. Sometimes searchers are interested in timely stories, so they want their results to be ordered by date, newest first, but when relying on a search engine, newer results might not all be sufficiently relevant. Secondly, such results are good for the searcher who only wants to get some or enough information on a topic. If, instead, the searcher wants to perform an exhaustive search and retrieve everything available on a topic, there will likely be relevant content that is missed in the search retrieval because it was worded differently. Indexing with subject terms improves both precision (accuracy, where incorrect content is not retrieved) and recall (comprehensiveness, where appropriate content is not missed).
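Precision and recall can be made concrete with a small calculation. The sketch below, in Python, uses the usual definitions: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. The document IDs and the two result sets are invented for illustration and do not come from any real system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs that are truly about the topic.
relevant_docs = {1, 2, 3, 4}

# Hypothetical keyword search: finds two relevant documents, misses two that
# were worded differently, and returns two that use the word in another sense.
keyword_results = {1, 2, 9, 10}

# Hypothetical subject-term retrieval: returns exactly the documents indexed to the concept.
subject_results = {1, 2, 3, 4}

print(precision_recall(keyword_results, relevant_docs))
print(precision_recall(subject_results, relevant_docs))
```

In this invented case the keyword search scores 0.5 on both measures, while the subject-term retrieval scores 1.0 on both. Real indexing is rarely perfect, so the numbers only illustrate the direction of the difference.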

The role that index terms play in the search process has evolved. Originally, researchers started with browsing a full list of subjects that may have been arranged alphabetically (as a traditional book-style index) or hierarchically (as a taxonomy), and they navigated the index to find more specific subdivisions as aspects of the main heading, or they navigated the taxonomy to drill down to the most specific term. As the volume of indexed documents or other content items has grown over the years, browsing and selecting a term from a taxonomy or thesaurus is often no longer as practical or sufficient. An individual term may have too many records indexed to it. Furthermore, many taxonomies and thesauri have grown too large to easily browse.

So, instead of taxonomy terms being used as the primary starting point to find desired content, taxonomy terms are more often being used to narrow or filter search results.  The user executes a search in the search engine, and if they get too many results, they can limit or filter the results by various aspects listed in the margin, including by indexed subject. (Other aspects could be date, document type, author, source etc.) The subjects can display in order of frequency of occurrence on the records in the search result set, and the user can select among them, rather than having to browse the entire taxonomy or thesaurus.
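The behavior described above, listing subjects by how often they occur in the result set and then limiting by a selected subject, can be sketched in a few lines of Python. The records, field names, and function names here are hypothetical, and a real search platform would compute these counts itself.

```python
from collections import Counter

# Hypothetical search results, each already indexed with subject terms.
results = [
    {"title": "Rooftop panels explained", "subjects": ["Solar energy", "Energy policy"]},
    {"title": "Grid storage basics", "subjects": ["Solar energy"]},
    {"title": "Offshore turbines", "subjects": ["Wind power", "Energy policy"]},
    {"title": "Panel recycling", "subjects": ["Solar energy"]},
]


def subject_facet(records):
    """Count subject terms across the result set, most frequent first."""
    counts = Counter()
    for record in records:
        counts.update(record["subjects"])
    return counts.most_common()


def limit_by_subject(records, subject):
    """Keep only the records indexed with the selected subject."""
    return [r for r in records if subject in r["subjects"]]


facet = subject_facet(results)
print(facet)

narrowed = limit_by_subject(results, "Energy policy")
print([r["title"] for r in narrowed])
```

Only the subjects that actually occur in the result set are offered, which is why the user does not have to browse the entire taxonomy or thesaurus.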

Use of subjects and other attributes to limit search results is becoming very common across various implementations, such as enterprise search systems to find internal corporate documents, ecommerce websites for selecting products, and library databases for selecting research articles, so most people are familiar with using them.

The use of subjects to limit search results is similar to a faceted taxonomy, although the designation “faceted taxonomy” typically refers to a taxonomy where different types of terms are grouped into multiple facets. In other words, a faceted taxonomy involves several facets or filters, whereas a traditional taxonomy or thesaurus may comprise a single facet or filter, which may be used in combination with other, non-taxonomy filters.

I will be exploring and demonstrating this topic, specifically in the case of library subscription databases, in a presentation “Customer Focused Thesauri,” in addition to a pre-conference workshop on taxonomy creation, at the Computers in Libraries conference in Arlington, VA, in April.

Sunday, February 25, 2018

Taxonomies for Filtering and Sorting

Taxonomies are versatile and may be used for various purposes. Originally designed to support hierarchical browsing of topics linked to content, they also may be implemented to support more accuracy in searching. Most discussions of taxonomies have focused on browse and/or search, but taxonomies may function in additional ways: enhancing filtering and sorting.


Taxonomies structured into facets serve a combination of search and browse, and thus serve what is often called “faceted search” or “faceted browse” (as described in a previous blog post, Faceted Search vs. Faceted Browse). However, it’s simpler, more accurate, and more helpful in understanding facets to consider a faceted taxonomy as serving a distinct role from browsing or searching, that of filtering.

Filtering is a common function which is not limited to use with taxonomies. Filtering can be done by non-taxonomy attributes, such as keyword, author, date, etc., if these are set up as metadata and implemented as filters. We see filters in situations that lack taxonomies, such as the options in an email inbox or lists of documents in file management user interfaces. We also find filters on documents in SharePoint libraries and other content/document management systems, which may include taxonomies. In a user interface, the icon for filtering is a funnel (where you can imagine that it is lined with a cone-shaped filter paper).

Filters may be known by other names, such as “Refinements”/“Refine by” or “Limit by.” These designations may be used interchangeably, although they tend to be used in different circumstances. “Filtering” may be done on search results or on a complete set of records, such as a list in a spreadsheet or table. “Refining” or “limiting,” on the other hand, would usually be performed only on the results of an executed search, as a further refinement or limiting factor on an initial set of search results which turned out too large. “Refining,” furthermore, suggests a more careful search process, so this name is more often used in research databases or other repositories of articles and resources.

A relatively small faceted taxonomy, comprising short lists of terms for each facet/filter that the user can select from a displayed list or drop-down, is easy to use and, with proper tagging, can achieve accurate retrieval results.


Sorting is done on content in a spreadsheet or table, where data on content items is in different columns, with sorting done by an attribute of an individually selected column. Sorting could be by numeric order, by alphabetic order, by date, or by the mere presence of a binary value.  Indeed, most sorting does not involve taxonomies, but it can.  If a column is for “Topic” and items have been tagged with taxonomy terms, then the items can be sorted by taxonomy term topic.
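To make the difference between the two functions concrete, here is a minimal sketch in Python of sorting and filtering the same small table by taxonomy-tagged columns. The rows, column names, and helper functions are hypothetical and stand in for whatever a spreadsheet or content management system provides.

```python
# Hypothetical content items in a table, tagged with taxonomy terms in the
# "topic" and "doc_type" columns.
rows = [
    {"title": "Q3 report", "topic": "Finance", "doc_type": "Report"},
    {"title": "Onboarding guide", "topic": "Human resources", "doc_type": "Manual"},
    {"title": "Budget memo", "topic": "Finance", "doc_type": "Memo"},
    {"title": "Brand guidelines", "topic": "Marketing", "doc_type": "Manual"},
]


def sort_by(rows, column):
    """Sort all rows alphabetically by one column, using the title to break ties."""
    return sorted(rows, key=lambda r: (r[column].casefold(), r["title"].casefold()))


def filter_by(rows, **selected):
    """Keep only rows whose columns match every selected term, e.g. doc_type='Manual'."""
    return [r for r in rows if all(r[col] == term for col, term in selected.items())]


by_topic = sort_by(rows, "topic")
print([(r["topic"], r["title"]) for r in by_topic])

manuals = filter_by(rows, doc_type="Manual")
print([r["title"] for r in manuals])
```

Sorting keeps every item and groups items with the same term together, whereas filtering removes the items that do not match the selected term.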

The function of sorting with taxonomy terms may not be quite as common as filtering, since it is not done on search results but only on data in a table or spreadsheet. However, in many situations content items are presented this way: content in spreadsheets and databases, messages in an email inbox, content items in SharePoint libraries and various kinds of content management systems, and many other applications. Furthermore, it’s often simpler to sort than to filter.

Sorting is nevertheless a function associated with taxonomies, at least in the definition of taxonomy in the ISO standard for controlled vocabularies. ISO 25964-2:2013 in section 3.83 defines taxonomy as a “scheme of categories (3.5) and subcategories that can be used to sort and otherwise organize items of knowledge or information.”

The following screenshot from SharePoint shows the availability of both sorting (by clicking on the down-arrow next to a column name, such as Topic), and filtering (by selecting Topics in the filter pane on the right).


As taxonomies are versatile, the same taxonomy can be used for multiple purposes: browsing, searching, filtering, and sorting. However, a taxonomy design usually can be optimized for the user experience of only one or two implementations. So, a taxonomy that delivers a great user experience for hierarchical browsing might not be best suited for filtering or sorting, or vice versa. Filtering is more accommodating for various taxonomies than is sorting, since it may not be necessary to display the entire taxonomy in filters.