Guest Post by Richard Joltes, Software Developer, Content Discovery and Management, IBM Enterprise Content Management
As any I.T. veteran knows, management of unstructured data has
become increasingly difficult over the years. Web pages, PDF files, Office
documents, and email messages can (and do) accumulate within file systems and
other repositories at an alarming rate, consuming storage and other resources. Some
organizations have adopted a ‘save everything’ model that has resulted in huge file
shares or email archives that likely contain only a small percentage of usable
data. Finding files or messages in these archives can be nearly impossible,
especially in situations where unmanaged repositories or departmental file
shares are involved. Additionally, this storage model can result in legal
headaches if a lawsuit or other action results in a demand to produce all
documents related to a given case. Searching vast archives of potentially
relevant materials can consume significant resources over a long period of time.
Automated content classification can help mitigate such
problems, but groundwork and planning, as well as a solid understanding of the
content to be classified and how it can be logically divided into various
categories, are needed in order to ensure success. As a good starting point for
this process, consider the following questions.
What’s my taxonomy?
You can’t categorize documents, or anything else for that
matter, without a coherent list of known categories and criteria that
distinguish one from the others. This list, along with the characteristics of
each element, is known as a taxonomy,
and most people make use of taxonomies in everyday life without even knowing it. We
instinctively know the difference between a laptop and desktop computer, and
most people can articulate what those differences are with relative ease.
The same is true when documents are involved. What’s the
difference, for instance, between an “Accounting and Finance” document and one
from “Engineering”? Are there key phrases, terms, and intents that could help
an employee distinguish one from the other with a reasonable level of
confidence? If the answer is yes, then it is likely that software such as IBM
Content Classification™ will be
able to distinguish one from the other once it has been trained to recognize each category.
Certain categories may be more problematic: “Legal” and
“Regulatory” may involve significant overlap of intent and language, for
instance. The rule of thumb is simple. If a human can’t classify documents into
selected categories with a high level of certainty, then a computer won’t be
able to either. It’s as simple as that.
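One way to put this rule of thumb to a quick test is to have two reviewers label the same sample of documents and measure how often they agree. The sketch below computes Cohen's kappa, a common agreement statistic that corrects for chance. It is offered only as an illustration: the function name and the sample labels are assumptions, and nothing here is part of IBM Content Classification.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers who labelled the same documents.

    1.0 means perfect agreement; values near 0 mean the reviewers agree
    no more often than chance would predict.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of documents on which both reviewers chose the same category.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each reviewer's category frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both reviewers used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two reviewers for the same four documents.
reviewer_a = ["Legal", "Legal", "Regulatory", "Regulatory"]
reviewer_b = ["Legal", "Regulatory", "Regulatory", "Legal"]
kappa = cohen_kappa(reviewer_a, reviewer_b)
```

A low score on a pair of categories such as "Legal" and "Regulatory" suggests that the category definitions need work before any software is trained on them.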
Do I understand my content?
Generally, creating a taxonomy only works if you understand
the content you intend to classify. A review of the content to be classified
(not just document titles, but some amount of actual content, along with associated
metadata) should be conducted as part of the taxonomy creation process.
If multiple content sources with multiple types of documents
and intents are to be classified, then a sample from each must be reviewed in
order to determine how its specific content might affect the outcome of the
classification process. There may also be cases where certain file types, such
as image-format PDFs or encrypted data, can’t be read successfully by
text-oriented classification software. Document language must also be taken
into account, since automated classification software must be trained on a
It’s also necessary to consult appropriate internal authorities,
such as legal advisors and regulatory affairs personnel, in order to determine
how long various document types must be retained. While questions such as these
are more directly related to retention and file policies, they’re also relevant
to automated document classification. Certain document types may contain specific
terms and phrases that the software can be configured to search for, resulting
in higher confidence levels when performing classification tasks.
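To make the idea of searching for specific terms and phrases concrete, here is a minimal sketch that counts category-specific phrases in a document and reports each category's share of the matches. The phrase lists, category names, and function name are all hypothetical; a real list would come out of the content review described above, and this is not how IBM Content Classification itself computes confidence.

```python
import re

# Hypothetical phrase lists, for illustration only.
CATEGORY_PHRASES = {
    "Accounting and Finance": ["accounts payable", "general ledger", "invoice"],
    "Engineering": ["design review", "tolerance", "test plan"],
}


def phrase_scores(text, category_phrases=CATEGORY_PHRASES):
    """Return each category's share of all phrase matches found in text."""
    lowered = text.lower()
    hits = {}
    for category, phrases in category_phrases.items():
        # Count whole-word, case-insensitive occurrences of every phrase.
        hits[category] = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
            for phrase in phrases
        )
    total = sum(hits.values())
    if total == 0:
        # No known phrase appears, so there is no evidence for any category.
        return {category: 0.0 for category in hits}
    return {category: count / total for category, count in hits.items()}


sample = "Please post the invoice to the general ledger before the design review."
scores = phrase_scores(sample)
best_category = max(scores, key=scores.get)
```

A share like this is only a rough signal. In practice it would supplement, not replace, whatever the trained classifier reports.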
What’s the goal?
This question must obviously be asked before undertaking any
I.T.-related project, since the cost and effort must be justified by a
measurable return on investment. The business case for automated content
classification depends on the industry, current practice, and the desired
outcome. Do you need to consolidate content sources as the result of an
acquisition or merger? Are regulatory needs driving the requirement for
efficient, legally defensible document management practices? Is your email
server laboring under the burden of 10 years’ worth of potentially useless messages?
Done correctly, an automated classification project can offer
a solid ROI in a fairly short period of time. Lower storage and infrastructure
costs, easier access to relevant data, and less exposure to litigation-related
issues are obvious benefits that can justify the time and expense involved. Tasks
such as taxonomy creation and an initial document review should generally be
performed before implementation begins, if at all possible. Doing so will help
ensure success while preserving schedules and keeping
implementation costs to a minimum.