As any I.T. veteran knows, management of unstructured data has become increasingly difficult over the years. Web pages, PDF files, Office documents, and email messages can (and do) accumulate within file systems and other repositories at an alarming rate, consuming storage and other resources. Some organizations have adopted a ‘save everything’ model, resulting in huge file shares or email archives that likely contain only a small percentage of usable data. Finding files or messages in these archives can be nearly impossible, especially where unmanaged repositories or departmental file shares are involved. Additionally, this storage model can result in legal headaches if a lawsuit or other action leads to a demand to produce all documents related to a given case. Searching vast archives of potentially relevant materials can consume significant resources over a long period of time.
Automated content classification can help mitigate such problems, but success requires groundwork and planning, as well as a solid understanding of the content to be classified and how it can be logically divided into categories. As a starting point for this process, consider the following questions.
What’s my taxonomy?
You can’t categorize documents, or anything else for that matter, without a coherent list of known categories and criteria that distinguish one from the others. This list, along with the characteristics of each element, is known as a taxonomy, and most people use taxonomies in everyday life without even knowing it. We instinctively know the difference between a laptop and a desktop computer, and most people can articulate those differences with relative ease.
The same is true when documents are involved. What’s the difference, for instance, between an “Accounting and Finance” document and one from “Engineering”? Are there key phrases, terms, and intents that could help an employee distinguish one from the other with a reasonable level of confidence? If the answer is yes, then it is likely that software such as IBM Content Classification™ will be able to distinguish one from the other once it has been trained to recognize those differences.
Certain categories may be more problematic: “Legal” and “Regulatory” may involve significant overlap of intent and language, for instance. The rule of thumb is simple. If a human can’t classify documents into selected categories with a high level of certainty, then a computer won’t be able to either. It’s as simple as that.
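One way to check a draft taxonomy for this kind of overlap, before any software is trained, is to compare the vocabulary associated with each category. The Python sketch below is only an illustration. The category names come from the examples above, but the key terms, the jaccard and overlapping_pairs helpers, and the 0.3 threshold are all assumptions made up for this example; none of them reflect how IBM Content Classification or any other product works internally.

```python
from itertools import combinations

# Hypothetical taxonomy: the category names are from the article,
# the key terms are invented purely for illustration.
TAXONOMY = {
    "Accounting and Finance": {"invoice", "ledger", "budget", "audit", "payable"},
    "Engineering": {"specification", "tolerance", "prototype", "schematic", "firmware"},
    "Legal": {"contract", "compliance", "liability", "statute", "audit"},
    "Regulatory": {"compliance", "statute", "filing", "inspection", "audit"},
}


def jaccard(a, b):
    """Share of terms two categories have in common (0.0 = disjoint, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_pairs(taxonomy, threshold=0.3):
    """Return category pairs whose term overlap meets the threshold, highest first."""
    pairs = []
    for (name_a, terms_a), (name_b, terms_b) in combinations(taxonomy.items(), 2):
        score = jaccard(terms_a, terms_b)
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 2)))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


if __name__ == "__main__":
    for name_a, name_b, score in overlapping_pairs(TAXONOMY):
        print(f"{name_a} / {name_b}: {score}")
```

With these made-up term lists, only the Legal / Regulatory pair crosses the threshold (three shared terms out of seven distinct ones, or about 0.43). A pair flagged this way is a candidate for merging, or for sharper criteria, before classification begins.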
Do I understand my content?
Generally, creating a taxonomy only works if you understand the content you intend to classify. A review of the content to be classified (not just document titles, but some amount of actual content, along with associated metadata) should be conducted as part of the taxonomy creation process.
If multiple content sources with multiple types of documents and intents are to be classified, then a sample from each must be reviewed in order to determine how its specific content might affect the outcome of the classification process. There may also be cases where certain file types, such as image-format PDFs or encrypted data, can’t be read successfully by text-oriented classification software. Document language must also be taken into account, since automated classification software must be trained on a per-language basis.
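The sampling and readability checks described above could be roughed out along the following lines. This is a minimal sketch under stated assumptions: the record fields (source, text, encrypted, language), the sample data, and the sample_per_source and triage helpers are hypothetical, and the extracted text is assumed to come from whatever extraction tool is already in use.

```python
import random
from collections import defaultdict

# Hypothetical inventory records; in practice these would come from a crawl
# of each repository plus the output of a text-extraction step.
DOCUMENTS = [
    {"source": "finance-share", "name": "q3_budget.xlsx",
     "text": "Quarterly budget summary for the finance team.", "language": "en"},
    {"source": "finance-share", "name": "scan_0041.pdf", "text": "", "language": None},
    {"source": "finance-share", "name": "payroll.zip", "text": "", "encrypted": True},
    {"source": "mail-archive", "name": "msg_1882.eml",
     "text": "Bitte senden Sie mir die aktuelle Spezifikation.", "language": "de"},
]


def sample_per_source(documents, per_source=2, seed=0):
    """Draw up to `per_source` documents from each content source for manual review."""
    by_source = defaultdict(list)
    for doc in documents:
        by_source[doc["source"]].append(doc)
    rng = random.Random(seed)  # fixed seed so the review sample is repeatable
    return {
        source: rng.sample(docs, min(per_source, len(docs)))
        for source, docs in by_source.items()
    }


def triage(doc, min_chars=20):
    """Label a document by whether text-oriented software could plausibly read it."""
    if doc.get("encrypted"):
        return "skip: encrypted"
    if len(doc.get("text", "").strip()) < min_chars:
        # Little or no extractable text, e.g. an image-format PDF.
        return "skip: no extractable text"
    return f"ok: {doc.get('language') or 'unknown'}"


if __name__ == "__main__":
    for source, docs in sample_per_source(DOCUMENTS).items():
        for doc in docs:
            print(source, doc["name"], "->", triage(doc))
```

Grouping the "ok" results by language gives a rough count of how many per-language training sets would be needed, while the "skip" results show how much content falls outside what text-oriented software can handle.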
It’s also necessary to consult appropriate internal authorities, such as legal advisors and regulatory affairs personnel, in order to determine how long various document types must be retained. While questions such as these are more directly related to retention and file policies, they’re also relevant to automated document classification. Certain document types may contain specific terms and phrases that the software can be configured to search for, resulting in higher confidence levels when performing classification tasks.
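To make the idea of configured terms and phrases concrete, here is a small phrase-matching sketch. It is not how any particular product computes confidence. The phrase lists, the score and classify helpers, and the 0.5 cutoff are assumptions chosen for illustration, and "confidence" here simply means the share of a category's configured phrases found in the text.

```python
import re

# Hypothetical phrase lists; real ones would come from the legal,
# regulatory, and business reviews described above.
RULES = {
    "Legal": ["attorney-client privilege", "non-disclosure agreement", "indemnification"],
    "Accounting and Finance": ["accounts payable", "general ledger", "purchase order"],
}


def score(text, rules=RULES):
    """Return, per category, the fraction of its configured phrases found in the text."""
    lowered = text.lower()
    scores = {}
    for category, phrases in rules.items():
        hits = sum(
            1 for phrase in phrases
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
        )
        scores[category] = hits / len(phrases)
    return scores


def classify(text, rules=RULES, min_confidence=0.5):
    """Pick the best-scoring category, or None when confidence is too low."""
    scores = score(text, rules)
    best = max(scores, key=scores.get)
    if scores[best] < min_confidence:
        return None, scores[best]
    return best, scores[best]


if __name__ == "__main__":
    print(classify("Please match the purchase order against the general ledger."))
    print(classify("Lunch menu for Friday"))
```

Returning None below the cutoff, rather than forcing a category, leaves low-confidence documents available for human review.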
What’s the goal?
This question must be asked before undertaking any I.T.-related project, since the cost and effort must be justified by a measurable return on investment. The business case for automated content classification depends on the industry, current practice, and the desired outcome. Do you need to consolidate content sources as the result of an acquisition or merger? Are regulatory needs driving the requirement for efficient, legally defensible document management practices? Is your email server laboring under the burden of 10 years’ worth of potentially useless messages?
Done correctly, an automated classification project can offer a solid ROI in a fairly short period of time. Lower storage and infrastructure costs, easier access to relevant data, and less exposure to litigation-related issues are obvious benefits that can justify the time and expense involved. Tasks such as taxonomy creation and an initial document review should be performed in advance whenever possible. Doing so will help ensure success while preserving schedules and keeping implementation costs to a minimum.