The Classification World
Happy New Year.
With the new year, I've decided to shift my writing and publishing efforts over to a new blog.
I've entitled that blog "(un)structured". So for all you loyal "Classification World" readers (hi mom!), mosey on over to my new blog, where I'll be covering not only content classification topics but also other topics in content analytics, the intersection of structured and unstructured information, and other issues in ECM and information management.
Of course I'll continue to use Twitter (@joshpayne) and look forward to more writing and interaction in 2010.
John Mancini from AIIM has been running a series on enterprise content management on his blog over the past few months. He's canvassed for guest authors to provide entries, following the theme "8 things you need to know about . . .". I contributed a blog posting on content classification.
John has collated some of these contributions into two e-books. The second e-book was released earlier this week, on the topic of records management implementation. It's insightful, timely and, best of all . . . free.
I encourage you to check them out. And contribute your own "8 things" entry to John.
The Information on Demand (IOD) conference, the last week of October, was an exciting week for me personally. At the conference, we made some new product announcements on projects I’ve been working pretty hard on. (As such, it’s taken me this long to peek out from the resulting pile-up of work to comment here.) It was gratifying to see a lot of these ideas really spread beyond the virtual walls of IBM.
Today, I’ll give color on one of the announcements – InfoSphere Content Assessment.
I’ve been thinking about this concept of content assessment as part of my classification work for the past year or so. In my travels, when I introduced the idea of “content classification” to enterprise content customers, I found that the phrase could mean very different things to different people – and often times different from what I thought it meant.
I was frequently framing content classification as an automated tool for taking action – classifying content as part of the process of organizing unstructured information. Classification as part of the email archiving process. Classification as a tool for executing records management.
Yet many of my customers would hear about content classification and conceive of it as a tool for understanding what they have, divorced from the idea of taking immediate action (at least taking action in the short term). These customers knew they had a lot of information in an unmanaged state. They knew it posed some level of risk to their organization, but they weren’t sure what type of risk. They weren’t sure where the risk was the highest. They weren’t sure where the ROI for tackling the risk was strong and where it was weak.
These customers wanted to assess the state of their “content in the wild” and figure out their action plans. Their IT groups are great at telling them the hard stats based on information outside the content – the amount of disk in the enterprise, the utilization of that disk, etc. But IT couldn’t tell them what lay within those millions of files. How do they know what content needs to be saved before they can decommission the system? Where do we start our records program? How can we assess the potential ROI for a records program? All these questions beg for insight that only the insides of the content residing in the enterprise can give. Content classification and, more broadly, content analytics provide this kind of insight to help answer these bigger picture questions as organizations seek to gain better control over their content.
Customers were frequently saying to me “I have 10 Terabytes of content, unmanaged. I don’t know what I should do. I don’t know what I can do with it. It’s too much to sift through manually.”
InfoSphere Content Assessment is designed to help those customers.
That is the short introduction on how I came to realize the importance of content assessment for organizations. Others who were at the conference have written up their own impressions of Content Assessment, including Brian Hill of Forrester and Fern Halper of Hurwitz Associates.
I’m stuck in the JFK airport, on my way to the IOD 2009 conference so I’ll take this chance to justify my lack of blog posts over the last few weeks.
It’s been a very busy October. Let me give some quick observations on what I’ve done so far and what I’m doing the rest of the month.
ARMA 2009 - Orlando
I spent a chunk of October preparing for and attending the ARMA conference in
I also taught an education class on automated classification. It was a combination of material I use every day and new material I created for the class. I think it was a success. The attendees for each session were engaged and asked good questions. I was a little nervous about it, going in, but was pretty pleased with the outcome. I think I’ll be trying to reach out to local ARMA chapters to repeat it, on a smaller scale in other locations. Let me know if you’re interested.
It was my first ARMA conference where I ventured out from the expo floor. Generally I found the attendees active and open, making the most of their time at the conference.
IOD 2009 –
Finally, I’m on my way to Information on Demand (IOD) 2009 in
Session TCM-2948: “Demystifying Classification” is all about demos. I’m planning on providing demonstrations of content classification and the InfoSphere Classification Module in four different scenarios.
Session TCM-2943: “Content Classification: Critical to Assessment, Collection & Archiving” is an overview talk covering applications of advanced content classification.
There are also hands-on labs for InfoSphere Classification Module (HOL-1192, Wednesday) and a technical product update (TCM-2455, Tuesday), reviewing many of the improvements we’ve made to the InfoSphere Classification Module.
Lastly, I’m presenting TCM-2947 “Delivering Trusted Information from IBM ECM: True Single View Applications” on Wednesday morning. It’s a bit of a departure from my classification work, but is peripherally related. If you’re interested in single-view type applications and how ECM can work with them, I encourage you to join me on Wednesday. I’ll be writing more about the topics covered in this session in the coming weeks.
See you in Vegas!
Typically when I make the case for automating email classification decisions, I provide examples of employee behavior that appeal to my inherent personal sense of chaos. People are distracted. People are busy. People have higher priorities. Due to information overload, they're unlikely to participate or comply with your email regulations.
The employee (Kineavy) deleted his email every day, before the city's rudimentary archiving backup kicked in. Correctly, the common refrain within the compliance community has been that the city's IT and records management approach was flawed and that it's yet another example of an organization that needs to get its act together. But what caught my eye about this and gave me a chuckle was that (purposely or not) the employee was skirting compliance obligations because he was too conscientious when it came to managing his email. I'd never really thought about how employees might be unreliable when it comes to abiding by email compliance dictates . . . because they could be too good about managing their inbox.
Last winter, the TV program Frontline helped cut through the noise and confusion of the economic recession. They aired an episode entitled “Inside the Meltdown” that explained the happenings in a clear manner. The episode focused on the concept of moral hazard – the simple idea that a party “insulated from risk may behave differently from the way it would behave if it were fully exposed.” If you know someone is going to back you up, you’re more likely to take risks because the downside of those risks is softened.
I was reminded of this concept a few weeks ago when I spoke to a gathering of ECM customers and partners in
My view on this is that end-users are unlikely to participate for a host of reasons – that’s one of the main drivers for incorporating automated classification in the first place. So to worry about reduced participation at that point . . . is like worrying about closing the barn door after the horses have gotten out.
The other incentives to participate (or not), are more likely to determine users’ behaviors. Things like knowledge on the topic, level of distraction and personal stake in a positive outcome.
Regardless, I found the link to moral hazard interesting, and hope you do too.
What do you think?
I’ve had a chance to talk to more customers and business partners over the last few weeks and months about classification. Frequently these customers ask me “How long is it going to take for me to get up and running?” Or if they’ve learned more about our product, they ask “It seems like it’s going to take a lot of effort to train the system?”
Well educated, motivated knowledge workers don’t want to be forced to take on a dreary task like dividing documents into categories. And they see some of that in their future with a classification training process.
Typically, I provide a boring “It depends” answer, and dive into the variables that come into play. In the end, there’s a big range, depending on how well your business policies are defined and in turn captured in your business applications like records management systems. I’ll post more on the topic in the future. We can automate this process and make the classification training process less burdensome itself. But bottom line, there’s going to be some necessary dirty work.
But my colleague, Michele Kersey, provided a great answer to a customer this week on this question that cut to the heart of the matter. The quote that stuck out in my mind was “Content classification scales out the human element.”
The phrase struck me for two reasons. First, it nicely encapsulates the difference between rule-based classification and the more advanced, context-based classification methods. Rule-based methods force humans to try to classify content under the constraints of binary logic. Lots of IF-THEN statements. Sure, computer science majors can think that way, but John in marketing and Jane in HR don’t think in that manner. Embarking on a rule-based classification project exposes the classic IT/LOB gap.
By taking a training-based approach to content classification (using example documents to “teach” a classification engine), with samples provided and/or validated by those business stakeholders, you are encapsulating the human element in your classification logic. A human has written the documents that are being used to train the system. And a human is choosing these “typical” documents.
This training process takes effort, but the scale and scope of that effort pale in comparison to the effort you’d need to harness otherwise – which is the second reason I liked Michele’s quote. Automated classification using training-based methods scales out the effort your organization has put into training the system. Yes, it takes effort to train a classification system, but you’ll earn back savings on that effort in the following weeks, months and years that the human-based logic is applied and re-applied to answering categorization questions that your workers would otherwise need to handle.
Train the system once, with a small set of workers, and the same work that those workers had to do manually will be executed automatically over and over and over again throughout your organization. Do some work to plant the seed of classification and watch it eliminate repetitive tasks throughout your organization.
Joshua Payne | Tags: precision_and_recall, ecm, classification
Precision and recall. It’s a topic that is frequently misunderstood in the search and classification market. It took me a couple of years hanging around the PhDs to get my head around the concepts. Every time I got confused, I'd check Wikipedia and I'd just get more confused because the Wikipedia entries on precision and recall seem to have been written by the PhDs, for the PhDs. Even my own products’ documentation confused me. False negatives? False positives? I just got it all mixed up.
When it comes to one-hour introductions to classification, this is one of those topics that gets hyper-simplified under the banner of "accuracy". There are so many other new technology and application topics that need to be covered that we just don't pick this battle when we talk about content classification to customers. But this blog is close to a year old and this topic has come up not once, but three times in the last week or so with customers and business partners. So I figure it's time to address it. The audience is ready (I hope).
Where should we start? Ok. Let’s start with the topic of perfection. Now I hope this doesn't come as a shock to you, but these automated methods of classification that I've been touting over the last year or so on this blog and on twitter (and in embarrassing videos) aren't perfect. They’re not perfect? Shocking, I know. But automated classification doesn't get it right 100% of the time. There are two ways to handle the fact that automated classification isn't going to get it right 100% of the time. One approach is to . . .
1) Be exclusive. Be a snob. Only accept the best results. Only accept results from automated classification that the system is highly confident are accurate. If we're not close to certain that this is the right answer, then don't accept it. Being highly exclusive like this is what is meant by having high precision. If you insist on high precision, you won’t categorize all your content automatically (you’ll skip over a lot of the content and leave it uncategorized), but the ones you do categorize, you'll do so with reasonably good success.
The other approach is to . . .
2) Be inclusive. Be a populist. Accept the best answer (or answers), no matter what. Use the top classification guesses despite low levels of confidence in their correctness. What's the impact of this? You'll maximize the gross number of right answers that you get, but it comes at the cost of getting a bunch more wrong. You'll be handling as much as possible automatically, you'll be getting the highest number of answers right, but you'll also be getting more answers wrong. Being highly inclusive means you have high recall.
In real life applications of classification (or search), you can't have it both ways. Everyone wants to have high precision (our answers are always right!) with high recall (we answer every question correctly) . . . but unfortunately that's not realistic. Even our best methods of classification are imperfect. An organization typically needs to make a decision as to how to balance the two factors of precision and recall. Is it more important to try to get more answers right? Or is it more important to have the answers you do provide be correct? The more answers you get right, the more answers you're also going to get wrong.
For the graphically inclined, the tradeoff can be visualized as follows:
The further you move to the right, the lower your precision gets and the more your recall improves. Organizations using automated classification need to determine what their curve looks like . . . and then determine what point on the curve is right for them.
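If you'd like to trace a curve like this for your own content, here is a minimal Python sketch. It is not taken from any product: the `Suggestion` record, the sample scores and the `precision_recall` function are all made up for illustration, and it assumes every document has exactly one true category and that a human reviewer has marked each suggestion right or wrong.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """One classification suggestion for one document (hypothetical record)."""
    doc_id: str
    category: str      # category the engine suggested
    confidence: float  # engine's confidence score, 0.0 to 1.0
    correct: bool      # did a human reviewer agree with the suggestion?


def precision_recall(suggestions, threshold):
    """Accept only suggestions at or above the threshold, then score them.

    precision = right answers accepted / answers accepted
    recall    = right answers accepted / all documents
    """
    accepted = [s for s in suggestions if s.confidence >= threshold]
    right = sum(1 for s in accepted if s.correct)
    # With nothing accepted there are no wrong answers, so report 1.0.
    precision = right / len(accepted) if accepted else 1.0
    recall = right / len(suggestions) if suggestions else 0.0
    return precision, recall


# Made-up review results, purely for illustration.
SAMPLE = [
    Suggestion("d1", "invoice", 0.95, True),
    Suggestion("d2", "invoice", 0.90, True),
    Suggestion("d3", "contract", 0.70, True),
    Suggestion("d4", "contract", 0.60, False),
    Suggestion("d5", "hr", 0.40, True),
    Suggestion("d6", "hr", 0.30, False),
]

if __name__ == "__main__":
    for threshold in (0.8, 0.5, 0.0):
        p, r = precision_recall(SAMPLE, threshold)
        print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

With this made-up sample, a threshold of 0.8 gives precision 1.00 and recall 0.33, while a threshold of 0.0 gives 0.67 for both. That is the trade-off in miniature: the snob gets everything it accepts right but skips most of the content, and the populist gets more right answers overall along with more wrong ones.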
How do we handle this? With any tradeoff decision, you as an organization need to determine where your priority lies. How are you using classification? For what purpose? What is the impact of the automation?
In some situations, a bias towards high recall is appropriate. For example, I might be using automated classification to populate navigation options for users as they attempt to find content. The user might not expect perfect documents with each navigation option and as such, it’s worth the trade-off to have more classifications executed.
On the other hand, I might be using classification to determine what content has business value and what content does not have business value. I might be using automation to slice off the content that I'm highly confident doesn't belong in my ECM repository to reduce down-stream costs. In this case, high precision might be warranted. I want to only get rid of the content I’m reasonably certain I can excise.*
*These are two simple examples. Don't take them as gospel advice. Analyze your own situation carefully.
Once you've thought through your policy on making the precision/recall trade-off, there are various "switches" in your automated classification deployment you can use to put a strategy into action. The two biggest ones that come to mind are:
1) Confidence Threshold. Each classification response, with advanced methods, typically comes with some sort of confidence score for the classification response. If you want high precision, you can set the confidence level at which you accept responses for categories to be very high. If you want high recall, set the confidence level at which you will accept categorization suggestions very low.
2) Number of classification responses to accept. A simple way to increase your recall is to expand the number of classification suggestions you incorporate. With advanced methods, you are asking the automated classification capability to assess similarity. Therefore the classification engine is assessing the similarity to every category upon which it has been trained. Rather than simply accepting the most similar category, you can decide to take the top two suggestions (or three . . . or four).
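To make the two switches concrete, here is a small Python sketch. It assumes the classification engine hands back a confidence score per category as a plain dictionary; the function name `accept_suggestions`, its parameters and the sample scores are hypothetical and do not reflect any particular product's API.

```python
def accept_suggestions(scores, threshold=0.5, top_n=1):
    """Apply both switches to one document's category scores.

    scores    -- dict of category name -> confidence (0.0 to 1.0), assumed shape
    threshold -- switch 1: minimum confidence needed to accept a suggestion
    top_n     -- switch 2: how many of the top-ranked suggestions to consider
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [category for category, score in ranked[:top_n] if score >= threshold]


# Made-up scores for a single email.
scores = {"Contracts": 0.62, "Invoices": 0.58, "Personal": 0.05}

snob = accept_suggestions(scores, threshold=0.9, top_n=1)      # accepts nothing
middle = accept_suggestions(scores, threshold=0.5, top_n=1)    # top answer only
populist = accept_suggestions(scores, threshold=0.5, top_n=2)  # top two answers
```

Raising `threshold` pushes you toward precision and leaves more content uncategorized; raising `top_n` or lowering `threshold` pushes you toward recall.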
For example, a recent IBM customer assessed the applicability of automated classification for automatic assignment of the records management disposition policy for each email they archive. The customer realized that certain types of emails were being inaccurately classified due to how they trained the system. Their examples for two categories were very similar and overlapping. When they dug in, they realized that the training set had given mixed guidance to the classification engine -- it was an instance in which the business stakeholders themselves (the records managers) had frequently clashed on how to properly classify a certain set of content. To resolve the situation (before the advent of automated classification) they had made the policy decision to apply both classifications to the content. As a result, they decided to carry forward the same business policy to their automated classification. Their policy decision implicitly impacted their recall.
Some readers might blanch at the thought of applying this kind of practice across the board. It might not be right to create universal rules to control precision and recall. But you can exercise these two controls on your precision and recall in more focused ways -- building a set of rules (based on your business policies) to control precision and recall for specific categories. For example, you might configure an extra rule in your system to accept the second suggestion if and only if it’s a particular category.
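A category-specific rule of that kind might look something like the following Python sketch. Everything here is hypothetical: the function name, the `second_choice_allowed` set and the category names are invented for illustration, and I'm assuming the second suggestion still has to clear the same confidence threshold as the first.

```python
def accept_with_category_rule(scores, threshold, second_choice_allowed):
    """Accept the top suggestion, plus the runner-up only for chosen categories.

    scores                -- dict of category name -> confidence (assumed shape)
    threshold             -- minimum confidence for any accepted suggestion
    second_choice_allowed -- set of categories that may be accepted as a
                             second suggestion (a business policy decision)
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    accepted = []
    if ranked and ranked[0][1] >= threshold:
        accepted.append(ranked[0][0])
    if len(ranked) > 1:
        category, score = ranked[1]
        # The extra rule: the runner-up counts only if policy names its category.
        if category in second_choice_allowed and score >= threshold:
            accepted.append(category)
    return accepted


# Made-up example: two overlapping records categories, as in the customer story.
email_scores = {"Vendor Contracts": 0.61, "Purchase Orders": 0.57, "Personal": 0.04}
both = accept_with_category_rule(email_scores, 0.5, {"Purchase Orders"})
top_only = accept_with_category_rule(email_scores, 0.5, {"Legal Hold"})
```

The point of the design is that recall goes up only where your business policy says it should, rather than across every category at once.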
It’s important to define your business policies as they will impact your technical decisions around precision and recall. If you're assessing and investigating automated classification products and technologies for use in your scenario, look into the tools that the vendor provides to help make this trade-off decision. InfoSphere Classification Module has a set of reports and workflows for making informed precision/recall trade-offs. Expect them from your own classification tools because this is a critical set of technical and business decisions that will impact the success of your classification project.
And if you have your favorite precision/recall explanation – share it in the comments. I’m sure there are better ones out there.
Another year, another release! I am happy to announce that our advanced content classification product will issue its next release later this week. Version 8.7 comes out on August 20 to be exact. The formal announcement letter that IBM makes me do is no fun – and press releases are so pre-web 2.0 – so let’s take a run-through of some of the new changes, features and improvements we’ve lined up for the product.
The first change of note is the name. We’ve added the InfoSphere brand to the name so the product is now known as IBM InfoSphere Classification Module. The InfoSphere brand is all about trusted information. Content analysis and organization is a key element of enhancing content so it becomes trusted business information.
Of course we’ve made a lot of improvements to the software beyond simply changing the name. The overall emphasis of this release was to continue our focus on improving the Classification Module’s ability to provide advanced classification for our Compliant Information Management customers.
A quick bulleted list of some of the features we’ve added:
Let’s dive into some of the details.
IBM InfoSphere Content Collector Integration Improvements
Significant improvements have been made to the core product to facilitate the use of the Decision Plan functionality by IBM Content Collector. Introduced in V8.6 (released in 2008), Decision Plans provide for two major elements of classification functionality:
V8.7 provides functionality that will allow future versions of IBM Content Collector to efficiently utilize Decision Plans. With Decision Plans, Content Collector users of the Classification Module will have access to a full range of classification analysis methods, improving classification accuracy in content collection and archival scenarios.
Decision Plan improvements
In V8.7, Decision Plans themselves have been enhanced as well.
First, the pattern matching functionality has been enhanced. In prior versions, Classification Module customers have been able to define patterns for identification and subsequent extraction from long form text. This pattern extraction capability is one of the rules-based methods of classification provided by Decision Plans to identify and extract patterns such as account identifiers and customer identifiers.
In V8.7, users of the Classification Workbench can define these pattern matching and extraction rules using the full, standard, regular expression syntax with which IT users are familiar and trained.
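For readers who haven't written one of these rules before, here is a rough idea of what regular-expression pattern extraction looks like, sketched in plain Python rather than in the Classification Workbench itself. The identifier formats (an `ACCT-` prefix with eight digits, a `CUST` prefix with six digits) and the function name are invented for illustration; your own account and customer identifiers will have their own shapes.

```python
import re

# Hypothetical identifier formats, invented for this example.
ACCOUNT_ID = re.compile(r"\bACCT-\d{8}\b")
CUSTOMER_ID = re.compile(r"\bCUST\d{6}\b")


def extract_identifiers(text):
    """Find every account and customer identifier in a block of long-form text."""
    return {
        "account_ids": ACCOUNT_ID.findall(text),
        "customer_ids": CUSTOMER_ID.findall(text),
    }


message = "Please review account ACCT-00123456 for customer CUST004217 before Friday."
found = extract_identifiers(message)
```

Extracted values like these can then feed a rule, for example routing anything that mentions an account identifier to a particular category.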
The Decision Plan functionality has been modified such that it is now extensible. Do you require a custom method of classification analysis in addition to those provided by the InfoSphere Classification Module? Or do you want the InfoSphere Classification Module to validate its assignments with an outside, trusted source of information? The open API now included with the InfoSphere Classification Module allows for such customizations by IBM customers and Business Partners. The Decision Plan functionality now provides call-outs to allow for custom programs to analyze the filtered documents and any categories already defined by the Classification Module.
In V8.7 of the InfoSphere Classification Module, the
If you wish to use the product with other languages, the InfoSphere Classification Module provides a generic language processing option.
Workbench. The workbench administration tool now has improved usability providing for a more intuitive navigation experience via a tabbed navigation screen paradigm. Uninitiated users will find it easier to execute tasks more quickly. I can personally attest to its improved usability. I’ve been using it with great regularity over the past two months and have little to complain about.
Sample projects and tutorials. New sample projects have been added to help users begin classifying content more rapidly. With the new tutorials, you can explore the capabilities of the product on your own, at your own pace.
Five random observations regarding the new Cohasset Associates Whitepaper, "Meet the Content Tsunami Head On: Leveraging Classification for Compliant Information Management".
1) All due respect to John Mancini and the "Digital Landfill", I like the tsunami metaphor Cohasset uses. Plus it makes for a nicer theme to weave throughout the paper (as compared to a theme based on the garbage of a landfill).
2) The paper describes records management as representing the "organizational memory" for enterprises. As someone who was new to records management 2+ years ago, that nicely turned phrase resonated with me. It makes for a tidy justification, in a few words.
3) The abstract mentions, when discussing automated classification, the need for "best practices and technologies that scale and adapt to meet the information governance challenges at hand." The need for "scalable" policies is a succinct way of justifying automated classification. Another tidy justification.
4) On the topic of scaling, "legacy practices from the paper world simply do not scale" is another idea that reminds me of some recent customer interactions (and discussions with my IBM colleagues). Too often customers are saying that yes indeed, they have a file plan but it just hasn't been transferred over to their electronic records management needs. Maturation of record keeping and archival practices just hasn't kept pace with the rate of innovation when it comes to creating electronic content.
5) I like a bunch of the stats that Cohasset was able to pull together, like the cost of classifying departed employees' content . . . or the comparison of human classification to automated classification.
I don't do full justice to the paper. Check out this classification whitepaper today if you're interested in the topic.
1. I recently completed a webcast with my colleague Frank McGovern on the topic of content classification and records management. The webcast was an ARMA event. There was good live attendance with some nice questions at the end. Check out the recording if you have a free hour this week.
2. My IBM colleagues responsible for the eDiscovery product suite recently launched a new version of their offering. Check out the eDiscovery homepage on ibm.com for more information, including a nice demo.
My parents came to dinner last night. Wracked with guilt over not having posted to this blog in close to a month (sorry folks, vacation and then a crazy post vacation deadline for a new project), I was trying to brainstorm on new ideas for the blog when I recalled a memorable (at least for me) vignette involving my dad.
I was probably 17 years old and had been dispatched to pick up my dad at his office. He had been off on a long business trip and had just returned directly from the airport to his office without a car. He needed something from his office. As a newly minted driver, I went to pick him up but he had yet to find whatever it is he needed when I got there. I sat in his office as he rifled through his desk looking for the critical paper or object or whatever it was (my father isn’t the most organized person, his office was a mess – similar to the desk at which I’m writing this). While he rummaged about, he was listening to the voicemails that had built up while he was gone. Corporate voicemail systems were a relatively new development. It was like the voicemails were background music to him. He was barely paying attention to them. Finally, I piped up.
“Dad, aren’t you going to write these messages down?”
He looked up from his search and said “Nah, if any of these are really important, they’ll call me back.”
It was a lesson that informs me to this day. I don’t get too upset over the horrible state of my inbox after a vacation or time away from diligent inbox management – the important stuff pops back up to the top through the persistence of the truly motivated colleague.
The same dynamic is at play today with our archiving obligations around email and all the other manners of communication and collaboration. The amount of information being pushed on each of us mushrooms every year. And our ability and willingness to process it, let alone fulfill compliance obligations around it cannot keep up.
My dad could barely find the time to listen to his voicemails, let alone follow up on them close to 20 years ago. Now you want him to carefully file each email with business value? Without any automated help? It’s a laughable proposition (if you know my dad). But he’s not so different from all of us. Many employees are going to make the same kind of trade-off decision. The incentive just isn’t there to force them to handle classification and organization of their information. They need help.
Joshua Payne | Tags: email, classification, ecm, aiim, compliance
John Mancini, the president of AIIM, keeps a blog called "Digital Landfill." For the last week or two he's opened up his blog to guest authors, each following the format "8 Things You Need to Know . . ."
Today, I took a guest turn, writing "8 Things You Need to Know about Content Classification and ECM." Check it out.
For fun, I plugged in the RSS feed for this blog to a tag cloud creating utility (www.wordle.net) to see what it looked like. Not a shock with respect to the most prominent word.
On June 24, at 1 PM EDT, I'll be participating in a webinar focused on providing a demo of the IBM Classification Module. We'll be showing how the Classification Module can determine the business value of an email (Is it a personal email? Is it a business email? Should it be archived?) as well as showing how the Classification Module can make a similar assessment of file system documents to re-organize them.
This demo is a part of an ongoing series of webinars hosted by IBM ECM. Sign up for the classification demo and any other upcoming webinar in the series at https://www.filenetinfo.com/mk/get/proddemos