Making Key Words Work Smarter, Not Harder
by Jeff Fehrman
Have you ever seen A&E’s reality TV show, “Storage Wars”? The concept is simple. After viewing the contents of a repossessed storage unit for five minutes from the unit’s door, professional buyers take a guess at what’s inside and bid to win the contents of the unit with the hope of discovering valuable goods buried deep within. Sometimes their arbitrary bids pay off, sometimes they don’t.
The way most companies conduct keyword searches during eDiscovery is the legal equivalent of “Storage Wars.” Based on little information, a preliminary list of keywords is used to launch a costly discovery process that ends with a review set that includes more data than is needed and may or may not have the most relevant information included.
Now imagine this…the “Storage Wars” bidder gets unlimited access to thoroughly explore the storage unit and conduct preliminary searches to gauge the value of the items within, and – using that information – come up with a well-researched, informed bid. The bidder hits pay dirt every time.
Is there a keyword search equivalent? Absolutely. It is now possible to see and analyze the data early in the process and make the informed, cost-effective decisions described in the second scenario. Advances in technology over the last few years – which allow you to actually look at your data before you run any search terms and make cost and resource estimates – enable you to uncover smaller, more relevant data sets, save money, and protect yourself from potential legal exposure.
Keyword Search isn’t Dead; it Just Needs a Fresh Perspective
Keyword searching has been around almost as long as computers, and Boolean logic has been around since 1854, when George Boole introduced the binary algebra that now bears his name. And though keyword searching is still the most common technique used in legal discovery to locate potentially relevant data, numerous studies and articles by both the bench and experts in the industry have questioned its effectiveness.
As early as 1985, the Blair and Maron study1 found that, though reviewers believed that their keywords had identified 75 percent of the relevant documents, in reality, they had uncovered only 20 percent. In a more recent 2009 Text Retrieval Conference (TREC) study2, keyword searches were found to be even less effective, with only 9 percent of the documents deemed relevant.
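The gap between perceived and actual performance is easier to see with the two standard measures of search quality, recall and precision. The following is a minimal sketch only: the document IDs and set sizes are invented for illustration and are not taken from either study, although the resulting 20 percent recall mirrors the Blair and Maron figure.

```python
def recall(retrieved: set, relevant: set) -> float:
    """Share of all relevant documents that the search actually found."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved: set, relevant: set) -> float:
    """Share of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical collection: 100 relevant documents (IDs 0-99).
relevant_docs = set(range(100))
# Hypothetical search: finds 20 relevant documents plus 30 irrelevant ones.
retrieved_docs = set(range(20)) | set(range(1000, 1030))

print(f"recall:    {recall(retrieved_docs, relevant_docs):.0%}")
print(f"precision: {precision(retrieved_docs, relevant_docs):.0%}")
```

Recall can only be measured if someone establishes which documents are truly relevant, which is why a team that never samples what its search left behind can badly overestimate how much it has found.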
There have been some prominent judicial opinions that take a negative view of “blind” keyword searches. But many eDiscovery experts agree that keyword searches that follow The Sedona Conference® Best Practices Commentary on the Use of Search and Retrieval Methods in E-Discovery3 (The Sedona Conference® Best Practices) are effective when they involve some combination of testing, sampling, and iterative feedback.
The root of the problem isn’t the tool itself, but the carpenter. Keyword search has been given a bad rap because it’s been applied with a black box mentality and has been misused and misunderstood by service providers and consultants. Search is a process and keyword searching is just one tool within this process, best used in combination with such things as concept search, clustering, email threading, and categorization – all of which are most effective with human involvement. These missteps generate unnecessary expenses and are starting to come under the scrutiny of the courts.
Excess, Incomplete Data
A traditional keyword search is a great starting point for finding things that are known to exist, but is ineffective at finding things that deviate slightly and things that are unknown. Search returns are often bloated by false positives, while false negatives (relevant documents the search missed) are left behind and typically go unreviewed. Because a typical keyword search returns exact matches to the search criteria regardless of context, many documents that are returned are completely irrelevant.
A search for the keyword “apple” could produce documents pertaining to the company, the fruit, or perhaps Chris Martin and Gwyneth Paltrow’s firstborn. Just as problematic, searches can also miss key data if a keyword does not find an exact match in the searched documents. This is the case with misspellings, variations of search terms, or different wording for concepts relevant to the case.
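A small sketch can show both failure modes at once. The documents below are invented for illustration, and the whole-word, case-insensitive match is only an assumption about how a basic exact-match search might behave, not a description of any particular review platform.

```python
import re

# Hypothetical documents, invented for illustration.
documents = {
    "doc1": "Apple announced quarterly earnings on Tuesday.",
    "doc2": "Please pick up an apple and two pears for lunch.",
    "doc3": "The Aple contract renewal is attached.",      # misspelling
    "doc4": "Cupertino wants the licensing deal signed.",  # different wording
}


def exact_keyword_hits(keyword: str, docs: dict) -> list:
    """Return IDs of documents containing the keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if pattern.search(text)]


hits = exact_keyword_hits("apple", documents)
print(hits)
```

If the matter concerns the company, doc2 is a false positive, while doc3 and doc4 are false negatives that never reach a reviewer.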
Not only are current-day searches ineffective, they’re also costly. According to a 2012 Rand Corporation study5, the review stage of eDiscovery consumes about 73 cents of every dollar spent on electronically stored information (ESI) production, while collection and processing consume about 8 cents and 19 cents, respectively. Reducing the amount of data that’s moved to review would therefore cut into the largest share of the cost.
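As a rough worked example of what those proportions imply, the sketch below estimates total cost after shrinking the review set. The budget figure and the 40 percent reduction are hypothetical, and the calculation assumes review cost scales linearly with the number of documents reviewed, which real matters may not follow.

```python
# Cost shares per dollar of ESI production, as quoted from the Rand study above.
COST_SHARES = {"collection": 0.08, "processing": 0.19, "review": 0.73}


def projected_cost(total_budget: float, review_reduction: float) -> float:
    """Estimate total cost after cutting review volume by `review_reduction`.

    Assumes (hypothetically) that review cost scales linearly with the
    number of documents reviewed and that other stages are unchanged.
    """
    review_cost = total_budget * COST_SHARES["review"]
    other_cost = total_budget - review_cost
    return other_cost + review_cost * (1.0 - review_reduction)


budget = 100_000.0  # hypothetical matter budget in dollars
print(f"{projected_cost(budget, 0.40):,.0f}")
```

Under these assumptions, a 40 percent smaller review set brings a $100,000 matter down to roughly $70,800, a far larger saving than the same percentage cut applied to collection or processing.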
The typical back-and-forth process of developing and evaluating keyword lists over and over again also wastes time and money. Each iteration and new search can ratchet up the price and lengthen the timeline. In addition, searches that return false positives inflate the number of documents that move to review, while false negatives leave relevant documents behind. What isn’t known could really cost you.
Courts Want Transparency
More and more, the courts are starting to pay attention to how search terms are developed and used. As evidenced by case law (see Case in Point), the courts have growing concerns about the integrity and accuracy of keywords because the selection process, as it stands, is manual and based on limited insight and understanding of the documents in question. Ralph Losey, lawyer and author of the e-Discovery Team blog, refers to this process as “Go Fish”6 information retrieval.
The plaintiffs in Kleen Products v. Packaging Corp. of America 7 (No. 10-5711) recently withdrew their demand that defendants apply a predictive coding strategy and agreed to apply their ESI search methodology based on an iterative keyword strategy. In addition, alternative search methodologies may be revisited for any document productions after October 1, 2013. Three other current matters in which predictive coding or technology-assisted review is in the spotlight could further shape the expectations of the courts.
Court scrutiny aside, drafting search terms and agreeing to them with opposing counsel before ever looking at the data does law firms and clients a disservice.
What’s Gone Wrong
Keyword searching can be effective for identifying a starting point for finding things that exist within the data set. But it’s failing because it’s being used as a stand-alone tool to identify what to move from processing to review without validating results.
Putting the Cart before the Horse
The typical scenario goes like this: attorneys develop a list of keywords that they think might be contained within the collection developed through the custodian interview process, they run the terms against the ingested data, and – instead of looking at the documents – they review the limited information provided by search reports to evaluate which documents are responsive. Like a volley, the keywords are then refined and lobbed back over to be run again, back and forth, iteration after iteration, until everyone is comfortable with the counts.
The problem is that review is focused on what is already known or believed to be known and the whole process doesn’t actually involve the content of the data. Decisions are made by looking at percentages on a spreadsheet. To put this in perspective, imagine running a Google search and instead of pulling up results with snippets, hit highlights, and Web page addresses, you simply receive the total number of pages that hit on the keyword searches that were run. This is essentially what happens with the current discovery search process. It provides no context, no details as to why the searches are ineffective, no understanding of how to refine the searches, and no indication of other areas that should be explored.
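The difference between a count-only report and results shown in context can be sketched in a few lines. The documents, the search term, and the function names here are all hypothetical, and the simple substring match stands in for whatever matching a real search tool performs.

```python
import re


def hit_count(term: str, docs: dict) -> int:
    """What a bare search report shows: how many documents hit."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return sum(1 for text in docs.values() if pattern.search(text))


def hit_snippets(term: str, docs: dict, width: int = 25) -> list:
    """Return (doc_id, snippet) pairs showing each hit in context."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    snippets = []
    for doc_id, text in docs.items():
        for match in pattern.finditer(text):
            start = max(match.start() - width, 0)
            end = min(match.end() + width, len(text))
            snippets.append((doc_id, text[start:end]))
    return snippets


# Hypothetical documents, invented for illustration.
documents = {
    "email_17": "Let's keep the pricing discussion off email until Friday.",
    "memo_03": "Cafeteria pricing for the holiday lunch is attached.",
}

print(hit_count("pricing", documents))
for doc_id, snippet in hit_snippets("pricing", documents):
    print(doc_id, "...", snippet, "...")
```

The count says only that two documents hit. The snippets show that one hit looks worth a closer read and the other is noise, which is exactly the context needed to refine the term.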
Technology exists today that makes it affordable and practical to dive into the data earlier in the process and then develop keywords to carve up the data and prioritize what gets looked at, when. Preliminary searches around the basics of the case and the key timeframes and participants should then drive keyword development, not the other way around. When a subset of the data is investigated, building and refining a list of keywords is easier and a much more fruitful exercise.
The process of developing keywords first and searching second is also problematic in the case of custodians. As the Blair and Maron study showed, reviewers who believed they had found 75 percent of the relevant documents had actually found only 20 percent. An effective use of keywords would start by interviewing key personnel, loading data while developing searches from information gathered in interviews, investigating results to find additional points of interest and low recall, refining, and then re-interviewing as needed to understand and uncover more – then repeating if necessary. Using the data to direct custodian interviews, as opposed to letting custodians point you to where the data might live, will help you uncover more valuable and useful information and help to limit surprises during the review phase.
Treating eDiscovery Search as Enterprise Search
As noted by Kamal Shah in his article, “Enterprise Search vs. E-Discovery Search: Same or Different?”8, enterprise search is based on speed and simplicity and is designed to process a single search query and deliver results for that particular search. As in the case of Google, the search engine executes the query and the user isn’t privy to what the engine has actually searched for. In eDiscovery, speed and simplicity are also important, but in addition – as Judge Grimm stated in the Victor Stanley case – users have to be able to show how their searches were executed.
Searching with Blinders On
In discovery, people tend to begin a search thinking they know what they’re searching for and oftentimes don’t deviate from their original assumptions. Keyword searching has traditionally been a culling process to remove nonresponsive documents and has been treated as a linear progression, with search terms being used to narrow down what is relevant to the case. With these approaches, key themes, documents and even potential custodians can get overlooked.
Search terms should be used to target what you know and investigation of those results should be used to uncover what you don’t know. Knowledge is power and it’s hard to expand your knowledge if you don’t research the things you don’t know. Keywords should guide the direction of your investigation, not limit it.
Broadening the scope helps identify things such as sidebar conversations that might not get picked up by blind searches. The themes contained within a data set can go in many directions that all get to the same conclusion – how can you reasonably find the themes when you are only chasing one of them?
A New Approach
In addition to fixing some of the problems that are currently hindering search effectiveness, there are other changes and technological solutions that would greatly improve the ways in which keyword searches are conducted and employed.
Search Needs Technology and a Human Touch
Keyword searching should be a blended, iterative approach that combines both humans and technology – never only one or the other. As the Blair and Maron study demonstrates, a computer can’t distinguish relevant cases from irrelevant cases by searching on the full text of a case. Once a person understands the case at hand and the language in the documents, then technology can be applied. A computer can’t interpret the case.
Investigate Data Early and Often
Spending more time at the beginning of the process to fully understand the data helps ensure that the data you’re moving into review is most likely to be relevant. Search is not just about which documents should be produced – early analysis can provide you with insight into the case itself.
The more you investigate the data on the front end, the more you can uncover, understand, plan, and make informed decisions throughout the process. Testing search terms early gives you a feel for what’s relevant and what’s not and will inform your direction. You’ll start to uncover documents and individuals who weren’t necessarily on your radar screen, and common challenges, such as uncovering a key custodian at the eleventh hour of review, can potentially be avoided.
Start with what is known and expand out from there. There’s always information with which to start – use that to point you in the right direction and help develop the search terms. Take a few established things (key time periods of interest, custodians who would have had direct involvement in the matter, the basics of what the litigation is about) to set the direction of the case and go where the data points you. Pick a term to use as a starting point and search through the metadata: file names, email subjects and other areas where custodians are more likely to use the standard language to talk about the subject of interest. From there, you can look into the documents to determine people who were referenced in the document and other relevant materials.
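One way to picture this starting point is a search restricted to known custodians, a known time window, and metadata fields only. The sketch below is illustrative: the `Document` fields, the custodian names, and the "Falcon" term are all hypothetical, and a real platform would expose its own fields and query syntax.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    doc_id: str
    custodian: str
    sent: date
    subject: str  # email subject or file name
    body: str


def metadata_first_search(docs, term, custodians, start, end):
    """Find documents from known custodians and dates whose subject or
    file name mentions the starting term."""
    term = term.lower()
    return [
        d for d in docs
        if d.custodian in custodians
        and start <= d.sent <= end
        and term in d.subject.lower()
    ]


# Hypothetical collection, invented for illustration.
docs = [
    Document("d1", "alice", date(2012, 3, 5), "Project Falcon pricing",
             "Looping in Bob on this."),
    Document("d2", "alice", date(2011, 1, 9), "Project Falcon kickoff",
             "Agenda attached."),  # outside the time window
    Document("d3", "carol", date(2012, 3, 6), "Lunch order",
             "Falcon came up at lunch."),  # term only in the body
]

hits = metadata_first_search(
    docs, "falcon", {"alice", "carol"}, date(2012, 1, 1), date(2012, 12, 31)
)
for d in hits:
    print(d.doc_id, d.subject, "|", d.body)
```

Reading the body of the one hit surfaces "Bob," a name that was not on the starting custodian list and a natural next thread to pull.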
In the Victor Stanley case, Judge Grimm cites The Sedona Conference® Best Practices as a source of best practices stating, “In this regard, compliance with The Sedona Conference® Best Practices for use of search and information retrieval will go a long way toward convincing the court that the method chosen was reasonable and reliable.”
The Sedona Conference® Best Practices document suggests that key components of an effective search methodology include:
• Testing searches to identify whether they are producing over- or under-inclusive results.
• Sampling documents determined to be privileged or not to arrive at a comfort level that the categories are neither over-inclusive nor under-inclusive.
• Getting iterative feedback by refining searches based on the testing results and validating the refinements.
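The testing and sampling steps above can be sketched as a check for under-inclusiveness: draw a random sample from the documents a search did not return and have a reviewer look at them. This is a minimal illustration, not a prescribed protocol. The function names, the sample size, and the stand-in reviewer rule are hypothetical; in practice the relevance call comes from a person.

```python
import random


def sample_null_set(all_ids, hit_ids, sample_size, seed=0):
    """Draw a random sample of documents the search did NOT return."""
    null_set = sorted(set(all_ids) - set(hit_ids))
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(null_set, min(sample_size, len(null_set)))


def missed_rate(sample_ids, reviewer_says_relevant):
    """Fraction of the sampled non-hits that a reviewer marked relevant."""
    if not sample_ids:
        return 0.0
    missed = sum(1 for doc_id in sample_ids if reviewer_says_relevant(doc_id))
    return missed / len(sample_ids)


# Hypothetical collection of 1,000 documents, 100 of which hit on the terms.
all_ids = range(1000)
hit_ids = range(100)
sample = sample_null_set(all_ids, hit_ids, sample_size=50)


def stand_in_reviewer(doc_id):
    """Placeholder for a human relevance call (hypothetical rule)."""
    return doc_id % 10 == 0


rate = missed_rate(sample, stand_in_reviewer)
print(f"{rate:.0%} of sampled non-hits were relevant")
```

A high rate suggests the terms are under-inclusive and need another round of refinement. Recording the seed and the sample alongside the results also helps document what was done to validate the search.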
Find the Right Solution
Look for a solution that:
• Allows you to weed out false positives – To avoid inclusion of irrelevant files, you need the ability to detect variations in the data. A solution that identifies uniqueness and outliers will help pinpoint issues with particular terms.
• Lets you establish workflows and replicate exercises – Data tells a story. As you add more data, that story can veer in many directions. You need a solution that can help you retrace your steps, easily apply what you learned in a dynamic fashion, and sample areas of the population that have gone unexplored. Define an iterative process and continue to re-visit until there is comfort that a reasonable effort has been made to uncover the documents relevant to the request with documentation to show what was done and steps that were taken to validate.
• Has flexible and faceted search capabilities – This allows keywords to be removed, expanded, stemmed, and modified iteratively during processing. It provides the ability to apply, evaluate and organize searches, as well as group and classify sets of documents that are isolated and comment as to why they’re important.
• Provides detailed reporting throughout the process – Instead of producing one final search report, look for a tool that generates details about data early and often.
• Gives you the ability to transfer knowledge and background to other involved parties – Make search term development and history transparent to people further downstream in the process, including opposing counsel and the courts, if needed.
• Offers users Web-based access to the data – This gives users the ability to receive reports and dive deeper into data directly from their browsers.
The Net Net
Keyword searching is currently a powerful but misused tool. Properly used, it can produce less – but more applicable – data in a more cost-effective way. Re-examining how keywords are applied and choosing the right tool for the job are vital to making that happen. Keyword searching is not just a tool to whittle down data, but a way to investigate data to discover more and review less. Taking the time up-front to assemble and examine the data to develop and iterate on keywords saves time and money further downstream. This measure-twice-cut-once strategy is a reliable way for legal departments and law firms to develop more effective and predictable budgets. And providing transparency into the process of developing keywords will minimize legal exposure and meet the expectations of the courts.
1 David Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System, (1985)
2 F. Zhao, D. Oard and J. Baron, Improving Search Effectiveness in the Legal E-Discovery Process Using Relevance Feedback, (June 2009)
3 The Sedona Conference®, Best Practices Commentary on Search and Retrieval Methods, (August 2007)
4 Philip Favro, Mission Impossible? The eDiscovery Implications of the ABA’s New Ethics Rules, (August 2012)
5 Rand Corporation, Where the Money Goes, (2012)
6 Ralph Losey, Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search, (2009)
7 Milton I. Shadur, Kleen Products, LLC, et al v. Packaging Corporation of America, (April 2011)
8 Kamal Shah, Enterprise Search vs. E-Discovery Search: Same or Different?
About the Author
Jeff Fehrman is Chief Strategy Officer of Mindseye Solutions. Mr. Fehrman spent 3 years at Integreon, where he provided consulting and advisory services to law firms, corporations, and government clients throughout the eDiscovery lifecycle. Prior to his tenure at Integreon, Mr. Fehrman spent 7 years as a vice president with ONSITE3. He’s currently on the Board of Governors for the Organization of Legal Professionals (OLP) and has been an active member of the Sedona Conference’s Working Group on Electronic Document Retention and Production since 2006. He is also the co-founder of EDD Blog Online. He is based in Arlington, VA.