Making Key Words Work Smarter, Not Harder
Have you ever seen A&E’s reality TV show, “Storage Wars”? The concept is simple. After viewing the
contents of a repossessed storage unit for five minutes from the unit’s door, professional buyers take a guess
at what’s inside and bid to win the contents of the unit with the hope of discovering valuable goods buried
deep within. Sometimes their arbitrary bids pay off, sometimes they don’t.
The way most companies conduct keyword searches during eDiscovery is the legal equivalent of “Storage
Wars.” Based on little information, a preliminary list of keywords is used to launch a costly discovery process
that ends with a review set that includes more data than is needed and may or may not contain the most relevant documents.
Now imagine this: the “Storage Wars” bidder gets unlimited access to thoroughly explore the storage unit, conduct preliminary searches to gauge the value of the items within, and – using that information – come up with a well-researched, informed bid. The bidder hits pay dirt every time.
Is there a keyword search equivalent? Absolutely. The ability to see and analyze the data early in the process – and make the informed, cost-effective decisions of that second scenario – now exists. Advances in technology over the last few years, which allow you to actually look at your data and make cost and resource estimates before you run any search terms, enable you to uncover smaller, more relevant data sets, save money, and reduce potential legal exposure.
Keyword Search isn’t Dead; it Just Needs a Fresh Perspective
Keyword searching has been around almost as long as computers – and its Boolean underpinnings since 1854, when George Boole developed the system of logic now called Boolean algebra. And though it’s still the most common technique used
in legal discovery to locate potentially relevant data, numerous studies and articles by both the bench and
experts in the industry have questioned its effectiveness.
As early as 1985, the Blair and Maron study1 found that, though reviewers believed that their keywords had
identified 75 percent of the relevant documents, in reality, they had uncovered only 20 percent. In a more
recent 2009 Text Retrieval Conference (TREC) study2, keyword searches were found to be even less effective,
with only 9 percent of the documents deemed relevant.
There have been some prominent judicial opinions that take a negative view on “blind” keyword searches.
But many eDiscovery experts agree that keyword searches that follow The Sedona Conference®
Best Practices Commentary on the Use of Search and Retrieval Methods in E-Discovery3 (The Sedona
Conference® Best Practices) are effective when they involve some combination of testing, sampling, and iterative refinement.
The root of the problem isn’t the tool itself, but the carpenter. Keyword search has been given a bad
rap because it’s been applied with a black box mentality and has been misused and misunderstood by service
providers and consultants. Search is a process and keyword searching is just one tool within this process, best
used in combination with such things as concept search, clustering, email threading, and categorization – all of
which are most effective with human involvement. These missteps generate unnecessary expenses and are
starting to come under the scrutiny of the courts.
Excess, Incomplete Data
A traditional keyword search is a great starting point for finding things that exist, but is ineffective at finding
things that deviate slightly and things that are unknown. Search returns are often bloated by false positives, while false negatives – relevant documents the search missed – typically go unreviewed. Because a typical keyword search returns only exact matches to the search criteria, many of the documents returned are completely irrelevant.
A search for the keyword “apple” could produce documents pertaining to the company, the fruit, or perhaps
Chris Martin and Gwyneth Paltrow’s firstborn. Just as problematic, searches can also miss key data if a
keyword does not find an exact match in the searched documents. This is the case with misspellings,
variations of search terms, or different wording for concepts relevant to the case.
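As a toy illustration (the documents and search terms below are invented for the example), a minimal exact-match search produces both kinds of error described above:

```python
# Hypothetical toy corpus: doc 1 and 2 show the ambiguity problem,
# doc 3 shows the misspelling problem.
docs = {
    1: "Apple reported record iPhone sales this quarter.",
    2: "She packed an apple in every lunch box.",
    3: "Sales figures were misreported in the quaterly memo.",  # misspelling
}

def exact_search(term, docs):
    """Return ids of documents containing the term (case-insensitive)."""
    return {i for i, text in docs.items() if term.lower() in text.lower()}

# "apple" hits both the company (doc 1) and the fruit (doc 2) - a false
# positive for a case about the company, indistinguishable by keyword alone.
print(exact_search("apple", docs))      # {1, 2}

# "quarterly" misses the misspelled doc 3 entirely - a false negative.
print(exact_search("quarterly", docs))  # set()
```

The sketch is deliberately simplistic, but the failure modes are exactly those of production keyword tools: exact matching cannot disambiguate meanings or forgive variant spellings.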
Not only are current-day searches ineffective, they’re also costly. According to a 2012 Rand Corporation study5, the review stage of eDiscovery consumes about 73 cents of every dollar spent on electronically stored information (ESI) production, while collection and processing consume about 8 cents and 19 cents,
respectively. Reducing the amount of data that’s moved to review would greatly reduce the high costs
associated with review.
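To put the Rand split in concrete terms, here is a back-of-the-envelope calculation; the total spend and the culling rate are hypothetical figures, only the 73-cent review share comes from the study cited above:

```python
# Rand split cited above: of each dollar of ESI production spend,
# review consumes ~$0.73 (collection ~$0.08, processing ~$0.19).
total_spend = 1_000_000          # hypothetical matter: $1M all-in
review_share = 0.73

review_cost = total_spend * review_share             # $730,000
# If better-targeted searches cut the review set by a hypothetical 40%,
# review cost (which scales roughly with document volume) drops with it:
culled_review_cost = review_cost * (1 - 0.40)        # $438,000
savings = review_cost - culled_review_cost           # $292,000
print(f"${savings:,.0f} saved")  # → $292,000 saved
```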
The typical back-and-forth process of developing and evaluating keyword lists over and over again also
wastes time and money. Each iteration and new search can ratchet up the price and lengthen the timeline.
In addition, searches that return false positives and overlook false negatives can inflate the number of
documents that move to review. What isn’t known could really cost you.
Courts Want Transparency
More and more, the courts are starting to pay attention to how search terms are developed and used.
As evidenced by case law (see Case in Point), the courts have growing concerns about the integrity and
accuracy of keywords because the selection process, as it stands, is manual and based on limited insight into and understanding of the documents in question. Ralph Losey, lawyer and author of the e-Discovery Team blog, refers to this process as “Go Fish”6 information retrieval.
The plaintiffs in Kleen Products v. Packaging Corp. of America7 (No. 10-5711) recently withdrew their demand that defendants apply a predictive coding strategy and agreed to an ESI search methodology based on an iterative keyword strategy. In addition, alternative search methodologies may be revisited for any document productions after October 1, 2013. Three other matters in which predictive coding or technology-assisted review is in the spotlight could further shape the expectations of the courts.
Courts aside, drafting search terms and agreeing on them with opposing counsel before ever looking at the data does law firms and their clients a disservice.
What’s Gone Wrong
Keyword searching can be an effective starting point for finding things known to exist within the data set. But it’s failing because it’s being used as a stand-alone tool to decide what moves from processing to review, without validating the results.
Putting the Cart before the Horse
The typical scenario goes like this: attorneys develop a list of keywords that they think might be contained
within the collection developed through the custodian interview process, they run the terms against the
ingested data, and – instead of looking at the documents – they review the limited information provided by
search reports to evaluate which documents are responsive. Like a volley, the keywords are then refined and lobbed back over to be run again, back and forth, iteration after iteration, until everyone is comfortable with the results.
The problem is that review is focused on what is already known or believed to be known and the whole
process doesn’t actually involve the content of the data. Decisions are made by looking at percentages on
a spreadsheet. To put this in perspective, imagine running a Google search and instead of pulling up results
with snippets, hit highlights, and Web page addresses, you simply receive the total number of pages that hit
on the keyword searches that were run. This is essentially what happens with the current discovery search
process. It provides no context, no details as to why the searches are ineffective, no understanding of how
to refine the searches, and no indication of other areas that should be explored.
Technology exists today that makes it affordable and practical to dive into the data earlier in the process and then develop keywords that carve up the data and prioritize what gets looked at, and when. Preliminary searches around the basics of the case, the key timeframes, and the participants should drive keyword development – not the other way around. When a subset of the data is investigated first, building and refining a list of keywords is an easier and much more fruitful exercise.
The process of developing keywords first and searching second is also problematic in the case of custodians. As the Blair and Maron study showed, reviewers who believe they have found 75 percent of what is relevant may in reality have found only 20 percent. An effective use of keywords would start by interviewing key personnel,
loading data while developing searches from information gathered in interviews, investigating results to find
additional points of interest and low recall, refining, and then re-interviewing as needed to understand and uncover
more – then repeating if necessary. Using the data to direct custodian interviews, as opposed to letting
custodians point you to where the data might live, will help you uncover more valuable and useful information
and help to limit surprises during the review phase.
Treating eDiscovery Search as Enterprise Search
As noted by Kamal Shah in his article, “Enterprise Search vs. E-Discovery Search: Same or Different?”8,
enterprise search is built for speed and simplicity: it is designed to process a single search query and deliver results for that particular search. As in the case of Google, the search engine executes the query and the user isn’t privy to what the engine has actually searched. In eDiscovery, speed and simplicity are also
important, but in addition – as Judge Grimm stated in the Victor Stanley case – users have to be able to show
how their searches were executed.
Searching with Blinders On
In discovery, people tend to begin a search thinking they know what they’re searching for, and they often don’t deviate from their original assumptions. Keyword searching has traditionally been a culling process to
remove nonresponsive documents and has been treated as a linear progression, with search terms being
used to narrow down what is relevant to the case. With these approaches, key themes, documents and even
potential custodians can get overlooked.
Search terms should be used to target what you know and investigation of those results should be used
to uncover what you don’t know. Knowledge is power and it’s hard to expand your knowledge if you don’t
research the things you don’t know. Keywords should guide the direction of your investigation, not limit it.
Broadening the scope helps identify things such as sidebar conversations that might not get picked up by
blind searches. The themes contained within a data set can go in many directions that all get to the same
conclusion – how can you reasonably find the themes when you are only chasing one of them?
A New Approach
In addition to fixing some of the problems that are currently hindering search effectiveness, there are other
changes and technological solutions that would greatly improve the ways in which keyword searches are
conducted and employed.
Search Needs Technology and a Human Touch
Keyword searching should be a blended, iterative approach that combines both humans and technology
– never only one or the other. As the Blair and Maron study demonstrates, a computer alone can’t reliably distinguish relevant documents from irrelevant ones by searching their full text. Once a person understands the case at hand and the language in the documents, then technology can be applied. A computer can’t interpret context and nuance; a person can.
Investigate Data Early and Often
Spending more time at the beginning of the process to fully understand the data helps ensure that the
data you’re moving into review is most likely to be relevant. Search is not just about which documents
should be produced – early analysis can provide you with insight into the case itself.
The more you investigate the data on the front end, the more you can uncover, understand, plan, and
make informed decisions throughout the process. Testing search terms early gives you a feel for what’s
relevant and what’s not and will inform your direction. You’ll start to uncover documents and individuals
who weren’t necessarily on your radar screen, and common challenges, such as uncovering a key custodian
at the eleventh hour of review, can potentially be avoided.
Start with what is known and expand out from there. There’s always information with which to start –
use that to point you in the right direction and help develop the search terms. Take a few established things
(key time periods of interest, custodians who would have had direct involvement in the matter, the basics of
what the litigation is about) to set the direction of the case and go where the data points you. Pick a term to
use as a starting point and search through the metadata: file names, email subjects and other areas where
custodians are more likely to use standard language to talk about the subject of interest. From there, you can look into the documents themselves to identify the people referenced and other relevant details.
In the Victor Stanley case, Judge Grimm cites The Sedona Conference® Best Practices, stating, “In this regard, compliance with The Sedona Conference® Best Practices for use of
search and information retrieval will go a long way toward convincing the court that the method chosen was
reasonable and reliable.”
The Sedona Conference® Best Practices document suggests that key components of an effective search include:
- Testing searches to identify whether they are producing over- or under-inclusive results.
- Sampling documents determined to be privileged or not privileged to arrive at a comfort level that the categories are neither over-inclusive nor under-inclusive.
- Getting iterative feedback by refining searches based on the testing results and validating the refinements.3
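As a rough illustration of the sampling component, the sketch below estimates how over- or under-inclusive a search is by reviewing random samples of its hits and its non-hits. Everything here is simulated; in practice the relevance labels come from human reviewers, not a random generator:

```python
import random

# Simulated populations (hypothetical): True means "relevant".
# Pretend the search returned 5,000 hits of which ~35% are truly relevant,
# and left 20,000 non-hits of which ~2% are relevant (missed documents).
random.seed(7)
hits = [random.random() < 0.35 for _ in range(5000)]
non_hits = [random.random() < 0.02 for _ in range(20000)]

def sample_rate(population, n, rng):
    """Fraction of a random sample of size n that is relevant."""
    return sum(rng.sample(population, n)) / n

rng = random.Random(42)
precision_est = sample_rate(hits, 400, rng)      # over-inclusiveness check
elusion_est = sample_rate(non_hits, 400, rng)    # under-inclusiveness check
print(f"~{precision_est:.0%} of hits relevant; "
      f"~{elusion_est:.1%} of non-hits relevant")
```

Reviewing a 400-document sample is far cheaper than reviewing the full populations, yet it yields defensible estimates of both error directions – which is precisely the comfort level the Sedona commentary describes.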
Find the Right Solution
Look for a solution that:
- Allows you to weed out false positives – To avoid inclusion of irrelevant files, you need the ability to detect variations in the data. A solution that identifies uniqueness and outliers will help pinpoint issues with particular search terms.
- Lets you establish workflows and replicate exercises – Data tells a story. As you add more data, that story
can veer in many directions. You need a solution that can help you retrace your steps, easily apply what you
learned in a dynamic fashion, and sample areas of the population that have gone unexplored. Define an iterative
process and continue to re-visit it until there is comfort that a reasonable effort has been made to uncover the documents relevant to the request, with documentation showing what was done and the steps taken along the way.
- Has flexible and faceted search capabilities – This allows keywords to be removed, expanded, stemmed,
and modified iteratively during processing. It also provides the ability to apply, evaluate, and organize searches, as well as to group and classify isolated sets of documents and comment on why they’re important.
- Provides detailed reporting throughout the process – Instead of producing one final search report, look for
a tool that generates details about data early and often.
- Gives you the ability to transfer knowledge and background to other involved parties – Make search term development and history transparent to people further downstream in the process, including opposing counsel
and the courts, if needed.
- Offers users Web-based access to the data – This gives users the ability to receive reports and dive deeper
into data directly from their browsers.
The Net Net
Keyword searching is currently a powerful but misused tool. Properly used, it can produce less – but more applicable – data in a more cost-effective way. Re-examining how keywords are applied and choosing the right
tool for the job are vital to making that happen. It’s not just a tool to whittle down data, but a way to investigate
data to discover more and review less. Taking the time up-front to assemble and examine the data to develop
and iterate on keywords saves time and money further downstream. This measure-twice-cut-once strategy is
a reliable way for legal departments and law firms to develop more effective and predictable budgets. And
providing transparency into the process of developing keywords will minimize legal exposure and meet the
expectations of the courts.
1 David Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, Communications of the ACM (March 1985)
2 F. Zhao, D. Oard and J. Baron, Improving Search Effectiveness in the Legal E-Discovery Process Using Relevance Feedback, (June 2009)
3 The Sedona Conference®, Best Practices Commentary on Search and Retrieval Methods, (August 2007)
4 Philip Favro, Mission Impossible? The eDiscovery Implications of the ABA’s New Ethics Rules, (August 2012)
5 Rand Corporation, Where the Money Goes, (2012)
6 Ralph Losey, Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search, (2009)
7 Milton I. Shadur, Kleen Products, LLC, et al v. Packaging Corporation of America, (April 2011)
8 Kamal Shah, Enterprise Search vs. E-Discovery Search: Same or Different?
About the Author
Jeff Fehrman is Chief Strategy Officer of Mindseye Solutions. Mr. Fehrman spent 3 years at
Integreon, where he provided consulting and advisory services to law firms, corporations, and
government clients throughout the eDiscovery lifecycle.
Prior to his tenure at Integreon, Mr. Fehrman spent 7 years as a vice president with ONSITE3.
He’s currently on the Board of Governors for the Organization of Legal Professionals (OLP), has been an active member of the Sedona Conference’s Working Group on Electronic Document Retention and Production since 2006, and is co-founder of EDD Blog Online. He is based in Arlington, VA.