The Best-Kept Secrets to Using Keyword Search Technologies

Part 2 – Building Structured Searches

By Philip Sykes and Richard Finkelman

 

Introduction

In Part 1, we covered the two major indexing and search engines most often used in eDiscovery solutions, dtSearch and Lucene. Now we will look at using a structured methodology for building searches. We will work with dtSearch syntax, but the same approach applies when using a Lucene-based eDiscovery solution.

Proposed Search Terms

When starting with a set of proposed search terms, the first step is to organize and clean up the list. In constructing your searches, it is recommended that, unless you work with a case-sensitive index, you use lower case for the search terms and upper case for the Boolean operators to differentiate them. In this phase, identify:

  • Similar sets of terms that can be efficiently combined. For example, if there are several AND searches containing common words like "contract AND sales," "contract AND modification," or "executed AND contract," these can be combined into [(contract AND (sales OR modification OR executed))].
  • Terms that are unnecessary because they would be found by other variations of the words that are in the list. For example, if the list contains "contracts," "contracted," "contractual," and "contract*," the only term needed is contract*, since it will return everything the other three terms would find, and presumably more.
  • Terms that need to be extended. For example, a list containing names like "William Jones" and "Robert Smith" probably needs to be extended to include "Bill Jones," "Bob Smith," and possibly "Rob Smith." People’s names should also normally be converted to proximity searches like [((bill OR will*) w/3 jones)] and [((bob OR rob*) w/3 smith)].
  • Terms with the same spelling but containing different punctuation and cases (unless there will be a case-sensitive index). For example, if the list contains the words "case sensitive" and "case-sensitive," only "case sensitive" is needed because the hyphen is treated as a space during indexing. If the list contains "Sales Agreement" and "sales agreement," only "sales agreement" is needed.
  • Problematic proximity searches. Normally these will be "nested" searches like "certified w/5 records w/2 filed." While the search may return results, it is an ambiguous search that needs to be properly defined to ensure that the dtSearch syntax is correct. The first step for cleaning up this nested proximity search is to add parentheses to clarify the search requirements:
  •  Is the request [((certified w/5 records) w/2 filed)], or should it be [(certified w/5 (records w/2 filed))]?
  • If we assume the desired search is [(certified w/5 (records w/2 filed))], then we need to know if just "records" must be within five words of "certified" or if either "records" OR "filed" can be within five words of "certified" as long as "records" is within two words of "filed." In other words, which of the following sentences should be returned by the search?
    1. He left after he filed the records that had been certified, or
    2. He certified that he had filed the records.

If only "certified" needs to be within five words of "records" (as in sentence 1 but not sentence 2), then the correct dtSearch syntax is:

[((certified w/5 records) AND (records w/2 filed))]

If either "records" or "filed" needs to be within five words of "certified" (as in both sentences 1 and 2), then the correct dtSearch syntax is:

[(((certified w/5 records) OR (certified w/5 filed)) AND (records w/2 filed))]
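
One way to see the difference between the two forms is to check the word distances directly. The short Python sketch below is purely illustrative (real engines apply their own tokenization and noise-word rules); it compares word positions and reports whether two terms fall within a given number of words of each other, approximating how dtSearch evaluates w/N:

    # Illustrative only: approximate dtSearch w/N by comparing word positions.
    def within(text, term_a, term_b, n):
        words = text.lower().split()
        pos_a = [i for i, w in enumerate(words) if w.startswith(term_a)]
        pos_b = [i for i, w in enumerate(words) if w.startswith(term_b)]
        return any(abs(a - b) <= n for a in pos_a for b in pos_b)

    s1 = "He left after he filed the records that had been certified"
    s2 = "He certified that he had filed the records"

    for s in (s1, s2):
        print(within(s, "certified", "records", 5),   # certified w/5 records
              within(s, "certified", "filed", 5),     # certified w/5 filed
              within(s, "records", "filed", 2))       # records w/2 filed

Sentence 1 satisfies [((certified w/5 records) AND (records w/2 filed))]; sentence 2 does not, because only "filed" falls within five words of "certified," so it is returned only by the broader OR form.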

  • Using Boolean operators without any proximity boundaries means that the terms joined by the operators can appear anywhere in the document. So, when you start testing, if a search returns unexpectedly large numbers of documents, one consideration should be whether it is feasible to constrain the boundaries by using one or more proximity searches.
  • You must understand a couple of important factors when using a Lucene-based solution. First, since Lucene does not support wildcards inside quotes, you will need to identify a reasonable list of words to use in place of the wildcard(s). Some Lucene-based tools let you view the words in the index; if yours does, simply use the Lucene index to find the variations. If it does not, but you have a dtSearch index, start a new search in dtSearch and type the word up to the location of the wildcard character; you will see the different words that are in the index. If neither option is available, run a search for the standalone term with the wildcard character (e.g., contract*) and note the number of documents returned, then run a search for [(contract* AND NOT (contract OR contracts OR contracted OR contractual))]. (Note: If you identify any additional variations that should be included, add them to the end of the list.) The result will most likely be significantly smaller, and searching through the remaining documents for additional reasonable terms should be easy.
  • The other important factor, as explained in Part 1, is that Lucene doesn’t support Boolean operators inside a proximity search. Therefore, you need to break the query out into individual phrase searches as efficiently as possible.
  • For example, if our dtSearch query is [((oil OR gas) w/10 (stock* OR future* OR swap*))], the parsing and reformatting would create the Lucene query ["oil stock"~9 OR "oil stocks"~9 OR "oil future"~9 OR "oil futures"~9 OR "oil swap"~9 OR "oil swaps"~9 OR "gas stock"~9 OR "gas stocks"~9 OR "gas future"~9 OR "gas futures"~9 OR "gas swap"~9 OR "gas swaps"~9], although it may be necessary to increase the ~9 to ~11 to deal with the order of the search terms, or to expand the query to include all of the phrases with their order reversed. (Note: The Lucene slop value should be one less than the dtSearch proximity value, because the dtSearch proximity distance is "within," or the number of words between the terms plus one.)
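
Doing this expansion by hand is tedious and error-prone, so it is worth scripting. The sketch below is a simplified illustration, assuming any wildcards have already been replaced with the explicit terms found in the index; it generates one quoted phrase per term combination and joins the phrases with OR:

    from itertools import product

    # Simplified sketch: expand (left terms) w/N (right terms) into Lucene
    # phrase-proximity queries. Wildcards are assumed to have been expanded
    # already (e.g., stock* -> stock, stocks).
    def dtsearch_to_lucene(left_terms, right_terms, within):
        slop = within - 1  # Lucene value is one less than the dtSearch w/N value
        phrases = ['"{} {}"~{}'.format(a, b, slop)
                   for a, b in product(left_terms, right_terms)]
        return " OR ".join(phrases)

    left = ["oil", "gas"]
    right = ["stock", "stocks", "future", "futures", "swap", "swaps"]
    print(dtsearch_to_lucene(left, right, 10))

This reproduces the query shown above; as noted, the value after the tilde may need to be increased, or the reversed-order phrases added, if the terms can appear in either order.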

Create a Test Set of Documents

It is often helpful to work with a test document set, particularly when complex proximity searches are to be evaluated. If a set of documents has already been reviewed and identified as responsive in the matter, it should be included in the test set along with a group of documents that is not responsive. If complex proximity and Boolean searches like those shown above will be used, it is helpful to build test documents containing the words and spacing needed to verify that the search syntax is correct.

Break the List of Terms into Discrete Parts and Test

The next step is to run parts of the keyword terms as preliminary searches and record the resulting numbers. If there are many single terms or phrases, they can be combined with OR operators. But if the combined search returns a large number of documents, it is often helpful to split it to determine which term(s) bring back large numbers of documents. If you determine that a few terms return excessive numbers of documents, consider alternatives to limit the results: for example, look for other terms that could be combined with an AND, or consider using a proximity search to narrow the number of documents being found.
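
When the combined search is a flat OR of discrete terms, it is easy to generate the component searches for individual testing. The snippet below is a minimal illustration using a made-up list of terms:

    # Split a flat OR search into its component searches for individual testing.
    terms = ["contract*", '"sales agreement"', "(jeff* w/2 skilling)", "(ken* w/2 lay)"]

    combined = "(" + " OR ".join(terms) + ")"
    print(combined)      # run this first and note the total count
    for term in terms:
        print(term)      # then run each component and compare its count to the total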

Tracking the evolution of your search is extremely important. Microsoft Excel works well not only to track the steps, but also to help make sure the query is assembled correctly. As you build the worksheet, make sure you put each search inside parentheses, for two reasons: first, it keeps the various searches discrete when you begin combining them later; and second, it prevents Excel from removing quotation marks when you paste the search terms into a cell.

Here is an example where we have run two separate proximity searches and entered the counts of documents, not hits, returned by the search. Once the phase 1 searches have been completed, total the counts of the individual searches:

Search String Phase 1        Counts    Totals
(jeff* w/2 skilling)            164
(ken* w/2 lay)                  274       438

After the preliminary searches have been run, inspect the outliers to see if there is something wrong with the syntax that causes either an extremely high or low count. If the syntax is correct and the counts are high, consider revising the search to be more focused, if possible. For example, if it contains a proximity search, is the proximity value too high? If the Boolean AND operator is used, should it be tightened up by switching to a proximity search rather than by looking for the words anywhere in the document?

Once the review of the phase 1 searches is complete, start concatenating them by combining them with OR, and note the combined counts.

Search String Phase 2                        Counts
((jeff* w/2 skilling) OR (ken* w/2 lay))        344

Make sure that the combined search count is at least as large as each of the phase 1 search counts (it will usually be larger) and no greater than the phase 1 total. The following examples show what happens if parentheses are not included around each preliminary search when you start concatenating:

Search String Phase 1                             Counts    Totals
(oil OR gas) w/10 (stock* OR future* OR swap*)     1,113
bank* AND (gas OR oil)                              1,927     3,040

These two phase 1 searches contain proper syntax when run individually, but after combining the two searches the results are:

Search String Phase 2                                                       Counts
(oil OR gas) w/10 (stock* OR future* OR swap*) OR bank* AND (gas OR oil)     1,255

The resulting count is greater than the first phase 1 search but less than the second. After adding the missing parentheses, the corrected search results are:

Search String Phase 2                                                             Counts
(((oil OR gas) w/10 (stock* OR future* OR swap*)) OR (bank* AND (gas OR oil)))     2,549

This result meets the criteria of being greater than both of the phase 1 counts and less than or equal to the phase 1 total.
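
The bookkeeping behind this check is easy to automate alongside the tracking worksheet. The sketch below is a minimal illustration using the counts from the tables above: it wraps each phase 1 search in parentheses, joins them with OR, and confirms that the combined count falls between the largest individual count and the phase 1 total:

    # Minimal sketch: build the combined query and sanity-check its count.
    phase1 = {
        "(oil OR gas) w/10 (stock* OR future* OR swap*)": 1113,
        "bank* AND (gas OR oil)": 1927,
    }

    combined_query = "(" + " OR ".join("({})".format(q) for q in phase1) + ")"
    print(combined_query)

    combined_count = 2549  # count returned when the combined query is run
    assert combined_count >= max(phase1.values()), "combined count is too low"
    assert combined_count <= sum(phase1.values()), "combined count is too high"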

Continue the search string concatenation until you have a single, combined search. If you start with ten individual searches in phase 1, combine them in pairs so you have five searches after phase 2, then continue combining, following the same process, until you are down to a single search.

If you have created an index of known responsive and known non-responsive documents, test the search against both. The search should return all of the known responsive documents; if any documents are not returned by the search, examine those documents to figure out why they were not found, and modify your search accordingly. The search against the index of known non-responsive documents will hopefully not return a significant number of documents. If it does, locate the terms that register hits, and see if the related search terms can be made more focused without excluding any of the known responsive documents.
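
If document identifiers are available for the test set, the responsiveness check can be expressed as a simple set difference. The IDs below are hypothetical placeholders:

    # Every known responsive document should be returned by the combined search.
    known_responsive = {"DOC-0001", "DOC-0002", "DOC-0003"}  # hypothetical IDs
    returned_by_search = {"DOC-0001", "DOC-0003"}            # hypothetical IDs

    missed = known_responsive - returned_by_search
    if missed:
        print("Not returned by the search; examine and refine:", sorted(missed))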

Have a Good Text Editor

A good text editor is essential for building your searches. Microsoft Word’s smart quotes can cause inconsistent behavior with some search tools. The problem is not related to the documents that were indexed, but to how the search tool parses the query text. While you can configure Word not to use smart quotes, you may want them for normal document work; instead, replace any existing smart quotes and apostrophes in the search terms with straight quotes and apostrophes before running the searches.
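
If the search terms have passed through Word, a quick normalization pass takes care of this. A minimal Python sketch that swaps the common curly quote characters for their straight equivalents:

    # Replace Word "smart" quotes and apostrophes with straight equivalents.
    SMART_TO_STRAIGHT = str.maketrans({
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophes
    })

    def normalize_quotes(text):
        return text.translate(SMART_TO_STRAIGHT)

    print(normalize_quotes("\u201csales agreement\u201d AND \u201ccontract\u201d"))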

Notepad++ is an excellent open-source text editor with several helpful features, and it is also Unicode compliant. One of the most helpful is that it highlights matching parentheses, making it easy to spot an unbalanced search phrase. Figures 1 and 2 show how mismatched and matched parentheses are displayed:

Figure 1: Mismatched Parentheses

 

Figure 2: Matched Parentheses
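
Outside the editor (for example, when assembling long queries in a worksheet or script), the same check can be run programmatically. A small illustrative sketch that verifies the parentheses in a query are balanced:

    # Quick check that every parenthesis in a query is matched.
    def parens_balanced(query):
        depth = 0
        for ch in query:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:   # a ')' appeared before its matching '('
                    return False
        return depth == 0

    print(parens_balanced("((jeff* w/2 skilling) OR (ken* w/2 lay))"))  # True
    print(parens_balanced("((jeff* w/2 skilling) OR (ken* w/2 lay)"))   # False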

Summary

Learning how to build structured searches is not just a skill needed for effective culling of data prior to loading a data set for document review. Properly constructed keyword searches are valuable during document review and production to sample and test results. It is not uncommon to pick up new keywords, or new variants of existing keywords, as a review progresses. The ability to go back to your previous searches, including searching data that was not originally deemed relevant, is one of the best-kept secrets of keyword analysis.

Finally, the growing use of predictive coding has led some people to believe that keyword searching is no longer a valuable skill set; however, as with all new technologies, there are several evolving issues that keyword searching can address. It is still a valuable tool for culling the number of documents that go into a predictive review system, and a growing number of predictive systems support this approach. Keyword searching in predictive systems is also a helpful skill for quality control during post-review analysis. It should be clear that understanding how the different tools work and having a consistent, structured approach to building your keyword searches are fundamental to a successful project, regardless of which review technology is being used.

 

About the authors

Richard Finkelman

Richard Finkelman is a Director and Practice Group Leader of Berkeley Research Group’s Electronic Discovery Practice. Mr. Finkelman brings more than 25 years of experience helping clients manage information in litigation, regulatory, and business matters. His experience includes assisting clients with all aspects of litigation support in complex matters ranging from securities class actions to intellectual property disputes to high-profile regulatory matters.

 

 

Philip Sykes
Philip Sykes is a Senior Managing Consultant with more than 20 years of experience working in the fields of litigation support and electronic discovery. His experience includes work on high-profile cases ranging from HSR Second Requests issued by the FTC and DOJ to IP matters to complex securities cases. His expertise includes keyword analytics, data processing and analysis, data management, online review tools, and document productions. He regularly assists counsel and experts in understanding what information exists in databases and how it is relevant to their matters.

 
