Legal TREC: The Next Generation
By Herb Roitblat, Ph.D.


 

A major issue in electronic discovery is cost.  Much of the discussion about reducing that cost centers on whether machines can deliver a level of accuracy similar to that of human review.  There is some evidence to suggest that this is not a very high standard to meet.

The National Institute of Standards and Technology sponsors an annual Text Retrieval Conference (TREC).  Since 2006, this conference has included a legal track.  The purpose of TREC in general is to support the development of effective information retrieval tools through comparative analysis.  Multiple teams, using whatever technology they want, work on the same data set toward the same goals, and their performance is compared and published.  The Legal Track of TREC is designed to "assess the ability of information retrieval technology to meet the needs of the legal community for tools to help with retrieval of business records …  [and] to better educate the legal community on the feasibility of automated retrieval as well as its limitations."

The report of the 2008 analysis came out recently.  It contains a good deal of interesting information, not just about technologies that might be used in electronic discovery, but also about the accuracy of human review.

Each year, the TREC track coordinators choose a set of topics, a set of data, and tasks to evaluate.  In the legal track, these are intended to mimic as well as possible the kinds of situations that people actually face during discovery.  Each participating team submits results, which are then analyzed by a team of assessors. 

In the legal track, the assessors are primarily second- and third-year law students. Each assessor averaged only about 21.5 documents per hour, so the average assessor took about 23.25 hours to review 500 documents, a substantial commitment of time and effort from a volunteer. If we had to pay a contract attorney $100 per hour, that would be $2,325 to review 500 documents, or $4.65 per document.  In a case with several hundred thousand documents, the cost adds up.
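The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function name `review_cost` and the 300,000-document case size are assumptions made for the example, while the review pace and the $100 hourly rate are the figures quoted above.

```python
# Figures quoted in the article; the hourly rate is the article's hypothetical.
DOCS_PER_HOUR = 21.5
DOCS_PER_ASSESSOR = 500
HOURLY_RATE_USD = 100.0


def review_cost(num_docs, docs_per_hour, hourly_rate):
    """Return (hours, total_cost, cost_per_doc) for a manual review."""
    hours = num_docs / docs_per_hour
    total_cost = hours * hourly_rate
    return hours, total_cost, total_cost / num_docs


hours, total, per_doc = review_cost(DOCS_PER_ASSESSOR, DOCS_PER_HOUR, HOURLY_RATE_USD)
print(f"{hours:.2f} hours, ${total:,.2f} total, ${per_doc:.2f} per document")

# Assumed case size for illustration only: 300,000 documents at the same pace and rate.
_, big_total, _ = review_cost(300_000, DOCS_PER_HOUR, HOURLY_RATE_USD)
print(f"300,000 documents: ${big_total:,.0f}")
```

Without intermediate rounding, the per-document cost still comes to about $4.65, and a case of the assumed size would run to roughly $1.4 million at that pace and rate.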

In 2006, the TREC coordinators collected a set of 25 relevant and 25 non-relevant documents from each topic as judged by one assessor and gave them to a second assessor for an independent assessment. They found that the two reviewers agreed on about 76% of the documents. In addition, some of the topics used in 2008 matched those used in earlier years, so it was possible to compare the judgments made about the same documents and same topics in two separate years. 

Ten documents from each of the repeated topics that were previously judged to be relevant and ten that were previously judged to be non-relevant were assessed by 2008 reviewers. Of the documents that were judged relevant previously, the 2008 assessors judged just 58% to be relevant.  Conversely, the 2008 assessors judged as relevant 18% of the documents previously judged to be non-relevant. Overall, the 2008 assessors agreed with the previous assessors 71.3% of the time.  Other studies outside of TREC have found similar levels of (dis)agreement.
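To see how the two per-class rates combine into an overall agreement figure, here is a minimal sketch. It assumes exactly ten previously relevant and ten previously non-relevant documents per topic; the function name `overall_agreement` is made up for the example.

```python
def overall_agreement(n_prev_relevant, n_prev_nonrelevant,
                      rate_relevant_confirmed, rate_nonrelevant_flipped):
    """Fraction of documents on which the second review matches the first.

    rate_relevant_confirmed: share of previously relevant documents that the
        second review also judged relevant.
    rate_nonrelevant_flipped: share of previously non-relevant documents that
        the second review judged relevant.
    """
    agree_relevant = n_prev_relevant * rate_relevant_confirmed
    agree_nonrelevant = n_prev_nonrelevant * (1.0 - rate_nonrelevant_flipped)
    return (agree_relevant + agree_nonrelevant) / (n_prev_relevant + n_prev_nonrelevant)


# Assumption: perfectly balanced samples of ten and ten per topic.
balanced = overall_agreement(10, 10, 0.58, 0.18)
print(f"Implied overall agreement: {balanced:.1%}")
```

Under that balanced-sample assumption the implied agreement is about 70%, close to the reported 71.3%. The small difference presumably reflects details of the actual sample that are not given here.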

How do we interpret these findings?

These levels of (dis)agreement do not appear to be wildly different from those found in other studies. Inconsistency between assessors presents challenges to any study of information retrieval effectiveness and to the process of legal discovery.

When two different assessors or reviewers disagree about the responsiveness of a document, one of them is "wrong."  They could differ for certain systematic reasons.  One may be more concerned about the consequences of failing to produce responsive documents than another, for example.  They can also disagree for essentially random reasons.  One may be sleepier than the other.  One might be distracted.

Whatever the source of disagreement, the use of multiple reviewers in a case is common, but comparing their decisions is not.  The TREC results and others suggest that there are likely to be similar levels of inconsistency in these cases. If we take the prior-year reviews as the standard against which to measure the 2008 assessors, those assessors found only 58% of the documents deemed to be relevant by the prior review, that is, 58% recall. Similarly, the 2006 study found that the second reviewer recognized as relevant, again, only 58% of the documents deemed relevant by the first reviewer. I do not believe that these results are an artifact of the TREC processes or procedures. Rather, I think that this level of inconsistency is endemic in the process of having multiple reviewers review documents over time.

Standard review practice, in other words, may be grossly under-delivering responsive documents. At the very least, attorneys should seek to measure the consistency of their reviewers and the effectiveness of their classifications.  On the other side, matching this level of accuracy is a fairly low bar for computers.  It is well within the realm of current technologies.
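One simple way to measure reviewer consistency is to have two reviewers code the same sample of documents and compare their decisions. The sketch below is an illustration only: the function name `compare_reviewers` and the ten sample decisions are made up, and the first reviewer is treated as the standard, in the same way the prior-year reviews were treated above.

```python
def compare_reviewers(first, second):
    """Compare two reviewers' responsiveness calls on the same documents.

    first, second: equal-length lists of booleans (True = responsive).
    Returns overall agreement and the second reviewer's recall, taking the
    first reviewer's decisions as the standard.
    """
    if not first or len(first) != len(second):
        raise ValueError("both reviewers must code the same non-empty sample")
    pairs = list(zip(first, second))
    agreements = sum(a == b for a, b in pairs)
    first_relevant = sum(first)
    both_relevant = sum(a and b for a, b in pairs)
    recall = both_relevant / first_relevant if first_relevant else None
    return {"agreement": agreements / len(pairs), "recall_vs_first": recall}


# Made-up decisions on a ten-document sample, for illustration only.
first_reviewer = [True] * 5 + [False] * 5
second_reviewer = [True, True, True, False, False,
                   False, False, False, False, True]

result = compare_reviewers(first_reviewer, second_reviewer)
print(result)
```

In practice the sample would need to be much larger than ten documents, and drawn at random, before the resulting figures could be trusted.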

 

 


Herb Roitblat, Ph.D.
CEO
Orcatec

 

 





 
 


The Organization of Legal Professionals
44-489 Town Center Way Ste. D436
Palm Desert, CA 92260-2723
760-610-5462
info@theolp.org


The Organization of Legal Professionals is a non-profit organization dedicated to higher continuing legal education and certification exams.  We offer comprehensive webinars, online courses, and training for litigation support, trial presentation, and E-Discovery.  The OLP also offers the first online litigation support salary & utilization survey, designed to give employees inside information to move their careers forward.

©2011 The Organization of Legal Professionals