Technology Assisted Review – Three Streams Flowing Into One River


Terry Dexter



The litigation world is undergoing something of a ‘Sea Change’ with regard to the Discovery Process. Driven by the advent of Electronically Stored Information (ESI), Discovery is slowly transforming from:

“part of the pre-trial litigation process during which each party requests relevant information and documents from the other side in an attempt to “discover” pertinent facts. Generally, discovery devices include depositions, interrogatories, requests for admissions, document production requests and requests for inspection1.”


to:

“any process in which electronic data is sought, located, secured, and searched with the intent of using it as evidence in a civil or criminal legal case. E-discovery can be carried out offline on a particular computer or it can be done in a network. Court-ordered or government-sanctioned hacking for the purpose of obtaining critical evidence is also a type of e-discovery.”2

Now, while Discovery’s traditional definition and devices are still valid and enforceable, they are slow, tedious and expensive. Electronic Discovery (eDiscovery) shows promise in reducing the volume of documents to be reviewed, thus saving time and money.

Several commercial businesses providing litigation support have sprung up in the past ten (10) years. This paper examines three (3) technologies through which eDiscovery may be conducted and examines pertinent issues brought about by eDiscovery.

Current eDiscovery Technologies

Predictive Coding

Predictive Coding is currently in vogue, due in large part to the DaSilva Moore3, Global Aerospace4 and Kleen Products5 cases currently being heard in their respective courts. Each of these cases involves Predictive Coding, but they vary in implementation, as explained by Brandon D. Hollinder on the eDiscovery Blogspot on April 25, 2012:

  • “In Da Silva Moore, the parties initially agreed to use predictive coding (although they never agreed to all of the details) and Magistrate Judge Peck allowed its use.  Plaintiffs have since attacked Judge Peck and most recently formally sought his recusal from the matter.  That request is currently pending.
  • Global Aerospace Inc., et al, v. Landow Aviation, L.P. dba Dulles, is the most recent case to address predictive coding, and it goes a step further than Da Silva Moore.  In Global Aerospace, the defendants wanted to use predictive coding themselves, but plaintiffs objected.  Virginia County Circuit Judge James H. Chamblin, ordered that Defendants could use predictive coding to review documents.  Like Da Silva Moore, the court did not impose the use of predictive coding, rather, the court allowed a party to use it upon request.
  • Kleen Prods., LLC v. Packaging Corp. of Am. goes the furthest, and is perhaps the most interesting of the three predictive coding cases because it is different than Da Silva Moore and Global Aerospace in one very important way: the plaintiffs in Kleen are asking the court to force the defendants to use predictive coding when defendants review their own material.  The court has yet to rule on the issue.”

All of the vendors in Gartner’s 2012 ‘Leaders’ quadrant utilize Predictive Coding in their products. Essentially, Predictive Coding:

  1. start[s] with a set of data, derived or grouped in any number of variety of ways (e.g., through keyword or concept searching);
  2. use[s] a human-in-the-loop iterative strategy of manually coding a seed or sample set of documents for responsiveness and/or privilege;
  3. employ[s] machine learning software to categorize similar documents in the larger set of data;
  4. analyze[s] user annotations for purposes of quality control feedback and coding consistency.6

Speaking purely from the technological perspective, Predictive Coding is merely an application of proven Bayesian7 statistical theory. In this case, it is used to reduce the volume of document files selected for discovery. The fundamental hypothesis driving the search may vary depending on which side is formulating it, but the model doesn’t really care: it will produce new data for repeated inclusion in subsequent runs (called the “wash, rinse, repeat” cycle by Sharon Nelson8) regardless of which side is conducting the search on the same corpus of information. Thus, it would appear the sides are almost arguing over how many angels can dance on the head of a pin.
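The seed-train-categorize loop in steps 1 through 4 above can be sketched with a toy Bayesian relevance classifier. This is a minimal illustration of the statistical idea, not any vendor's actual algorithm; the documents, labels and vocabulary below are invented:

```python
import math
from collections import Counter

def train_nb(docs):
    """Train a toy multinomial Naive Bayes model on (text, label) pairs.
    Labels: True = relevant, False = not relevant."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def posterior_relevant(model, text):
    """P(relevant | text) via Bayes' theorem with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Prior log-odds from the attorney-coded seed set.
    log_odds = math.log(doc_counts[True] / total_docs) - \
               math.log(doc_counts[False] / total_docs)
    for w in text.lower().split():
        p_w_rel = (word_counts[True][w] + 1) / (sum(word_counts[True].values()) + len(vocab))
        p_w_irr = (word_counts[False][w] + 1) / (sum(word_counts[False].values()) + len(vocab))
        log_odds += math.log(p_w_rel) - math.log(p_w_irr)
    return 1 / (1 + math.exp(-log_odds))

# Step 2 of the workflow: a human-coded seed set (invented examples).
seed = [
    ("contract breach damages payment", True),
    ("merger agreement liability terms", True),
    ("lunch menu friday cafeteria", False),
    ("holiday party schedule rsvp", False),
]
model = train_nb(seed)

# Step 3: the machine categorizes the larger corpus; documents scoring
# highly would be routed to reviewers, closing the "wash, rinse, repeat" loop.
corpus = ["breach of contract payment dispute", "cafeteria menu for friday"]
scores = {doc: posterior_relevant(model, doc) for doc in corpus}
```

In a subsequent pass, the reviewers' corrections to these machine scores would be folded back into the seed set and the model retrained, which is the iterative cycle the workflow describes.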

Does Predictive Coding work? Viewed from a strictly Bayesian perspective, yes: such models have been in use for years. Viewed from the Predictive Coding/litigation perspective, the smaller number (volume) of documents produced for relevancy coding does cut time compared to manual review. However, the question should really be posed as: “Are the hypotheses producing the desired results?” That is a question that surrounds each and every case involving discovery. The Defendant may certainly dump hundreds or even thousands of files to swamp Plaintiff; but could judicial sanctions be far behind in such circumstances? Conversely, each time Plaintiff requests additional time to continue refining their hypothesis, claims of ‘fishing trips’, and then judicial sanctions, may not be far behind.


Cost effectiveness over time – As the volume of ESI increases, cost effectiveness decreases, since ‘experienced attorneys’ must review each document for relevancy, slowing the overall process. Is there an upper limit to the ESI volume before Predictive Coding methods become too expensive and too slow (aka “Return to GO and collect $200”)?

Quality control – In assembling seed groups, preparing runs and coding produced documents for relevancy, quality is too often not a priority. One vendor, Compiled Services, LLC, suggests mistakes are frequently made in the electronic discovery process:

“What “mistakes” are we talking about? In the electronic discovery process of collecting, preserving, de-duplicating, filtering, culling, and reviewing ESI, each stage represents opportunities for errors ranging from entering improper date ranges to failure to accurately enter specific format requirements for tools utilized in downstream stages of the process. Like an assembly line, each step in the discovery process has its own issues related to configuration, setting parameters, calibrating specifications, and tired multi-tasking humans responsible for monitoring every aspect of billions of pieces of data. That is to say, quality is an overwhelming challenge for a discovery process that is dealing with ever-growing volumes of data with each passing day. The opportunity for minor mistakes, oversight, or simple carelessness comes from the fallible nature of people who simply cannot guarantee 100 percent focus and attention to such massive quantities of information day after day after day.”9

When it comes to defects (or ‘mistakes’), I prefer Dr. W. Edwards Deming’s first principle:

Create constancy of purpose toward improvement of product and service10

when dealing with human interaction with automated processes. In other words, no one involved in the project should permit a defect to be inserted into the end product. It’s an ‘attitude thang’ that should permeate each and every member of the organization conducting eDiscovery.

Unknown scalability – Is there an upper limit to the size of the corpus of documents to be searched? How many runs become too burdensome to complete the search? It is far too early to consider scalability a pressing issue, but it should be in the back of people’s minds as the technique becomes ubiquitous.

Semantic Web

Whereas Predictive Coding applies statistical analysis to identify a group of documents that may have relevant case information, it is not the be-all/end-all. Even with the explosive growth in Electronically Stored Information, surely there is a better means to obtain legal information in a concise manner. For that matter, what humans call ‘information’ is merely ‘data’ (actually 1s and 0s) to a computer. How can these two parts of the equation overcome what is essentially a communications roadblock? Enter the Semantic Web.

The Semantic Web is not one single concept, but a natural outgrowth of the original World Wide Web (WWW) design. Sir Tim Berners-Lee, the creator of Web 1.0, describes it as an extension of the Web “in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”11 It is comprised of a set of design principles and technologies (see illustration12) to formalize the representation of meaning between humans and computers.

[Illustration: Generalized ontology structure]

To Zachary Adam Wyner13, the aspect of the Semantic Web most pertinent to this discussion is the concept of ‘ontology’14 or vocabulary. This vital tool is used to create a unique and formalized structure, syntax and vocabulary for the ‘world’ it describes, in this case the legal world. Comprised of several inter-operating layers, an ontology filters and translates human-machine understanding. The fundamental component upon which the ontology is based is the eXtensible Markup Language (XML). XML’s purpose is to transport and store data, with focus on what the data is15. XML is used to create file descriptors, known as ‘tags’, using the syntactic form:

<tagname> descriptor content </tagname>


<moviename> Star Trek </moviename>
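Such tags are machine-readable. As a quick sketch using Python's standard library (the surrounding tag names, values and document wrapper are hypothetical):

```python
import xml.etree.ElementTree as ET

# A hypothetical document wrapper: the tags describe what the data *is*.
record = """<document>
  <moviename>Star Trek</moviename>
  <author>J. Doe</author>
  <created>2012-05-01</created>
</document>"""

root = ET.fromstring(record)
# Pull each child tag and its text into a metadata dictionary,
# turning the markup into searchable descriptors.
metadata = {child.tag: child.text for child in root}
```

It is exactly this kind of extracted metadata that a richer Semantic Web layer builds upon.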

These tags can thus provide an additional, richer description of the contents of the file. Existing near the ‘bottom’ of the Semantic Web ‘stack’, XML is hardware and software agnostic. The components between the XML and User Interface & Applications layers are the formal ontologic languages RDF, OWL and SPARQL.

  • RDF (Resource Description Framework) provides the foundation for publishing and linking data.
  • OWL (Web Ontology Language) is used to build vocabularies.
  • SPARQL is the query language for the Semantic Web.

These languages, derived from World Wide Web Consortium (W3C) recommendations, provide a consistent means of information interchange between vocabularies. While they do provide a formal, standardized way of sharing ontology information, making this sharing work at the human level is left to the tools comprising the layers above them.
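To make the layering concrete, here is a toy sketch in Python of RDF-style triples and a SPARQL-like pattern query. The case names and predicates are invented for illustration; a real system would use rdflib or a triple store:

```python
# RDF models information as (subject, predicate, object) triples.
triples = [
    ("case:Zubulake", "law:decidedBy", "court:SDNY"),
    ("case:Zubulake", "law:concerns", "topic:eDiscovery"),
    ("case:DaSilvaMoore", "law:concerns", "topic:PredictiveCoding"),
]

def query(subject=None, predicate=None, obj=None):
    """Match triples against a pattern; None plays the role of a SPARQL variable."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Roughly: SELECT ?case WHERE { ?case law:concerns topic:eDiscovery }
hits = query(predicate="law:concerns", obj="topic:eDiscovery")
```

The point is that the query matches concepts and relationships declared in the vocabulary, not raw text strings.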

Vocabularies may be found at two (2) different levels of detail:

  • Core, which models general concepts believed to be central to the understanding of a world (e.g., law)
  • Domain, which focuses upon the representation of more specific areas (e.g., copyright) and is thus built for particular applications.

Building a core ontology covering all aspects of legal theory, practice, precedents, etc. is quite out of the question today. However, domain specific vocabularies covering more specific areas are not only possible, but are being considered by some European organizations.

Fortunately, we are not concerned about the ontological structure for the entire corpus of legal information. What we are focusing on is the body of information to be searched during eDiscovery. There are three (3) natural solutions to this thorny problem:

  1. The Defendant has previously compiled and constructed an ontology covering their entire documentation. This is the simplest solution and, thanks to the Sarbanes-Oxley (SOX) legislation, may already be in progress. Garrie and Armstrong16, while arguing the effects of SOX in light of the Zubulake17 decisions, make note of the following:

“Prior to Sarbanes-Oxley most public and private companies in industries other than financial services and healthcare did not have to comply with burdensome legally mandated data retention policies. Under Sarbanes-Oxley, however, public companies are distinguished from their private counterparts in that they must retain financial data in order to comply with the legislation. Not only are public companies forced to retain more data than private companies, but public companies are now required to maintain the data in an easily accessible manner.”

Companies have begun incorporating Semantic Web capability into Document Management (DM) systems. Jennifer Zaino reports:

… the movement to include semantic capabilities as part of DM systems has already started, <George Roth, president and CEO of Recognos Inc.> says. He cites as an example Microsoft SharePoint and the vendor’s $1.2 billion buy of search vendor FAST Search awhile back. “The shift to semantic search [for the enterprise] is happening big-time, and I think Microsoft is one of the leaders in this,” Roth says, even if Microsoft isn’t advertising the semantics behind its system.18

  2. The Defendant hires a document management firm, such as Recognos, to construct an ontology in response to litigation. Of course, the amount of time and costs associated with the effort would be subject to judicial review and approval.
  3. Revert to Predictive Coding or manual methods.

What does all this techno-babble mean to litigators? Semantic Web searches, as seen simply from surfing the Web, are capable of identifying potentially valid Electronically Stored Information (ESI) very quickly. Given the increasing use of electronic storage, backup and the attendant ‘metadata’ (descriptor tags) associated with ESI, properly framed queries have a higher probability of identifying relevant material. Depending upon the detail of the vocabularies involved, it may also be possible to identify and select relevant material in a fraction of the time necessary to conduct Predictive Coding. Such ‘maybes’ are dependent upon the design of the vocabulary, the amount and content of descriptor tags and the overall implementation of a domain-specific ontology.


  1. Massive up-front costs in creating unique domain level vocabularies – a brief review of the W3C Case Studies shows the level of effort needed to construct a corporate wide ontology. Corporate organizations from General Counsel, Finance, Security, Human Resources and, of course, Information Technology must provide input, guidance and monitor the evolving design.
  2. Massive deployment costs associated with the corpus of existent ESI – a corporation’s Intranet may possess some of the necessary connections, and the corporation’s CobiT effort may provide more. However, implementing a full scale vocabulary is labor intensive.
  3. Formal language and syntax – the English language is full of ambiguity, which means humans need to learn the formal (structured) language and syntax of the W3C Web components.

Promising eDiscovery Technologies

Technologists know that, given time, technology follows Moore’s Law19. Newer, better, more accurate tools and methods will be introduced and will replace the current ‘Top Dogs’. One need only look at the Cellular/Mobile Communications sector for proof. Considering that Electronic Discovery itself has been around for only about ten years, Technology Assisted Review is younger still.

Already we see court cases in which one party or the other is complaining about eDiscovery costs. Indeed, technology serves to reduce the cost (time and labor) of producing potentially valuable documents for review. Unfortunately, review still accounts for upwards of 70% of the total cost of a given eDiscovery project.

New technology and methods are coming that should reduce the costs of human review.

Natural Language Processing


We’ve briefly examined Predictive Coding’s algorithmic/human-review hybrid and the Semantic Web’s wrapping of document files in a formalized XML vocabulary. Wyner20 differentiates the Predictive Coding and Semantic Web methods as follows:

Predictive Coding is ‘knowledge light’ in that “… the processing presumes very little knowledge of the system or analyst”. Thus, when the statistical models are applied to the (often very) large population of documents, the contents are evaluated as meeting or not meeting query specifications.

Semantic Web is also ‘knowledge light’ but not to the same extent as Predictive Coding. In this method, wrapping a file with informational tags (metadata) does add some level of knowledge to the search. However, such searches are dependent upon the content of the tags, which, in turn, are dependent upon the knowledge the expert contributors bring to the design of the vocabularies.

However, there is a third method:

Natural Language Processing (NLP) is ‘knowledge heavy’ in that, rather than searching for similarities and/or differences or searching amongst tags, we know what we are looking for and we examine the actual file content.

The sort of ‘Natural Language Processing’ we speak of here is not the HAL 9000 sort of computer interface, where one speaks to a computer which responds in one’s own language. In this case, we are considering the myriad and literally uncountable words kept as Electronically Stored Information (ESI).

The written language of any culture is multi-dimensional in nature. Consider:

  • The approximate age of the writing can be established by the syntax and lexicon of the writer. For example, the epic poem Beowulf is written in Old English, which contains characters and pronunciations not found in Modern English. Thus, someone from this era reading the alliterative poem in its original form faces a challenge as great as Grendel and his mother.
  • The intellectual level of the writer(s) may be derived from the document’s lexical density21. The higher the density, the more information is being communicated by the writer. Consider reading John Locke’s First Principles in one sitting: not only does one need to wade through archaic language structures, the philosophical concepts themselves are hard to grasp.
  • The tone of the document is identified by the word choices and within the context of other, related documents. This is especially true with electronic mail or social networks. For example a serious peer reviewed document may contain examples where the author(s) hotly dispute another’s conclusions but never outright label those conclusions as “imbecilic”. In contrast, email or social web sites may contain language that would shame a sailor.
  • Some say the writer’s gender may be discerned through the words, syntax and imagery contained in the document. For example, no one could confuse the genders of the authors when reading a Tom Clancy or a Marion Zimmer Bradley novel, even if all information about the authors were withheld.
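Of these dimensions, lexical density is the easiest to make concrete: it is the share of content-bearing words among all words. A minimal sketch follows; the stopword set is a tiny, invented stand-in for a real function-word inventory:

```python
# Toy stopword set standing in for a full function-word list (assumption).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}

def lexical_density(text):
    """Share of content-bearing (lexical) words among all words in the text."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w]
    lexical = [w for w in words if w not in FUNCTION_WORDS]
    return len(lexical) / len(words)

d = lexical_density("The court granted the motion to compel production of documents.")
```

A dense legal holding would score high on this measure; casual email chatter, padded with function words, would score low.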

Each of these dimensions requires an innate ability that humans possess and that computers can be ‘taught’ but never acquire on their own. The ability to ‘Comprehend’ or ‘Understand’ concepts contained within a document, communication or series of documents is the hallmark of Human Reasoning. The RAND Corporation has an excellent summary of the effort a human reader encounters when dealing with reading and comprehension:

Comprehension does not occur by simply extracting meaning from text. During reading, the reader constructs different representations of the text that are important for comprehension. These representations include, for example, the surface code (the exact wording of the text), the text base (idea units representing the meaning), and a representation of the mental models embedded in the text22.

It is the uniquely human ability to create constantly changing internal models (or representations) of the text that allows the reader to fully comprehend the entirety. No computer in existence possesses the innate and inherent capability required to perform the task of comprehension. That is, unless and until an application program (or series of programs) is built telling the computer EXACTLY how to accomplish it.

Of course, even now, researchers are hard at work creating computer systems capable of accurately simulating human cognition and reasoning (aka ‘Artificial Intelligence’)23 24.

Legal Sector Implications

Whereas Predictive Coding and Semantic Web methods simply identify potentially valuable information, Natural Language Processing can be used to actually read the contents of a document and, eventually, assess that information for relevancy. It is this singular ability that leads to true lower costs for all concerned litigants.

“Natural Language Processing isn’t perfect yet: computers cannot understand human language. However, legal text is quite structured, and offers a lot more handholds for automated translation than, say, a novel”25.

Wyner and Peters have postulated what can be called an “interim solution” using semantic annotation within a document26.

“To analyse a legal case, legal professionals annotate the case into its constituent parts. The analysis is summarised in a case brief. However, the current approach is very limited:

  • Analysis is time-consuming and knowledge-intensive.
  • Case briefs may miss relevant information.
  • Case analyses and briefs are privately held.
  • Case analyses are in paper form, so not searchable over the Internet.
  • Current search tools are for text strings, not conceptual information. We want to search for concepts such as for the holdings by a particular judge and with respect to causes of action against a particular defendant. (emphasis added)

With annotated legal cases, we can enable conceptual search.”
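The conceptual search Wyner and Peters describe can be sketched as querying annotations rather than text strings. The case briefs, concept fields and values below are invented for illustration (the judge names merely echo cases discussed earlier):

```python
# Hypothetical annotated case briefs: each case carries concept annotations
# (judge, cause of action, holding) rather than bare text.
cases = [
    {"name": "Case A", "judge": "Peck", "cause": "negligence", "holding": "dismissed"},
    {"name": "Case B", "judge": "Peck", "cause": "breach of contract", "holding": "granted"},
    {"name": "Case C", "judge": "Chamblin", "cause": "negligence", "holding": "granted"},
]

def conceptual_search(cases, **criteria):
    """Return cases matching every concept/value pair, not a text string."""
    return [c["name"] for c in cases
            if all(c.get(k) == v for k, v in criteria.items())]

# "Holdings by a particular judge with respect to a particular cause of action":
result = conceptual_search(cases, judge="Peck", cause="negligence")
```

A plain string search for "Peck" would return two cases; the conceptual query narrows results by what the annotations mean, which is the point of the annotation effort.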

Conceptual analysis27 is a key NLP component. One example of conceptual analysis applied to detect plagiarism in student-submitted papers is described by Dreher28. Granted, the systems selected by Dreher approach ‘discovery’ using relatively common string-by-string comparison methods; such methods still require knowledge of language to identify relevant comparisons.
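A string-by-string comparison of the kind Dreher surveys can be sketched as word n-gram overlap. This is a generic illustration of the technique, not the specific systems he evaluated:

```python
def ngrams(text, n=3):
    """All word n-grams of the text, as a set of tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(a, b, n=3):
    """Jaccard overlap of word n-grams, a common string-matching heuristic."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

score = overlap_score(
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over a sleeping dog",
)
```

A score near 1.0 flags near-verbatim copying; even this simple measure depends on tokenizing language sensibly, which is the "knowledge of language" the text refers to.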

Dr. Kathleen Dahlgren and her team at Cognition Technologies have taken a different and highly interesting NLP approach.

1Downloaded from the ‘Lectric Law Library on 12Jun03

2Definition of electronic discovery (e-discovery or ediscovery) downloaded from on 12Jun03

3Monique da Silva Moore, et. al. v. Publicis Group SA, et al. Case No. 11-CV-1279 U.S. District Court for the Southern District of New York

4Global Aerospace Inc. v. Landow Aviation, L.P., No. CL 61040 (Va. Cir. Ct. Apr. 23, 2012) Circuit Court for Loudon County

5Kleen Products, LLC, et. al. v. Packaging Corporation of America, et. al. Case No. 10-CV-05711


6M. Whittingham, E. H. Rippey and S. L. Perryman quoting Jason R. Baron, Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ and Current Issues in E-Discovery Search, 17 RICH. J.L. & TECH. 9, 32 (Spring 2011) in Litigation Support Technology Review

7Bayesian statistics is an approach for learning from evidence as it accumulates. In clinical trials, traditional (frequentist) statistical methods may use information from previous studies only at the design stage. Then, at the data analysis stage, the information from these studies is considered as a complement to, but not part of, the formal analysis. In contrast, the Bayesian approach uses Bayes’ Theorem to formally combine prior information with current information on a quantity of interest. The Bayesian idea is to consider the prior information and the trial results as part of a continual data stream, in which inferences are being updated each time new data become available. Downloaded from Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials on 12Jun05

8“Predictive Coding: Dozens of Names, No Definition, Lots of Controversy”, Sharon D. Nelson, Esq., downloaded from on 12Jun13

9“Quality Control in the Age of Digital Data”, Compiled Services, LLC white paper, downloaded from on 12Jun03

10Deming W.E., Out of the Crisis, Chapter 2, “Elaboration on the 14 Points”, Published by Massachusetts Institute of Technology, Center of Advanced Education Services, Cambridge, MA, 1986

12Casellas, ibid.

13Wyner, A. Z. “Weaving the Legal Semantic Web with Natural Language Processing”, VoxPopuLII, 17May2010, retrieved from on 12Jun13

14Casellas, ibid – “ontology refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world.”

15“Introduction to XML” downloaded from on 12Jun13

16Garrie, D.B. & Armstrong, M.J. “Electronic Discovery and the Challenge Posed by the Sarbanes-Oxley Act”, downloaded from on 12Jun15

17Zubulake v. UBS Warburg LLC, 217 F.R.D. 309, 322 (S.D.N.Y. 2003)

19Moore’s Law is a computing term which originated around 1970; the simplified version of this law states that processor speeds, or overall processing power for computers will double every two years.

20Wyner, ibid

21Williamson, G., from Lexical Density: lexical density distinguishes (1) lexical words (the so-called content or information-carrying words) and (2) function words (those words which bind together a text).

23See: Science Daily pages on Artificial Intelligence and Cognition –

24See Leibniz Center for Law at

26Wyner, A. and Peters, W. “Semantic Annotations for Legal Text Processing using GATE Teamware”, retrieved from on 12Jul12.

27The division of a physical or abstract whole into its constituent parts to examine or determine their relationship or value

28Dreher, H., “Automatic Conceptual Analysis for Plagiarism Detection”, Issues in Informing Science and Information Technology, Volume 4, 2007


Beware the Jabberwock and Bright Shiny Objects!


Terry Dexter


I really haven’t seen such a commotion over the introduction of new ways of accomplishing the same goal since the introduction of dedicated “Word Processors” and “Word Processing Software”. Business was bombarded by new technology promising to reduce workloads, streamline business processes and, as I heard directly from one company president, ‘…eliminate that wasteful typing pool!’ Today, the legal sector is experiencing the very same phenomenon that business went through back in the early 1980s – Technology Assisted Review.

Now, I’m not a lawyer, merely an English major who has spent 30 years working in Information Technology. As such, I’ve seen slow adoption of new technology, wild successes and massive failures. One trend that is acknowledged but never really discussed is captured in this passage from a Rackspace whitepaper written by John Engates and Robbie J. Wright1:

“But even though cloud computing may seem like the ultimate platform for traffic flexibility, the closed platforms that some providers offer can reduce an enterprise’s provider flexibility. The industry refers to this phenomenon as “lock-in.” By using a given hosting company’s proprietary system or an ISV’s proprietary virtualization platform, switching to another provider of platform often becomes so expensive, difficult and time-consuming that enterprises instead remain locked in to their existing providers, reducing bargaining power and, therefore, the ability to seek greater value.”

Granted, the article does not even mention the legal sector or Predictive Coding; there is, however, much upon which to contemplate. The key words in this passage are ‘lock-in’ – an all too common phenomenon in IT. Everyone becomes so focused on a particular technology that when something new and better comes along, they are incapable of taking advantage of it. For example, see the Windows vs. Apple vs. Linux conundrum: Windows won hands down, since business selected that platform over the technologically superior Apple (I’ll reserve judgment on Linux Luddites for a later essay). This is what I fear regarding Predictive Coding in the legal space. Given the plethora of comments, articles, essays and the odd judicial ruling, it appears the trend is coming true.

For example, consider Marisa Peacock’s CMS post from January 23, 2012:

There’s no doubt that predictive coding is the next big thing in e-Discovery. While predictive coding aims to code, organize, and prioritize entire sets of electronically stored information (ESI) according to their relation to discovery responsiveness, privilege and designated issues before and during the legal discovery process, many e-Discovery vendors have been working hard to offer products and service that offer predictive analytical solutions.2

Adding fuel to the fire are the famous Gartner Magic Quadrant reports. The first such report discussing Electronic Discovery came out in 2011. Of the companies found in the “Leaders” quadrant, all were utilizing Predictive Coding (PC) technology. The 2012 “Leaders” quadrant saw small changes (mostly due to purchases by big corporations), but its members remain dependent upon PC toolsets.

The danger I see is this: if large corporations (e.g., Hewlett-Packard, Symantec, IBM) are jumping on the PC bandwagon and promoting their wares, what may be in store for the future?

  1. How easy will it be for those same companies to adapt to newer ways or will the Technology Assisted Review (TAR) sector be permanently (or at least for a long spell) hampered by PC centric technology?
  2. Can the existing technology scale up as the volume of Electronically Stored Information (ESI) grows?
  3. Even if it can, what will be the true cost as Defendants shower Plaintiffs with gigabyte range ESI files?
  4. In what manner will the PC process be governed to produce relevant documents in a repeatable fashion?

I would gladly accept your thoughts on this subject. I, for one, do not want to see this sector get distracted by the bright, shiny PC object and get locked into something that may become more of a monster.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”3

1Unlocking the Enterprise Cloud – How OpenStack™ Eliminates Cloud Lock-in, Rackspace white paper downloaded from on 30 May 2012

2“Equivio, Clearwell Add Predictive Coding for e-Discovery #ltny”, Downloaded from on 30 May 2012

3 “JABBERWOCKY” by Lewis Carroll (from Through the Looking-Glass and What Alice Found There, 1872)

Assailant suffers injuries from fall


Orville Smith, a store manager from Best Buy in Augusta, Ga., told police he observed a male customer, later identified as Tyrone Jackson of Augusta, on surveillance cameras putting a laptop computer under his jacket. When confronted the man became irate, knocked down an employee, drew a knife and ran for the door.

Outside on the sidewalk were four Marines collecting toys for the Toys for Tots program. Smith said the Marines stopped the man, but he stabbed one of the Marines, Cpl. Phillip Duggan, in the back; the injury did not appear to be severe.

After police and an ambulance arrived at the scene, Cpl. Duggan was transported for treatment.

“The subject was also transported to the local hospital with two broken arms, a broken ankle, a broken leg, several missing teeth, possible broken ribs, multiple contusions, assorted lacerations, a broken nose and a broken jaw … injuries he sustained when he slipped and fell off the curb after stabbing the Marine” according to a police report.