Data collection[1] can be technically rigorous and complex because it involves extracting potentially relevant electronically stored information (ESI) from native sources for processing, review, and production. Recognizing this complexity, an effective collection strategy likely involves both legal and IT professionals.

Once your team is established, decide what, among the materials preserved, should be collected.[2] Two guidelines are worth keeping in mind when beginning the collection process. First, avoid over-collecting. While it may be tempting to identify a relevant custodian and copy their entire computer or email folders, more data collected means more data processed and ultimately reviewed, all of which adds time and money to eDiscovery. Instead, consider preserving broadly but targeting collections so that only the ESI most likely to be relevant is collected. Second, consider tiering collections. A tiered collection strategy prioritizes data so that the most relevant data is collected immediately and less relevant data is collected only if it is needed.

There are a variety of ways entities may approach the collection process depending on considerations such as: the volume of data involved in the legal matter; the type, number, and accessibility of data sources implicated; whether there are time constraints (production deadlines, retention schedules, etc.); and whether the matter is civil or criminal (which informs the decision about whether a forensic or logical copy is necessary).[3]

The mechanics of collection often differ among cases, too. Although clients often offer or prefer to collect their own data (“self-collection”), courts generally discourage the practice except in limited scenarios because it potentially puts both the client’s interests and the lawyer’s practice at risk. For example, a client may not fully understand the scope of their legal and ethical responsibilities, resulting in over-collection, or may under-collect because of self-incrimination concerns or conflicts of interest. One judge stated that the practice “greatly troubles and concerns the Court.” EEOC v. M1 5100 Corp., d/b/a Jumbo Supermarket, Inc., Civil No. 19-cv-81320 (S.D. Fla. July 2, 2020). Other judges have long criticized the practice as well. The Hon. Shira Scheindlin, for instance, opined against allowing custodians of ESI to collect their own data, stating that “[s]earching for an answer on Google (or Westlaw or Lexis) is very different from searching for all responsive documents in the FOIA or e-discovery context…” and that “most custodians cannot be ‘trusted’” to effectuate a legally sufficient collection. Nat’l Day Laborer Org. Network v. United States Immigration & Customs Enforcement Agency, No. 10 Civ. 3488 (SAS), 877 F. Supp. 2d 87 (S.D.N.Y. 2012).

For these very reasons, attorneys have an obligation to be active participants in the collection process. Indeed, “attorneys have a duty to oversee their client’s collection of information and documents, especially when ESI is involved” because “self-collection by a layperson is highly problematic and raises a real risk that data could be destroyed or corrupted.” The M1 5100 court warned that counsel must take an active role in assisting their clients with the search, collection, and production of ESI because counsel is required to ensure that the production is complete and correct at the time it is made (FRCP 26(g)).[4]

Data Processing

Once the data is collected, it must be processed for attorney review. Without proper processing, a collection is an undifferentiated mix of file types and formats, attachments, useless system files, and duplicates. Processing, which is highly technical, allows the ESI to be culled, searched, and then reviewed by attorneys (review being the costliest of the ESI steps)[5] for production.
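As a rough illustration of the culling described above, the following Python sketch drops exact duplicates by comparing SHA-256 hashes of file contents and filters out files with system-type extensions. The extension list, the function name `cull`, and the sample files are all hypothetical, chosen only for illustration. Commercial processing platforms typically do far more (for example, handling email attachments and near-duplicates), so this is a minimal sketch under those assumptions, not a description of any particular tool.

```python
import hashlib

# Hypothetical set of extensions treated as system files in this sketch.
SYSTEM_EXTENSIONS = {".dll", ".sys", ".exe", ".tmp"}


def cull(files):
    """Split (name, content_bytes) pairs into kept names and dropped items.

    A file is dropped if its extension is in SYSTEM_EXTENSIONS or if a file
    with identical content (same SHA-256 digest) has already been kept.
    """
    seen_digests = set()
    kept, dropped = [], []
    for name, content in files:
        # Take the text after the last dot as the extension, if any.
        ext = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
        digest = hashlib.sha256(content).hexdigest()
        if ext in SYSTEM_EXTENSIONS:
            dropped.append((name, "system file"))
        elif digest in seen_digests:
            dropped.append((name, "duplicate"))
        else:
            seen_digests.add(digest)
            kept.append(name)
    return kept, dropped


# Hypothetical collected files: one exact duplicate and one system file.
collected = [
    ("memo.docx", b"Q3 pricing memo"),
    ("memo_copy.docx", b"Q3 pricing memo"),
    ("kernel32.dll", b"\x00\x01"),
    ("notes.txt", b"call notes"),
]

kept, dropped = cull(collected)
print("Kept for review:", kept)
print("Culled:", dropped)
```

Recording the reason each file was dropped, as the sketch does, reflects the general point that culling decisions should be explainable if they are later questioned.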


The eDiscovery review process involves reviewing and analyzing the collected documents to determine whether each document is responsive and whether it is privileged or otherwise protected. Review is closely tied to case strategy and to understanding the litigation “story.” It is often conducted by a group of people, commonly referred to as a review team, comprising attorneys at different levels and other legal professionals. There are several considerations to keep in mind when thinking about review:

  • an effective review requires a deep understanding of the issues underlying the case;
  • consistency across reviewers is critical;
  • accuracy and thoroughness are important, as is speed (and so, ways to improve efficiency should be considered);
  • a quality-control step should be built into the workflow, as even the most experienced reviewers may make mistakes; and
  • a thorough and experienced team of reviewers should take measures to protect privileged documents and avoid inadvertent disclosure.

Because protecting privileged documents is so important, privilege reviews are handled with the utmost care and are usually conducted by more senior attorneys, who typically have more experience reviewing documents and know how to spot protected information and log it correctly for legal purposes.

Improving the Review Process

When confronted with a large volume of documents to review, how can one improve review speed and accuracy? Some considerations include: defensibly controlling the amount of data collected; leveraging proportionality, that is, understanding potential data volumes and attendant review costs before review begins in order to expose disproportionate discovery requests and place the onus on the other side to demonstrate why the review is necessary; and leveraging technology, including clustering, threading, search terms, and technology-assisted review (such as predictive coding that is sampled and refined).[6]
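To make the proportionality point concrete, the back-of-the-envelope Python sketch below estimates review cost from data volume. Every input (documents per gigabyte, review speed, hourly rate, and the two collection sizes) is a hypothetical placeholder rather than an industry benchmark, and the function name `estimate_review_cost` is invented for this example. Actual figures vary widely by matter and should come from your own data.

```python
def estimate_review_cost(gigabytes, docs_per_gb, docs_per_hour, hourly_rate):
    """Estimate document count, review hours, and review cost.

    All parameters are assumptions supplied by the caller:
      gigabytes      size of the data set to be reviewed
      docs_per_gb    assumed average number of documents per gigabyte
      docs_per_hour  assumed reviewer throughput
      hourly_rate    assumed cost of one reviewer hour
    """
    documents = gigabytes * docs_per_gb
    hours = documents / docs_per_hour
    cost = hours * hourly_rate
    return documents, hours, cost


# Hypothetical comparison: a broad collection versus a targeted one.
assumptions = dict(docs_per_gb=5000, docs_per_hour=50, hourly_rate=60)
broad = estimate_review_cost(gigabytes=100, **assumptions)
targeted = estimate_review_cost(gigabytes=20, **assumptions)

for label, (documents, hours, cost) in (("Broad", broad), ("Targeted", targeted)):
    print(f"{label}: {documents:,} documents, {hours:,.0f} hours, ${cost:,.0f}")
```

Even a crude estimate of this kind can help frame a proportionality discussion, because it ties a requested data volume to an approximate number of review hours and dollars.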


The final output of the review, the set of documents deemed responsive to the legal matter and not withheld as privileged or otherwise protected, must be produced to the other side. Years ago, productions were mostly a hardcopy exercise that involved sending boxes of documents to the opposing side. Today, the production process is more complicated. Reflecting the complexities of producing documents in the digital age, the Federal Rules of Civil Procedure and several state court rules require litigants, when the request does not specify a form, to produce ESI in the form in which it is ordinarily maintained (often referred to as “native format”) or in a “reasonably usable form.” This means, for example, that email generally should not be converted to hardcopy for production. Standard non-native production formats include the Tagged Image File Format (TIFF) and Portable Document Format (PDF). Additionally, most productions involve the exchange of specified metadata fields (i.e., data about the data), are branded with a Bates stamp, and are designated confidential or redacted as necessary. Once these various steps are complete, your documents should be production-ready.
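The Bates-stamping step mentioned above can be sketched in a few lines of Python. The example assumes the common convention of one sequential number per page, a fixed prefix, and zero-padded numbering; the prefix "ABC", the seven-digit width, and the function name `bates_ranges` are hypothetical choices for illustration, and parties often agree on their own format.

```python
def bates_ranges(prefix, start, page_counts, width=7):
    """Assign a (first, last) Bates number to each document.

    prefix       assumed party identifier placed before the number
    start        number assigned to the first page of the first document
    page_counts  number of pages in each document, in production order
    width        number of digits, padded with leading zeros
    """
    ranges = []
    next_number = start
    for pages in page_counts:
        first = f"{prefix}{next_number:0{width}d}"
        last = f"{prefix}{next_number + pages - 1:0{width}d}"
        ranges.append((first, last))
        # The next document begins on the page after this one ends.
        next_number += pages
    return ranges


# Hypothetical production of three documents with 3, 1, and 12 pages.
ranges = bates_ranges("ABC", 1, [3, 1, 12])
for first, last in ranges:
    print(first if first == last else f"{first} - {last}")
```

Because the numbering is continuous across documents, every produced page carries a unique identifier that both sides can cite.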

[1] Collection should not be conflated with preservation. While some may be tempted to collect ESI to preserve it, doing so may be a costly and inefficient exercise. Think of preservation in terms of ensuring potentially relevant data is not deleted. Collection, on the other hand, is the first tangible step towards producing documents to your adversary.

[2] Different data sources have different levels of accessibility and present different collection challenges. Categories of ESI that might need to be collected for eDiscovery include: active data (data one interacts with regularly, such as email); cloud data (data created and stored on cloud servers); mobile data (data from mobile devices, including call logs, text messages, instant messaging, geolocation data, and other application content); offline data (inactive but stored or archived data); backup tapes/data sources (systems designed to store data in the event it must be restored); and hidden data (previously deleted or fragmented files that exist on various systems and are usually not readily visible to regular system users).

[3] At the most basic level, a forensic image is a complete copy of a drive, including areas not allocated to active files (unallocated space) and the unused remainder of the space allocated to a file (slack space). It is essentially an exact duplicate of the original drive, so a forensic image may allow data from previously deleted files to be recovered. A logical copy, on the other hand, is a copy of the active contents of the computer’s directories. Often, a logical copy is sufficient unless data tampering is suspected or previously deleted files are at the center of the controversy.

[4] Specifically, “the attorney’s signature on a discovery response ‘certifies that the lawyer has made a reasonable effort to assure that the client has provided all the information…responsive to the discovery demand’ and has made ‘reasonable inquiry into the factual basis of his response.’” M1 5100 Corp., 2020 U.S. Dist. LEXIS 117243, at *2 (S.D. Fla. July 2, 2020). According to the Court, because counsel cloaked the client with unfettered discretion to, among other things, search for and collect documents, counsel failed to exercise the requisite supervision. The Court further noted as “very problematic” the fact that counsel “signed off on the completeness and correctness of his client’s discovery responses” when the attorney exercised neither supervision nor involvement in the process.

An attorney’s obligation to be actively involved is not new. Indeed, one of the early court opinions addressing eDiscovery concerns was Zubulake v. UBS Warburg LLC, 229 F.R.D. 422 (S.D.N.Y. 2004) (“Zubulake V”), in which Judge Scheindlin stated that counsel “must oversee compliance with the litigation hold, monitoring the party’s efforts to retain and produce the relevant documents.” Judge Scheindlin also commented in Nat’l Day Laborer Org. Network v. United States Immigration & Customs Enforcement Agency, 877 F. Supp. 2d 87 (S.D.N.Y. 2012), that custodians cannot be “trusted to run effective searches of their own files” without proper assistance of counsel.

[5] Attorney review generally accounts for between 70% and 80% of the total eDiscovery costs in a typical case.

[6] Federal Magistrate Judge Andrew Peck’s decision in Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012) is considered the first official judicial endorsement of predictive coding as a way to review documents. Today, most judges agree that predictive coding has a well-established place in eDiscovery. In Rio Tinto Plc v. Vale S.A. (S.D.N.Y. 2015), Judge Peck encouraged parties to be transparent about whether and how they use predictive coding.