When I first started out in litigation support, I spent a lot of time fixing and updating data load files. Back in the early 2000s, every photocopy place was claiming to be an expert in document scanning and eDiscovery, so we ended up with a lot of incorrect or messy load files. It was an “interesting” time.
Of course, despite the mess and the weird formatting, data load files are really just niche-specific delimited text files. So what is a delimited text file?
A delimited text file is really just a plain text file with markers (delimiters) to help identify how the data separates into proper records and fields.
Within any proper database or spreadsheet system (like the Excel table in the picture above), we can very easily distinguish the various pieces of information shown. We know where each column starts and stops, we can identify each record or row, and each field is easily isolated. The data is organized and properly accessible.
But what if I want to take the above data out of Excel and put it into another system, like Concordance or some other document review tool? Each software platform has its own setup and quirks. Its display options, the way it functions, the way the data is indexed, all of that is different from platform to platform. We can’t just shove Excel into Concordance or vice-versa because the two simply won’t understand each other.
But plain text is universal. We can convert data into plain text and load it into another platform, because almost everything recognizes text.
The trouble is, if we were to take the text out of Excel (or any other software), we lose the organization that the original software provided and end up with something like the below:
We, along with whatever new system the data will be imported into, lose the ability to distinguish the individual records or columns. The information is now a jumbled mess, and unorganized data is essentially useless. So, we need a way to help delineate records and fields using just plain text. Those markers are delimiters.
The simplest and most common delimited text file is the CSV (Comma-Separated Values) file, which uses the comma ( , ) as a delimiter and the double quote ( " ) as a text qualifier, like so:
Here we can clearly see the purpose of the comma and the quotes. The comma separates each field value, and the quotes tell us that even though a value contains multiple words and spaces, the entire phrase belongs in a single field. Thus:
"BIGBANK_00001677","BIGBANK_00001680","111.msg","Kevin Lockhart <BigBank>","Doug Stevens <DStevens@bigbank.com>"
The quotes ( " ) mean that whatever sits between them belongs together in the same field, and the comma means that whatever comes next belongs to a new field. This allows the information to be properly organized and imported into a new system.
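To make the roles of the delimiter and the text qualifier concrete, here is a minimal Python sketch that reads the first four values of the record above with the standard csv module. The field names (BegDoc, EndDoc, FileName, From) are hypothetical labels added for illustration; they are not part of the sample data.

```python
import csv
import io

# The first four values of the sample record, in CSV form.
line = '"BIGBANK_00001677","BIGBANK_00001680","111.msg","Kevin Lockhart <BigBank>"'

# Hypothetical field names, for illustration only.
field_names = ["BegDoc", "EndDoc", "FileName", "From"]

# The comma is the delimiter; the double quote is the text qualifier.
reader = csv.reader(io.StringIO(line), delimiter=",", quotechar='"')
fields = next(reader)

# Pair each value with its field name to rebuild one organized record.
record = dict(zip(field_names, fields))

for name, value in record.items():
    print(f"{name}: {value}")
```

Note that "Kevin Lockhart <BigBank>" comes back as one value, spaces and all, because the qualifiers mark where the field starts and stops.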
The issue, of course, is that commas ( , ) and quotes ( " ) are very prevalent in the documents we handle in eDiscovery. If we used commas and quotes as delimiters, most database systems would be confused by all the extra instances found in letters, reports, emails, etc. We need delimiters that rarely, if ever, occur naturally “in the wild.”
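A small Python sketch of that failure mode, using an invented sentence and a made-up document ID (DOC001): when body text containing commas is written out without qualifiers, splitting on the comma produces more fields than the record actually has.

```python
import csv
import io

# Invented sample values, for illustration only.
doc_id = "DOC001"
body = "Dear Doug, please review the attached report, then call me."

# A sloppy load file: values joined with commas and no text qualifiers.
sloppy_line = doc_id + "," + body
naive_fields = sloppy_line.split(",")
print(len(naive_fields))  # 4 pieces, although the record has only 2 fields

# A careful load file: every value wrapped in quotes before writing.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow([doc_id, body])
careful_fields = next(csv.reader(io.StringIO(buffer.getvalue())))
print(len(careful_fields))  # 2 fields, the body text intact
```

The second half only works because the writer quoted every value consistently. A load file produced without that care cannot be repaired by the parser alone, which is one reason delimiters that almost never show up in ordinary text are attractive.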
In the late 80s, when Concordance was basically the only document review tool around, the standard delimiters introduced to the legal technology space were:
¶ ASCII Code (020) as the comma separator
þ ASCII Code (254) as the quote text qualifier
® ASCII Code (174) as the new line indicator
(The code numbers are also how you can type those characters on Windows: hold down the ALT key and type the number on the numeric keypad. ¶ is ALT+20, while þ and ® generally need a leading zero: ALT+0254 and ALT+0174.)
Because these 3 characters are highly unlikely to appear naturally within business documents, they serve as much better delimiters and qualifiers than characters like commas or quotes.
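As a rough sketch of how those three characters work together, the Python below builds a tiny two-line load file in memory and parses it with the standard csv module, swapping in code 20 as the field separator and þ as the text qualifier. The header names (BEGDOC, ENDDOC, NOTE) and the note text are invented for illustration, and a real file would also need to be opened with the correct encoding, which this sketch sidesteps by using an in-memory string.

```python
import csv
import io

FIELD_SEP = "\x14"  # code 20, the separator shown as ¶ in the list above
QUALIFIER = "\xfe"  # code 254, þ, wraps each field value
NEWLINE = "\xae"    # code 174, ®, stands in for a line break inside a field


def make_line(values):
    """Wrap each value in the qualifier and join with the field separator."""
    return FIELD_SEP.join(QUALIFIER + value + QUALIFIER for value in values)


# Invented header names and note text; the document numbers are from the CSV example.
note = 'Line one, with a comma' + NEWLINE + 'Line two with "quotes"'
sample = "\n".join([
    make_line(["BEGDOC", "ENDDOC", "NOTE"]),
    make_line(["BIGBANK_00001677", "BIGBANK_00001680", note]),
])

reader = csv.reader(io.StringIO(sample), delimiter=FIELD_SEP, quotechar=QUALIFIER)

# Restore real line breaks wherever the ® placeholder was used.
rows = [[value.replace(NEWLINE, "\n") for value in row] for row in reader]
header, records = rows[0], rows[1:]

for record in records:
    print(dict(zip(header, record)))
```

The commas and double quotes inside the note pass through untouched, since neither character has any special meaning to the parser here.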
Even as eDiscovery has evolved and more sophisticated document review database systems have appeared on the market, most systems still recognize the Concordance delimiters.