Corpus Linguistics in the Supreme Court

I’ve long been interested in this subject, and was particularly pleased to have Justice Thomas Lee of the Utah Supreme Court and Stephen Mouritsen guest-blogging in 2017 about their groundbreaking work on the subject. The subject came up in yesterday’s argument with regard to Prof. James Phillips’ and Prof. Jesse Egbert’s forthcoming article, A Corpus Linguistic Analysis of ‘Foreign Tribunal’, so I’m very glad to be able to pass along this item from Prof. Phillips:

Yesterday in oral argument in ZF Automotive US, Inc. v. Luxshare, LTD, the Supreme Court discussed a paper we recently wrote. In that paper we performed corpus linguistic analysis to see how the term “foreign tribunal” was used around the time it was inserted into the statutory provision at issue in the case. And a couple of the justices expressed uncertainty about relying on our findings.

Chief Justice Roberts conceded, “I don’t quite know what to make of that. That’s … something new. I mean, have we relied on that source before?” In response, counsel answered that the Court had engaged in this type of methodology in a case called Muscarello, where the majority opinion surveyed the use of the verb “carry” in New York Times articles. To which the Chief Justice asked, “[H]ave I ever done that before?”

Counsel replied that the Chief Justice’s opinion in AT&T likewise used this type of methodology. Justice Barrett then stated that “the Court has never used the Corpus Linguistics database before.” She noted that two lower courts have (the Sixth Circuit and the Utah Supreme Court) but repeated that “this Court has not.” And she described what the Court did in Muscarello and in the Chief Justice’s opinion in AT&T as “a more informal survey.”

We have several responses to this colloquy (and we note petitioners’ counsel did a good job describing and defending corpus linguistics).

First, while it’s true that the Chief Justice has never relied on one of the specific “sources” we did, and that the Court has not used “the Corpus Linguistics database before” in a majority opinion, we think that is somewhat like worrying about relying on briefing based on cases found in LexisNexis because the Court has always done its research in Westlaw. The underlying data in these databases of texts, or corpora, comes from some of the very documents the Court has looked at in the past. For example, the Corpus of Historical American English (COHA), hosted by Brigham Young University, includes articles from the New York Times, the very articles the Court was content to rely on in its more “informal” corpus linguistic analysis in Muscarello. We think a New York Times article can be instructive in the search for meaning when one searches a corpus and finds that article just as much as when one finds that article on the New York Times’ website.

Second, and relatedly, as the Court has been willing to conduct “a more informal survey,” a more rigorous one needn’t be viewed with uncertainty. Otherwise, that is akin to saying one is fine with asking a handful of one’s neighbors whom they will vote for for President and inferring from that who will win the election, while having serious doubts about a large, random, national sample of prospective voters.

The reality is, the Court has long sampled texts of ordinary and legal language use to try and get a sense of how a word or phrase was being used in a particular time period. From Justice Thomas looking at the Federalist Papers to Justice Ginsburg turning to poetry to Justice Kagan citing Dr. Seuss, justices have been willing to use this type of methodology. We just think that the size and representativeness of the samples of texts looked at in the past make it difficult to confidently generalize to the group of people or type of language the Court is interested in, such as the language of ordinary people. Fortunately, corpus linguistics can provide us with the language samples (corpora) and empirical methods (corpus analysis) one needs to be confident about such generalizations.

What is more, we looked at five sources in our study. One was what Justice Barrett referred to as a “Corpus Linguistics database,” COHA, but we got very little data from this corpus. While a second was a corpus, BYU Law School’s Corpus of Supreme Court Opinions of the United States (COSCO-US), it simply consists of Supreme Court opinions. The exact same search and the exact same analysis, where the search results were just read in context, could have been done in Westlaw. The other three databases were ones the Court has often searched in the past: Westlaw (for federal court opinions), HeinOnline’s Core US Journals (for law review articles), and HeinOnline’s US Code. So if the Court wants to ignore the small bit of our analysis from COHA because a majority opinion has never cited to COHA before, that’s one thing. But we don’t see why it would ignore the analysis from sources it regularly relies on. And our analysis from those other four sources familiar to the Court is sufficiently clear regarding the meaning and use of “foreign tribunal.”

Further, while neither Chief Justice Roberts nor a majority opinion has ever cited to one of the corpora of language use or scholarship that uses them, other individual members of the Court have. Justice Thomas used BYU Law School’s Corpus of Founding-Era American English (COFEA) in his dissent in Carpenter v. United States and cited corpus-based scholarship in another case. And Justice Alito has twice cited to scholarly articles that rely on one of these corpora in separate opinions he’s written.

Probably more than the specific source, there may be questions over the methodology. But as noted above, the Court has long been doing “informal” corpus linguistics without calling it that. And what we did could be replicated in chambers.

For instance, the Chief Justice could have one of his clerks look for the 100 uses of the term “foreign tribunal” by the Supreme Court in Westlaw just prior to the enactment of the statutory language in 1964. He could then have two clerks independently read each instance and determine whether the narrower, government-authority sense or the broader, private/non-government authority sense was being used. He could then compare how often the clerks agreed and what percentage of the uses fell under each sense.
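
For readers who want to see what that comparison might look like in practice, here is a minimal sketch in Python. It is not the code or the statistic from our paper: the function names, the sense labels "narrow" and "broad", and the ten sample codings are all hypothetical, and Cohen's kappa is included only as one common way to correct raw agreement for chance.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of instances on which the two coders chose the same sense."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of uses")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: probability both coders pick the same label at random,
    # given how often each coder used each label.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


def sense_shares(labels):
    """Proportion of instances assigned to each sense by one coder."""
    counts = Counter(labels)
    return {sense: count / len(labels) for sense, count in counts.items()}


# Hypothetical codings of ten uses of "foreign tribunal" (not real data).
clerk_a = ["narrow"] * 7 + ["broad"] * 3
clerk_b = ["narrow"] * 6 + ["broad"] * 4

print(f"Raw agreement: {percent_agreement(clerk_a, clerk_b):.2f}")
print(f"Cohen's kappa: {cohens_kappa(clerk_a, clerk_b):.2f}")
print(f"Clerk A's sense shares: {sense_shares(clerk_a)}")
```

With real codings, the two lists would simply hold each clerk's label for the same 100 uses, in the same order.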

It is no wonder, then, that corpus linguistics has been called “Westlaw on steroids.” And the Chief Justice could repeat that same analysis in the US Code, Supreme Court opinions, and law reviews. Additionally, the rest of our analysis can likewise be replicated from our instructions and appendices.

Additionally, some kind of corpus linguistic analysis is much more common in the lower courts than just the two courts named by Justice Barrett, though we recognize she was not trying to provide an exhaustive list. So far we count about three dozen opinions in 22 distinct lower courts, including six US Courts of Appeals, six district courts, and four state supreme courts.

Finally, respondent’s counsel attacks our study. He says it’s self-published, but it will be published by the Virginia Law Review Online, and this Court has previously cited articles on SSRN before they were officially published by a journal. He claims it is full of gaps, but never describes what those are. He says “it’s inconsistent whether there were two or three coders,” but it’s not clear that matters, and it’s actually clear that for each analysis two coders looked at the material (not always the same two coders).

And he claims that “all it ends up doing is establishing that the phrase didn’t really have a meaning as of 1964. They only were able to come up with a couple of hundred usages ever.” Both statements are false. As our study made clear, the phrase did have a dominant meaning as of 1964: the narrow, government-authority sense. And we only analyzed 259 uses because when we found hundreds in a specific corpus or database we sampled those uses closer in time to 1964, the year the term “foreign tribunal” was adopted in the statute in question. As we spelled out in the paper, we found thousands of uses of the term, but focused on the more chronologically relevant ones. We would note, 259 uses of “foreign tribunal” is exponentially more than respondents (or petitioners) put forth or than the Court relies on when it has performed more “informal survey[s].”
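
To illustrate the idea of favoring the chronologically relevant uses, here is a rough Python sketch. It is an assumption about how such a selection could be done, not the procedure from our paper: the function name, the (year, excerpt) layout, and the sample hits are hypothetical.

```python
def sample_closest_in_time(hits, target_year, k):
    """Return the k hits whose year is nearest to target_year.

    hits is a list of (year, excerpt) pairs. Ties in distance are broken
    in favor of the earlier year so the result is deterministic.
    """
    ranked = sorted(hits, key=lambda hit: (abs(hit[0] - target_year), hit[0]))
    return ranked[:k]


# Hypothetical search results (placeholders, not real excerpts).
hits = [
    (1950, "use A"),
    (1963, "use B"),
    (1900, "use C"),
    (1965, "use D"),
    (1964, "use E"),
]

for year, excerpt in sample_closest_in_time(hits, target_year=1964, k=3):
    print(year, excerpt)
```

A study limited to uses before enactment would filter out later years first and then apply the same ranking.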

Corpus linguistics can do what dictionaries cannot: namely, analyze words and phrases and show which meaning is probable in a given context. Just as the Court and the legal world moved on from paper copies of reporters to searching online databases of cases, we think the Court should supplement its dictionary use with corpus linguistic analysis, especially when presented by a party. The Court will then be able to have increased confidence that it’s uncovering the ordinary (or legal) meaning of terms used in legal texts.

