Hi! It's been a while. I lost motivation for a bit there, and moved to a different state, but I won't keep boring you with all that. Suffice it to say that hopefully I will start posting more regularly now as I head into my PhD!
Now, this post will be about what I learned and created this summer during my internship at Stottler Henke (SHAI). In general, I don't plan on discussing such (gasp) real-world applicable things as much going forward on this blog, in case anyone was concerned.
Now, let's talk about plagiarism. Here's the setup: we have a (very) large corpus (say, Wikipedia) of $N$ "sensitive" or "source" documents, which may or may not have been plagiarized by our query document. When we say plagiarism, we don't necessarily mean the whole document -- it could just be that a key sentence was copied, or a paragraph with crucial information was summarized at some (potentially random) point within the query document. Now, for exact matches, this is a relatively straightforward task, although it takes some clever data structures (which I won't discuss in this post) to get the runtime to depend only (linearly) on the length $m$ of the query document, rather than on $N$ (the number of source documents we have to search through). This is what had already been accomplished when I arrived for my internship, and my task was to see what might be possible when our job was not just to find exact matches, but also "fuzzy" matches. After all, a smart plagiarizer would try and hide their illicit behavior. But what does "fuzzy" even mean? So glad you asked, and if you have a good definition please send it my way.
Nevertheless, onwards. The PAN Conference seemed as good a starting place as any: a conference whose central theme in 2011-2015 was plagiarism detection! Now, this seemed to be more of a competition and less of a thoroughly peer-reviewed venue, but nonetheless there were some interesting ideas floating around there, in case you want to see other approaches beyond what I will talk about in this post. In addition, they had a nice (albeit small) dataset and evaluation framework (which were a bit difficult to find, but not impossible) that allow for an objective scoring and evaluation of any algorithm one might come up with! Beyond the dataset (and floating ideas), and perhaps most importantly, they also had a very nice conceptual framework for plagiarism detection, which I will now describe.
The framework they describe splits plagiarism detection into two main tasks:

1. Source retrieval: given the query document, use a search engine over the corpus to retrieve a small set of candidate source documents it might have plagiarized from.
2. Text alignment: given the query document and one candidate source document, find all pairs of passages (one in each document) that match, whether exactly or fuzzily.
I think some discussion of the source retrieval step is warranted here, but I will try and keep this section brief and focus more on the algorithms I think are interesting for the text alignment phase.
Basically, we just need to figure out how to feed our query document to whatever search engine we are using. The search engine I was tasked with using had a hard limit on how many words a query could contain, which is part of what makes this slightly tricky. One natural family of solutions is to split the query document into chunks and boil each chunk down to a short keyword query that fits under the limit.
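To make that concrete, here's a minimal sketch of what such a chunk-and-keyword scheme could look like. To be clear, this is my own illustrative version, not necessarily what we actually used: the chunk size, query length limit, and the crude frequency-based keyword scoring are all placeholder choices.

```python
from collections import Counter

def make_queries(query_doc_words, doc_freq, chunk_size=500, max_query_words=10):
    """Split the query document into chunks and turn each chunk into a short
    keyword query that fits under the search engine's word limit.

    query_doc_words: list of (lowercased) words in the query document
    doc_freq: dict mapping word -> number of corpus documents containing it
    """
    queries = []
    for start in range(0, len(query_doc_words), chunk_size):
        chunk = query_doc_words[start:start + chunk_size]
        counts = Counter(chunk)
        # Rank words by a crude tf-idf-style score: frequent in the chunk,
        # rare in the corpus.
        ranked = sorted(counts, key=lambda w: counts[w] / (1 + doc_freq.get(w, 0)),
                        reverse=True)
        queries.append(" ".join(ranked[:max_query_words]))
    return queries
```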
I am basically going to briefly summarize what I think are the important takeaways from this paper by Wise. It's going to be brief, and I mean it this time. The setup is that we have some source document, and some query document, and we want to find exact matches between them efficiently.
The idea is super simple: take the longest exact match first, "tile" it (meaning mark it so that no later match is allowed to touch that area in the source or query document), and repeat until there are no more matches of at least the minimum match length $t$. That's it.
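To make that concrete, here is a deliberately naive sketch of the greedy tiling loop (the names and defaults are my own, and Wise's actual algorithm gets the same answer much faster using Running Karp-Rabin hashing):

```python
def greedy_tiling(query, source, min_match_len=3):
    """Naive Greedy String Tiling over two lists of words.

    Repeatedly find the longest common run of words that does not touch an
    already-tiled position, mark ("tile") it, and stop once the longest
    remaining run is shorter than min_match_len.
    Returns a list of (query_start, source_start, length) tiles.
    """
    q_marked = [False] * len(query)
    s_marked = [False] * len(source)
    tiles = []
    while True:
        best = None  # (length, query_start, source_start)
        for i in range(len(query)):
            for j in range(len(source)):
                length = 0
                while (i + length < len(query) and j + length < len(source)
                       and query[i + length] == source[j + length]
                       and not q_marked[i + length] and not s_marked[j + length]):
                    length += 1
                if best is None or length > best[0]:
                    best = (length, i, j)
        if best is None or best[0] < min_match_len:
            break
        length, i, j = best
        for k in range(length):   # tile the match so nothing overlaps it later
            q_marked[i + k] = True
            s_marked[j + k] = True
        tiles.append((i, j, length))
    return tiles
```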
This is probably my favorite algorithm of the bunch. Here's a Wikipedia article on it, if you're into that sort of thing, but I will do my best to make sure you don't need to reference it. The original paper is also all of 1.5 pages long, a real banger if you ask me.
My one-line description of this algorithm would be "the classic edit distance dynamic programming algorithm, but with user-defined scoring". Since we're talking dynamic programming, we fill in an $(m+1) \times (n+1)$ matrix $H$ of scores, where $m$ is the length of the query and $n$ is the length of the source. We reward matching words, and penalize mismatching words and any "gap", meaning skipping over a word in the query or source document: $$H_{i,j} = \max \begin{cases} 0 \\ H_{i-1,j-1} + \mathbf{score}(query\_words[i], source\_words[j]) \\ H_{i-1,j} + \mathbf{gap\_penalty} \\ H_{i, j-1} + \mathbf{gap\_penalty}. \end{cases} $$ (The $0$ case is what makes this a "local" alignment: scores never go negative, so a stretch of unrelated text can't drag down a match that starts later.) We then take matches by finding entries of the matrix above a certain score threshold $\beta$ and tracing back from them. Note that we do this on a word-by-word basis, as discussed above. Also note that the runtime is $O(mn)$, which is slow. I admit that when I initially saw this runtime, as a theoretician, I was like oh good it's polynomial! What a fast algorithm, this is perfect. Then I ran it on my laptop and realized why I usually stick to theory.
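Here is a short sketch of the matrix fill, written to take the score and gap penalty as parameters since we'll be swapping score functions below (the default values are placeholders of mine):

```python
def smith_waterman(query, source, score, gap_penalty=-0.3):
    """Fill in the Smith-Waterman matrix over two lists of words.

    score(w1, w2) -> float should reward matches and penalize mismatches;
    gap_penalty is the (negative) cost of leaving a word unmatched.
    Returns the (len(query)+1) x (len(source)+1) score matrix H.
    """
    m, n = len(query), len(source)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                0.0,                                                   # restart (local alignment)
                H[i - 1][j - 1] + score(query[i - 1], source[j - 1]),  # match / mismatch
                H[i - 1][j] + gap_penalty,                             # gap: query word unmatched
                H[i][j - 1] + gap_penalty,                             # gap: source word unmatched
            )
    return H
```

Reading off the actual matched passages then amounts to finding cells with $H_{i,j} \geq \beta$ and tracing back until the score hits zero, which I'll leave out of the sketch.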
Next, I claim that this algorithm, like the edit distance algorithm, is optimal! But what does that even mean here? And what are reasonable score and gap_penalty functions?? Great questions. The answer to the first question is rather dumb: the dynamic program finds the maximum-scoring alignment for whatever score/gap penalty functions you choose, so you could choose some nonsense functions and your algorithm will output nonsense. At least it will be optimal nonsense, I suppose. So let's talk about reasonable scoring and gap penalty functions.
An example of a really simple (and canonical) score function is the following: $\mathbf{score}(w_1, w_2) = 1$ if $w_1 = w_2$, otherwise $\mathbf{score}(w_1, w_2) = -p$. Usually $p$ will be in the range of roughly $(0.05, 0.5)$. Taking $p$ too large hardly allows any mismatches (aka "fuzziness"), while taking $p$ too small results in very fuzzy matches, to the point where a human would usually conclude that this is not a match. As for the gap penalty function, I think a good rule of thumb is to penalize gaps and mismatches the same, so $\mathbf{gap\_penalty} = -p$.
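In code, with the Smith-Waterman sketch above, this canonical score function is just the following (the specific $p = 0.3$ is an arbitrary mid-range choice):

```python
def simple_score(w1, w2, p=0.3):
    """Canonical score: reward exact matches, penalize mismatches by p."""
    return 1.0 if w1 == w2 else -p

# Used with the sketch above, penalizing gaps and mismatches the same:
# H = smith_waterman(query_words, source_words, simple_score, gap_penalty=-0.3)
```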
Now, I wasn't exactly satisfied with this score function, and so I had what I thought was a really cool idea: use a (pretrained) word embedding as a score function to capture semantic similarities! $$ \mathbf{score}(w_1, w_2) = \begin{cases} -p & \text{ if } \mathsf{cs}(E(w_1), E(w_2)) < \alpha \quad \text{(mismatch)}\\ 2\cdot (\mathsf{cs}(E(w_1), E(w_2)) -\alpha) & \text{ if } \mathsf{cs}(E(w_1), E(w_2)) \geq \alpha \quad \text{(match)} \end{cases} $$ where $\mathsf{cs}$ is the cosine similarity and $E$ is the pretrained word embedding. For concreteness, just consider $\alpha = 0.5$. What this score function does is, if it sees for example $w_1 = amazing$ and $w_2 = incredible$, it need not count the pair as a mismatch, and can instead assign a positive score! Otherwise, if we are below the threshold, it just applies the same constant penalty as before. (Note also that with $\alpha = 0.5$, an exact match still scores $2 \cdot (1 - 0.5) = 1$, just like the simple score function above.) I thought this was cool and would've been surprised if no one had done it before. Turns out, someone has done it before, but I couldn't find anything from before 2020. Here is what some postdoc did last year: link. It's basically the same thing, but he uses a slightly different score function! Anyway, I was pleased with how this score function seemed to work in practice.
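Here's a rough sketch of this embedding-based score, assuming you've loaded some pretrained embedding as a dict-like object mapping words to vectors (gensim's KeyedVectors would fit the bill); the fallback for out-of-vocabulary words is my own choice:

```python
import numpy as np

def make_embedding_score(embeddings, alpha=0.5, p=0.3):
    """Build a Smith-Waterman score function from a pretrained word embedding.

    embeddings: dict-like mapping word -> numpy vector.
    Words missing from the embedding fall back to plain exact-match scoring.
    """
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def score(w1, w2):
        if w1 not in embeddings or w2 not in embeddings:
            return 1.0 if w1 == w2 else -p          # out-of-vocabulary fallback
        cs = cosine(embeddings[w1], embeddings[w2])
        return 2 * (cs - alpha) if cs >= alpha else -p

    return score
```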
You're still here? Incredible. We are now going to do the (perhaps obvious) thing and simply combine Greedy Tiling and Smith-Waterman! In more detail, for each of the $k$ documents retrieved during source retrieval we do the following:

1. Run Greedy Tiling between the query document and the retrieved source document with a small minimum match length, so that even short exact matches become candidate "seeds".
2. For each seed tile, run Smith-Waterman on a window of surrounding context (in both documents) to try to extend the seed into a longer, possibly fuzzy, match.
3. Keep the extended match if its Smith-Waterman score clears the threshold $\beta$; otherwise discard the seed as a coincidence.
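Here's a sketch of that glue, reusing the hypothetical greedy_tiling and smith_waterman helpers from earlier; the window size and the threshold $\beta$ below are placeholder values:

```python
def find_fuzzy_matches(query, source, score, min_seed_len=2, window=50,
                       gap_penalty=-0.3, beta=5.0):
    """Seed with Greedy Tiling, then verify/extend each seed with Smith-Waterman.

    Reuses the greedy_tiling and smith_waterman sketches from above.
    Returns (query_start, source_start, seed_length, best_score) for kept seeds.
    """
    matches = []
    for (qi, sj, length) in greedy_tiling(query, source, min_seed_len):
        # Score the seed together with `window` words of context on each side.
        q_ctx = query[max(0, qi - window): qi + length + window]
        s_ctx = source[max(0, sj - window): sj + length + window]
        H = smith_waterman(q_ctx, s_ctx, score, gap_penalty)
        best = max(max(row) for row in H)
        if best >= beta:
            # Context agrees; keep it (tracing back from the best cell would
            # recover the full fuzzy span, omitted here).
            matches.append((qi, sj, length, best))
        # Otherwise: an incidental exact match (e.g. a common name); drop it.
    return matches
```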
Note also that, by design, the Greedy Tiling phase catches proper nouns/names with at least 2-3 words, but if the surrounding context is unrelated in the source and query document then Smith-Waterman will recognize this and discard the match. I think that's pretty nice! Finally, I'll note that this combination of Greedy Tiling and Smith-Waterman was at least partially inspired by the (apparently famous) BLAST algorithm from bioinformatics, which does something quite similar: find short exact "seed" matches, then extend each seed with a local alignment.
Finally, I'll note that this is just one way I thought was nice, efficient, and seemed to work well. There are doubtless many other text alignment techniques you could go for, and a rather simple option, if you already have a good exact match finder, is to replace each word with a set of synonyms (and count two words as matching if they are in the same synonym set). One could also try and use some neural network approach that "understands" semantic similarities between two passages, but I have the feeling this would have a significantly slower runtime.
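For what it's worth, here's roughly what the synonym idea could look like using NLTK's WordNet interface (this assumes nltk is installed with its wordnet data downloaded, and collapsing each word to its first synset is a crude choice on my part):

```python
from nltk.corpus import wordnet  # requires nltk.download('wordnet') beforehand

def canonicalize(word):
    """Map a word to a canonical token shared with its synonyms.

    Crude choice: use the name of the word's first WordNet synset, so words
    sharing that synset collapse to the same token; words unknown to WordNet
    are left as-is.
    """
    synsets = wordnet.synsets(word)
    return synsets[0].name() if synsets else word

# Then hand the canonicalized word sequences to your exact-match finder:
# canon_query  = [canonicalize(w) for w in query_words]
# canon_source = [canonicalize(w) for w in source_words]
```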
Now back to theory!