Minimizing Domain Bias When Adapting Sentiment Analysis Techniques to the Legal Domain

Gathika Ratnayaka
198051D

Thesis/Dissertation submitted in partial fulfillment of the requirements for the degree Master of Science in Computer Science and Engineering

Department of Computer Science & Engineering
University of Moratuwa
Sri Lanka

April 2022

DECLARATION

I, Gathika Ratnayaka, declare that this is my own work and this dissertation does not incorporate without acknowledgement any material previously submitted for a Degree or Diploma in any other University or institute of higher learning, and to the best of my knowledge and belief it does not contain any material previously published or written by another person except where the acknowledgement is made in the text. Also, I hereby grant to the University of Moratuwa the non-exclusive right to reproduce and distribute my dissertation, in whole or in part, in print, electronic or other medium. I retain the right to use this content in whole or part in future works (such as articles or books).

Signature: Date:

The above candidate has carried out research for the Masters thesis/dissertation under my supervision.

Name of Supervisor: Dr. Amal Shehan Perera
Signature of the Supervisor: Date:

Name of Supervisor: Dr. Nisansa de Silva
Signature of the Supervisor: Date:

ABSTRACT

Sentiment Analysis can be considered as an integral part of Natural Language Processing, with a wide variety of significant use cases related to different application domains. Analyzing sentiments of descriptions that are given in Legal Opinion Texts has the potential to be applied in several legal information extraction tasks such as predicting the judgement of a legal case, predicting the winning party of a legal case, and identifying contradictory opinions and statements. However, the lack of annotated datasets for legal sentiment analysis imposes a major challenge when developing automatic approaches for legal sentiment analysis using supervised learning. In this work, we demonstrate an effective approach to develop reliable sentiment annotators for the legal domain while utilizing a minimum amount of resources. In that regard, we made use of domain adaptation techniques based on transfer learning, where a dataset from a high resource source domain is adapted to the target domain (the legal opinion text domain). In this work, we have come up with a novel approach based on domain specific word representations to minimize the drawbacks that can be caused by the differences in language semantics between the source and target domains when adapting a dataset from a source domain to a target domain. This novel approach is based on the observations that were derived using several word representation and language modelling techniques that were trained using legal domain specific corpora. In order to evaluate different word representation techniques in the legal domain, we have prepared a legal domain specific context based verb similarity dataset named LeCoVe. The experiments carried out within this research work demonstrate that our approach to develop sentiment annotators for the legal domain in a low resource setting is successful, with promising results and significant improvements over existing works.

Keywords: Sentiment Analysis; Deep Learning; Word Representation; Semantic Analysis

ACKNOWLEDGEMENTS

First and foremost, I would like to express my sincere gratitude to my supervisors Dr. Amal Shehan Perera and Dr.
Nisansa de Silva for the continuous support, motivation, and valuable insights which were tremendously helpful for the successful completion of this work. This research project would not have been possible without their valuable support. I would also like to thank Dr. Uthayashanker Thayasivam and Dr. Charith Chithraranjan for their valuable feedback and advice related to the research. Moreover, I would like to extend my gratitude to Mr. Gayan Kaviratne, Mr. Anajana Fernando, Mr. Ramesh Pathirana, and Ms. Thirasara Ariyaratne for the support given during this research work. I am also grateful to my father Mr. Dhammika Ratnayaka, my mother Mrs. Geethani Udugamakorale, and my sister Ms. Akshila Ratnayaka for the continuous support given to me throughout this journey. Thank you!

LIST OF ABBREVIATIONS

AI Artificial Intelligence
ANN Artificial Neural Networks
BERT Bidirectional Encoder Representations from Transformers
CBOW Continuous Bag of Words
ELMO Embeddings from Language Models
NLP Natural Language Processing
POS Part-Of-Speech
RNN Recurrent Neural Network
RNTN Recursive Neural Tensor Network
SG Skip Gram

LIST OF TABLES

Table 4.1 Frequency Statistics of LeCoVe 19
Table 4.2 Sense2Vec Parameter Configurations 21
Table 4.3 Post training of BERT using criminal court case corpus 23
Table 4.4 Recall (R) and F-Score (F) received for different thresholds of considered Word2Vec/Sense2Vec models 24
Table 4.5 Recall (R) and F-Score (F) received for different thresholds of BERT based approaches 26
Table 4.6 Precision (P), Recall (R) and F-Measure (F) received by considering k most similar words predicted by models 27
Table 4.7 Precision (P), Recall (R) and F-Measure (F) received from different approaches based on BERT 28
Table 5.1 Evaluating the word lists generated from Algorithm 1 and Algorithm 2 41
Table 5.2 Precision (P), Recall (R) and F-Measure (F) obtained from the considered models 42

TABLE OF CONTENTS

Declaration of the Candidate & Supervisor i
Abstract ii
Acknowledgements iii
List of Abbreviations iv
List of Tables v
Table of Contents vi
1 Introduction 1
1.1 Background 1
1.2 Research Objectives 4
1.3 Contributions 5
1.4 Publications 5
2 Literature Survey 7
2.1 Sentiment Analysis 7
2.2 Sentiment Analysis in the Legal Domain 7
2.3 Word Vector Representations and Language Modelling Systems 8
2.4 Domain Adaptation 9
2.5 Evaluation Resources on Verb Similarity 10
3 Overall Methodology 13
3.1 Introduction 13
3.2 Overall Flow 13
4 Evaluating Word Representation Techniques Using Verb Similarity 16
4.1 Task Definition 16
4.2 Motivation 17
4.3 Dataset Preparation 18
4.4 Annotation of Verb Pairs 19
4.5 Experiments and Evaluations 20
4.5.1 Evaluation Resources 20
4.5.2 Evaluation of the distributional word representation models 23
4.5.3 Deriving Embeddings for Words using BERT 25
4.5.4 Evaluating models based on most similar words 26
4.5.5 Evaluating BERT models based on most similar words 27
4.5.6 Analysis of Results 28
4.6 Discussion 30
5 Developing a Legal Sentiment Annotator in a Low Resource Setting 31
5.1 Task Definition 31
5.2 Methodology 31
5.2.1 Detecting words that can cause negative transfer 31
5.2.2 Fine Tuning the Recursive Tensor Neural Network Model 39
5.2.3 BERT based Approach for Legal Sentiment Analysis 40
5.3 Experiments and Results 41
5.3.1 Identifying words with deviated sentiments across the source and target domains 41
5.3.2 Sentiment Classification 42
5.4 Discussion 44
6 Conclusion and Future work 46
References 48

Chapter 1
INTRODUCTION

1.1 Background

Law and order
are an integral part of human civilization. Legal systems have evolved over centuries in order to match the emerging requirements of human civilizations. As a result, the accessibility of resources related to the legal domain is becoming more and more important. The World Wide Web enabled humans to make publishable legal resources easily accessible by digitalising them and publishing them on the internet. With the emergence of Artificial Intelligence related technologies such as machine learning and deep learning, it can be seen that there is an emerging trend to develop more sophisticated applications that can organize and extract valuable legal information in a useful manner with minimum human intervention.

A given Legal Opinion Text may contain information which is potentially applicable in cases which have legal scenarios similar to the scenario that is considered in that Legal Opinion Text. More precisely, the related incidents, arguments, legal opinions and legal judgements are some of the information that can be used in a new similar legal scenario. As a result, legal officials make use of the information available in legal opinion texts to support their arguments related to a particular legal situation. Therefore, the development of automated systems that have the capability to support legal officials by extracting valuable information from legal documents such as legal opinion texts can be regarded as an impactful task.

This work is specifically focused on developing techniques to analyze sentiments in the descriptions that are available in Legal Opinion Texts. Sentiment analysis is a well known information extraction task that has several use cases over many domains. It can also be considered as an important but under-explored information extraction task in the legal domain. When a legal case is considered, two major parties can be identified. One party brings up the lawsuit, and that party is commonly identified as the plaintiff. The opposing party to the plaintiff is usually called the defendant. The legal opinion texts usually contain descriptions about the ways in which parties are related to a specific incident, the actions performed by the related parties in a considered event, and also about the arguments brought forward by each party while the legal case was proceeding. More importantly, legal opinion texts also contain legal opinions, or the opinions of judges related to a court case. Such opinions may have a direct impact on a party involved in a court case in a positive, neutral or negative manner. Performing sentiment analysis on these descriptions will enable the automatic identification of the type of impact a particular precedent, statute, legal opinion, incident or an argument may have on a considered party. This can also be considered as a key step when developing systems that are capable of predicting the outcome of a court case.

In addition to the opinions that are directly related to the conduct of the parties, legal opinion texts also provide interpretations related to previous judgements and also to statutes that are relevant to the legal case. Such opinions may elaborate on the justifications, purposes, drawbacks and loopholes that are associated with a particular statute or a precedent.
Moreover, the descriptions also contain information related to the proceedings of court cases, such as adjournment of the case and lack of evidence, which can be considered as factors that can directly have an impact on the outcomes.

For example, let's consider Sentence 1.1 of Example 1, which was extracted from a Legal Opinion Text [1]. It can be seen that the description in Sentence 1.1 is favorable to Lee, who is the subject of that sentence. So, Sentence 1.1 has a sentiment which is positive towards the subject of that sentence. If we consider Sentence 1.2, which is obtained from the same Legal Opinion Text [1], it can be seen that the description of the sentence is unfavorable to the subject of the sentence (the Government) and has a negative sentiment towards it.

Example 1
• Sentence 1.1: Lee has demonstrated that he was prejudiced by his counsel's erroneous advice.
• Sentence 1.2: The Government makes two errors in urging the adoption of a per se rule that a defendant with no viable defense cannot show prejudice from the denial of his right to trial.

When all of the above mentioned factors are considered, sentiment analysis on legal opinion texts can be considered as a task that can facilitate a wide range of use cases. Despite its potential and usefulness, the attempts to perform sentiment analysis in the legal domain are limited. This study aims to address this issue by developing a sentiment annotator that can identify sentiments in a given sentence/phrase extracted from the legal opinion texts related to the United States Supreme Court. Information that can be derived from such a sentiment annotator can then be adapted to facilitate further downstream tasks such as identifying advantageous and disadvantageous arguments for a particular party, contradictory opinion detection [2], and predicting outcomes of legal cases [3].

In order to develop a reliable sentiment annotator using supervised learning, it is required to have a large amount of labelled data to train the underlying classification model. However, creating such sophisticated datasets with manually annotated data (by domain experts) for a specialised domain like legal opinion texts is not practical due to extensive resource and time requirements [4, 5]. In a low resource setting, transfer learning can be used as a potential technique to overcome the requirement of creating a sophisticated dataset, by leveraging information available in an already labelled dataset from another domain to perform sentiment analysis in the target domain. The sentiment annotators that are being widely used with the English language are trained using data belonging to domains such as the movie review domain. Adapting these models directly into the legal domain will create drawbacks, especially due to negative transfer, which is a phenomenon that occurs due to dissimilarities between two domains. Domain specific usage of words and domain specific meanings and sentiment polarities of words can be considered as major reasons that cause negative transfer when adapting datasets/models from one domain to another domain [5].

In this thesis, we demonstrate novel techniques that can be effectively utilized to overcome drawbacks that occur because of negative transfer, when using a dataset from a source domain (other than the legal domain) to create information extraction tools for the legal domain.
The proposed methodologies are facilitated by an algorithmic approach developed to automatically identify words that can cause negative transfer when adapting a source dataset to the legal domain. Moreover, by utilizing the outcomes of the algorithmic approach, we propose two transfer learning mechanisms that enable the development of a legal sentiment annotator with a minimum amount of resources and human annotations. The sentiment annotators proposed in this study are capable of performing three-class sentiment classification, where a given sentence is classified as having a positive, negative, or neutral sentiment.

Our algorithmic approaches to perform sentiment analysis in the legal domain make use of modern word representation and language representation techniques. Therefore, as a part of this study, we have also carried out extensive experiments to evaluate the effectiveness of various word embedding and language representation techniques in the legal domain.

1.2 Research Objectives

Objectives of this research are as follows:
1. Developing a phrase level sentiment annotator to perform sentiment analysis on legal opinion texts.
2. Coming up with a novel methodology to mitigate the effect of negative transfer when adapting sentiment analysis datasets from other domains to the legal opinion text domain.
3. Evaluating the effectiveness of word embedding and language representation techniques in identifying words with similar meanings in the legal domain.

1.3 Contributions

Within this work, the following contributions have been made:
• Developed a sentiment annotator to analyze the sentiments of legal opinions in legal opinion texts.
• Proposed a transfer learning based approach to develop a legal sentiment annotator. Within the proposed approach, there is an algorithmic approach that exploits domain specific word representation techniques to overcome negative transfer.
• Developed a verb similarity dataset that provides information related to the similarity of verbs based on the context in which they are used, and made it publicly available to the research community.
• Evaluated the performances of different word representation models considering the task of identifying verbs with similar meanings in the legal domain.

1.4 Publications

• Gathika Ratnayaka, Nisansa de Silva, Amal Shehan Perera, and Ramesh Pathirana, "Effective Approach to Develop a Sentiment Annotator For Legal Domain in a Low Resource Setting".
- Conference: 34th Pacific Asia Conference on Language, Information and Computation (Published).
- CORE rank of the conference: B
• Gathika Ratnayaka, Nisansa de Silva, Amal Shehan Perera, Gayan Kavirathne, Thirasara Ariyarathna, and Anjana Wijesinghe, "Context Sensitive Verb Similarity Dataset for Legal Information Extraction".
- Journal: Data by Multidisciplinary Digital Publishing Institute (Published).
- Rank: CiteScore - Q2 (Information Systems and Management)

Chapter 2
LITERATURE SURVEY

2.1 Sentiment Analysis

Early methodologies of sentiment analysis [6] have made use of sentiment lexicons such as SentiWordNet [7], ANEW [8], and AFINN [9] to determine the sentiment of a textual unit. Different domains have been considered when developing such sentiment lexicons [9]. As a result, it can be observed that the sentiment polarity of a word and the strength of the sentiment associated with that particular word change from one sentiment lexicon to another.
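To make the lexicon based style of scoring concrete, the following is a minimal illustrative sketch; the tiny lexicon and its valence scores are hypothetical (AFINN-style integers) and are not taken from any of the cited resources.

```python
# Minimal sketch of lexicon based sentiment scoring.
# LEXICON is hypothetical; real lexicons such as AFINN assign
# integer valences to thousands of words.
LEXICON = {
    "prejudiced": -2,
    "erroneous": -2,
    "errors": -2,
    "favorable": 2,
    "demonstrated": 1,
}

def lexicon_sentiment(sentence: str) -> int:
    """Sum the valence scores of all lexicon words found in the sentence."""
    tokens = sentence.lower().split()
    return sum(LEXICON.get(token, 0) for token in tokens)

print(lexicon_sentiment("the government makes two errors"))  # -2 (negative)
```

Because the scores are fixed per word, any domain shift in a word's polarity directly misleads such a scorer, which is the limitation the deep learning approaches below try to overcome.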
With the recent development of machine learning and deep learning, it can be observed that techniques based on machine learning and deep learning are widely applied for sentiment analysis. The algorithms/models used in such techniques are developed to automatically capture the sentiment of a word while learning how the compositions of different words affect the overall sentiment of a considered text. The Recursive Neural Tensor Network (RNTN) proposed by Socher et al. [10] is a seminal work in this direction. The RNTN model has shown promising results for sentiment classification in the movie review domain. However, more recent approaches that make use of pretrained language models (e.g., BERT [11]) have surpassed the approaches that are based on recursive neural network architectures, becoming the state of the art for sentiment classification [12]. From this point onwards, the RNTN model proposed in [10] will be denoted as RNTN_m.

2.2 Sentiment Analysis in the Legal Domain

Even though the studies related to applying sentiment analysis to the legal domain are limited, the ways in which sentiment analysis can be used towards facilitating legal processes are being discussed in the law-tech community [13, 3]. Gamage et al. have proposed a methodology [4] to perform phrase level sentiment analysis on US legal opinion texts. However, certain limitations that occur when applying their methodology can be identified. The sentiment annotator proposed by Gamage et al. [4] focuses only on two sentiment classes, i.e. negative sentiment and non-negative sentiment, which can also be considered as a binary classification task. Identifying words that have different sentiments in the legal domain when compared to the movie domain can be considered as one of the key steps of the method proposed in [4]. However, the study [4] uses a manual approach to identify words with domain-specific sentiments, and the identification of such words had been performed manually by human annotators. Such a manual setting is not ideal in a low resource context, as manually going through a set of words of a significant size is tedious and sometimes infeasible. In this work, our intention is to come up with a methodology that uses a limited amount of human annotations to develop a reliable sentiment annotator for the legal domain while not compromising accuracy. The study by Sharma et al. [5] proposes an automatic approach that is based on word representations to minimize negative transfer. The key insight is to identify transferable words that can be used for cross domain sentiment classification. However, the approach proposed in the study [5] aims only at binary sentiment classification, i.e., positive and negative sentiment classes.

2.3 Word Vector Representations and Language Modelling Systems

In order to provide computers with the capability to understand human language or to extract useful information from natural language text, the textual information should be converted into a machine readable format. Therefore, one of the main requirements in Natural Language Processing is to convert a word or a text into a numerical representation. There are several word representation techniques that have been developed while taking the semantic, syntactic and contextual properties of words into account. Such techniques have proven useful when it comes to identifying similarities between words.
Word2Vec [14] and GloVe [15] can be considered as examples of neural word embedding approaches that create distributional similarity based representations for words. However, one key drawback in most of these approaches is that they provide only one representation for a word, although the same word can have many meanings/senses based on the context and also based on the considered domain. Sense2Vec [16], while being a distributional similarity based word embedding technique similar to Word2Vec and GloVe, attempts to provide multiple representations for a word based on the Part of Speech tag of the considered word. However, Sense2Vec and other approaches that are based on distributional similarity to create word representations do not consider the context associated with a word when providing an embedding/representation for a particular word. Consequently, these word representation approaches are not able to capture how the meaning of a word changes in relation to context and domain. This drawback is addressed in language modelling techniques such as BERT [11], ELMO [17], and XLNet [18], in which the sequential context associated with a word is considered. The models that are pre-trained based on such language modelling techniques can be used to obtain context based representations for words. Moreover, such models have become an integral part of most of the state of the art techniques related to many Natural Language Processing tasks.

2.4 Domain Adaptation

Transfer learning attempts to adapt models that are trained on one task (source task) to another task (target task). Existing literature demonstrates that drawbacks are common when adapting models trained using data from prior tasks (sources) to a low resource task (target) [4]. If we consider information extraction tasks, a model trained to perform a particular information extraction task in a considered domain may not work well for the same information extraction task in another domain. For example, it has been shown that sentiment analysis models that are trained using movie reviews (movie review domain) create drawbacks when they are adapted to the legal domain [4]. Such drawbacks are mainly due to the dissimilarities between the source and the target, which ultimately hinder the performance of the adapted model in the target domain. This phenomenon is known as negative transfer. Domain-specific behaviors of words (domain-specific terminology) and domain-specific semantics such as relationships between concepts/entities are considered to be major reasons that cause negative transfer when it comes to text classification tasks. However, it is still the case that transfer learning overall has positive effects. The current state of the art in most text understanding tasks uses pre-trained language models such as BERT [11], which allow general transfer of word knowledge. Then, this knowledge is transferred to perform specific tasks [12].

Active learning can be considered as another domain adaptation strategy, which aims to significantly minimize the resources needed to perform data annotations by automatically querying the data instances that are most informative for a learning model. For example, if we consider a domain adaptation task, the objective of active learning will be to find the data instances that will best teach a considered model the domain specific behaviors of the target domain.
Those selected instances will then be annotated by domain experts, but the number of data instances that need to be annotated will be significantly reduced. When it comes to active learning, there are various querying strategies that have been developed in order to identify the most important data instances to be annotated [19]. Another technique that can be used in low resource tasks is Data Augmentation [20]. In data augmentation, the objective is to increase the amount of training data by adding synthetic data that are created using the existing data.

2.5 Evaluation Resources on Verb Similarity

It is necessary to evaluate the applicability of different word representation/language modelling techniques in the legal domain. In that regard, we evaluated how different word representation and language modelling techniques perform when identifying verbs with similar meanings in the legal domain. The similarity measures that can be derived from these techniques can be used to determine how close the words are in the embedding spaces created by these word representation techniques. The study [2] describes an approach that can be used to classify verb pairs as verbs with similar meanings or not, by using a threshold based on the similarity value of the two given words. However, suitable datasets (evaluation resources) are imperative to identify such a threshold based on semantic similarity to classify a given verb pair as similar or dissimilar. Though the resources and datasets that provide information related to semantic similarity between words in the legal vocabulary are limited, the importance of developing such publicly available resources is discussed in recent literature related to computational legal reasoning [21]. In relation to this research direction, a study by Sugathadasa et al. [22] describes how word embedding techniques such as Word2Vec and traditional lexicon based semantic similarity methods can be combined to develop a more reliable legal domain-specific semantic similarity measurement. Their approach has been utilized in several legal information tasks such as ontology population [23, 24], deriving representative vectors [25], and retrieving similar documents [26].

Though there are evaluation resources such as SimLex-999 [27] and SimVerb-3500 [28], in which the similarity between verbs is annotated, those resources have not considered the impact the surrounding context can have on a considered word or a verb. Moreover, the contextual information related to the verbs is not available, which can create issues in interpreting the sense of a verb. The lack of contextual information will also limit the evaluation of models that are pretrained using language modelling techniques such as BERT. A dataset that has been developed while considering the context (based on the sentences) when annotating the similarity of two words is provided in the study [29]. However, the dataset [29] consists of only 399 verb-verb pairs. All these datasets [27, 28, 29] are focused on providing a rating for word pairs based on their similarity, but not on classifying them as similar or dissimilar. Also, as these datasets were not prepared focusing on the legal domain, the use of these resources to analyze the behavior of word representation techniques in the legal domain might create drawbacks. In order to overcome these issues and limitations, in this work we have introduced LeCoVe, a context based verb similarity dataset prepared considering the legal domain.
Chapter 3
Overall Methodology

3.1 Introduction

The aim of this chapter is to describe the overall flow of the approach proposed in this thesis to develop sentiment annotators for the legal domain. While explaining the flow of the overall methodology, this chapter also explains how the work that is described in Chapter 4 facilitates the approach described in Chapter 5 to achieve the ultimate objective of developing domain specific sentiment annotators for the legal domain.

3.2 Overall Flow

As shown in Figure 3.1, the data that is needed to develop and evaluate the legal context sensitive verb similarity dataset (which is described in Chapter 4), as well as to fine-tune and evaluate legal domain specific sentiment annotators, were extracted from the Legal Opinion Text corpus that is available in the SigmaLaw dataset [22].

Figure 3.1: Overall Flow of the Project

The main objective of this study is to develop sentiment annotators for the legal domain with minimal use of resources using transfer learning. In that regard, we make use of the already available models and datasets related to sentiment analysis in the movie review domain as the source models and source datasets respectively. In the process, one of the key steps is to identify words that have a legal domain specific sense or meaning. The notion of legal domain specific meaning can be elaborated in the following manner. If a sense or meaning of a considered word in the legal domain is different from that of the source domain (movie review domain), such a word will be known as a word with a legal domain specific sense (domain specific word). Otherwise, the word will be known as a domain generic word. We have come up with an approach to distinguish domain specific words from domain generic words using domain specific word representation models. The approach is described in a detailed manner in Chapter 5, and it can be briefly described as follows. For a given word w, we take the most similar word l(w) for w from a legal domain specific Word2Vec model. Similarly, the most similar word m(w) for w is taken from a movie review domain specific Word2Vec model. Then, the similarity value between l(w) and m(w) is derived from a legal domain specific word embedding model. If this similarity value between l(w) and m(w) is greater than or equal to a particular threshold, w is considered as domain generic. Otherwise, the word w will be considered as domain specific. To determine the threshold value which is used to distinguish domain specific words from domain generic words, we made use of the observations that were derived during our attempt to automatically identify verbs with similar meanings using the legal context sensitive verb similarity dataset LeCoVe. The sketch below illustrates this test.
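The following is a minimal sketch of the test, assuming two pre-trained gensim KeyedVectors models; the file names and the threshold value are illustrative placeholders (the actual threshold is derived empirically from the LeCoVe experiments), not the exact artifacts used in this study.

```python
from gensim.models import KeyedVectors

# Hypothetical file names; any Word2Vec models trained on the two corpora will do.
legal_kv = KeyedVectors.load("legal_word2vec.kv")   # legal opinion text model
movie_kv = KeyedVectors.load("movie_word2vec.kv")   # movie review model

def is_domain_generic(w: str, threshold: float = 0.75) -> bool:
    """True if w is domain generic, i.e. its nearest neighbours agree across domains.

    l_w: most similar word to w in the legal model, i.e. l(w).
    m_w: most similar word to w in the movie review model, i.e. m(w).
    The similarity between l(w) and m(w) is read from the legal model.
    """
    if w not in legal_kv.key_to_index or w not in movie_kv.key_to_index:
        return False  # unseen in one corpus: cannot be confirmed as generic
    l_w = legal_kv.most_similar(w, topn=1)[0][0]
    m_w = movie_kv.most_similar(w, topn=1)[0][0]
    if m_w not in legal_kv.key_to_index:
        return False
    return legal_kv.similarity(l_w, m_w) >= threshold
```

The design intuition is that if a word is used with the same sense in both corpora, its nearest neighbours in the two embedding spaces should themselves be close in the legal space.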
After identifying domain generic words and domain specific words, the legal domain specific sentiment of each word from the selected vocabulary is decided through an automated algorithmic approach that is developed within this study. The approach is described in a detailed manner in Chapter 5. After identifying the legal sentiments of words, we propose two mechanisms to develop legal sentiment annotators in a low resource setting. The first approach is a mechanism to adapt an existing model from the source domain (movie review domain) by changing the embeddings of words that have a different sentiment in the legal domain when compared with the sentiment in the movie review domain. The other approach is to modify the existing datasets related to sentiment analysis in the movie review domain; the modified dataset is then used to train sentiment annotators for the legal domain. The methodologies related to these two approaches are described in a detailed manner in Chapter 5. Additionally, Chapter 5 also explains the experimental settings that were used to evaluate the two proposed approaches, along with the corresponding empirical results obtained from the experiments.

Chapter 4
Evaluating Word Representation Techniques Using Verb Similarity

4.1 Task Definition

In our proposed approach to minimize negative transfer when developing a sentiment annotator for the legal opinion domain (target domain) using datasets from another domain (source domain), identifying words that have different senses (meanings) across the two domains was an integral step. In that regard, we decided to make use of domain specific word embeddings, and to evaluate the effectiveness of various word embedding techniques we have created a context sensitive verb similarity dataset for the legal domain.

To evaluate the effectiveness of different word embedding methods in identifying words with similar meanings, we focused on the task of identifying verbs with similar meanings in the legal domain. We chose verbs specifically because:
• Verbs are very important for understanding the meanings of sentences, as they have a significant impact on the meaning due to their semantic and syntactic properties [30, 31, 28].
• The argument structure of verbs is pivotal for many legal domain related natural language processing tasks such as argument extraction [32], sentiment analysis [4], discourse analysis and role labelling.
• Verbs are instrumental for understanding the semantics of an event and how different parties are connected to a particular event [28] (in legal opinion texts, much emphasis is given to the events/incidents related to the particular court case and the involved parties).

4.2 Motivation

As described in Section 2.5, most of the existing evaluation resources, including SimVerb-3500, are focused on rating semantic similarity between two words, rather than explicitly rating whether the two words in a word pair have a similar meaning or not. Additionally, in most of the current evaluation resources, the context has not been considered when rating the similarity between verbs. However, the sense of a word may change based on the context. For example, consider the sentences given in Example 2.

Example 2
• Sentence 2.1: Michael moved to United Kingdom.
• Sentence 2.2: Michael returned to Thailand.
• Sentence 2.3: Michael returned the balance to the customer.

If we compare the verb moved in Sentence 2.1 with the verb returned in Sentence 2.2, the senses of both words are related to mobility. But the verb returned in Sentence 2.3 has a sense of giving back. This example demonstrates the impact of context on the meaning of a word. Moreover, language modelling techniques such as XLNet [18], BERT [11] and ELMO [17] have surpassed traditional word embedding approaches (Word2Vec [14], Sense2Vec [16]) and have become the state of the art in several natural language processing tasks. However, in order to reap the maximum benefit from these language modelling techniques, it is important to take the context into consideration when evaluating the similarity between two textual units. Another important factor that determines the meaning of a word is the domain that is related to a document or a text.
The verb plea suggests a behaviour of requesting in a day to day context, while in the legal domain the same word often suggests a behavior of stating guilt or innocence. Therefore, it is important to prepare domain specific datasets for the legal domain in order to carry out comprehensive evaluations on the behavior of words.

The context based verb similarity dataset LeCoVe was developed using legal opinion texts related to United States criminal cases in order to overcome the above mentioned limitations in existing work (the dataset is publicly available at https://osf.io/bce9f/).

4.3 Dataset Preparation

The criminal court cases for the dataset were obtained by randomly picking criminal court cases from the publicly available legal opinion text corpus of the SigmaLaw dataset. Next, the sentences were extracted from the legal opinion text documents. Then, the sentences were split using Stanford CoreNLP [33]. The verb pairs were obtained from sentence pairs (one verb from each sentence). When creating the sentence pairs, sentences that are adjacent or only one sentence apart from each other in a legal opinion text were chosen. Such an approach was followed because it can be problematic to understand the context when the sentences are far away from each other. Given a sentence pair, the sentence that appears first in a legal opinion text is known as the target sentence. The other sentence in the same sentence pair is known as the source sentence.

The Stanford CoreNLP POS Tagger [34] was used to extract verbs from the sentences in a given sentence pair. Two lists were used to separately maintain the verbs from the source sentence and the target sentence. Verbs that are lemmatized into be or have were removed from the lists. Then the Wu-Palmer similarity score [35] of each possible verb pair that can be formed by taking one verb from the target list and one verb from the source list was considered. If the Wu-Palmer similarity score between the two verbs was greater than 0.75, the verb pair was added to the dataset; otherwise, the verb pair was not included. This step was taken as a measure of maintaining a proper balance between verb pairs with similar meanings and dissimilar meanings [2], and a sketch of it is given below. When a verb pair was chosen to be added to the dataset, the sentences that were used to extract the verb pair were also included in the same dataset. More precisely, the dataset contains information about the target sentence, the source sentence, the target verb, the source verb, the lemmatized form of the target verb and the lemmatized form of the source verb.
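A minimal sketch of the Wu-Palmer filtering step is shown below, assuming NLTK's WordNet interface. As a common simplification, the first verb synset of each lemma is used here; the exact sense-selection details may differ from the implementation used in this study.

```python
from itertools import product

from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def wup_score(verb_a: str, verb_b: str) -> float:
    """Wu-Palmer similarity between the first verb synsets of two lemmas."""
    synsets_a = wn.synsets(verb_a, pos=wn.VERB)
    synsets_b = wn.synsets(verb_b, pos=wn.VERB)
    if not synsets_a or not synsets_b:
        return 0.0
    return synsets_a[0].wup_similarity(synsets_b[0]) or 0.0

def candidate_pairs(target_verbs, source_verbs, threshold=0.75):
    """Keep only verb pairs whose Wu-Palmer similarity exceeds the threshold."""
    return [(t, s) for t, s in product(target_verbs, source_verbs)
            if wup_score(t, s) > threshold]

print(candidate_pairs(["move"], ["return", "argue"]))  # e.g. [('move', 'return')]
```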
4.4 Annotation of Verb Pairs

As the first step of the annotation process, all the human annotators were provided with a proper understanding of the two classes (Similar, Dissimilar) to which the verbs would be classified. A set of examples containing pre-identified data points related to each class was used to provide this understanding to the annotators. Next, the understanding of the annotators was further tested by discussing the thought process related to the annotation of randomly selected examples. Then, the human annotators were instructed to annotate each verb pair based on their similarity. More precisely, each verb pair was annotated either as a verb pair with similar meaning or as a verb pair with dissimilar meaning. When providing the annotation, the annotators were instructed to interpret the meanings of the verbs while taking the context into consideration using the corresponding sentences. The annotators were instructed to mark 1 for similar verb pairs and 0 for dissimilar verb pairs. Annotators were also instructed to give a score from 1 to 10 for each annotation, based on how confident they were in their annotation of the considered verb pair. The key statistics that were identified after the annotation process are shown in Table 4.1.

Table 4.1: Frequency Statistics of LeCoVe

Feature | Number of Verb Pairs
Two verbs with similar meaning (agreed by 3 human annotators) | 170
Two verbs with similar meaning (agreed by at least 2 human annotators) | 285
Two verbs with similar meaning (agreed by at least 1 human annotator) | 463
Verb pair with same lemmatized form, but different meaning (considering majority agreement) | 6
Verb pair with same lemmatized form and similar meaning | 144
Number of unique verb pairs (lemmatized form) | 714

4.5 Experiments and Evaluations

The annotation of verb pairs was performed by four human annotators. However, a given verb pair was annotated only by three human annotators. As a result, the annotators who annotated one pair may be different from the annotators of another pair. Therefore, the inter-rater reliability of the annotation process was measured using Fleiss' kappa [36]. A kappa value of 0.57 was observed. As interpreted in the study [37], the kappa value of 0.57 falls into the range of the moderate agreement level.

As the next step, models created using different word representation techniques were evaluated using the annotated dataset (LeCoVe). The evaluation was performed in order to get a proper understanding of the ability of these models to identify whether two given verbs have a similar meaning in the legal domain or not. Such evaluations can also be used to get an idea of the effectiveness of the considered models in the legal domain.

4.5.1 Evaluation Resources

This section provides a detailed description of the models which have been evaluated using LeCoVe.

Word2Vec Models

We considered three Word2Vec models available in the SigmaLaw dataset [22] (the SigmaLaw dataset can be found at https://osf.io/qvg8s/). These three models have been trained using a corpus of legal opinion text.
• Word2Vec (LR) - Trained using the raw legal opinion text corpus.
• Word2Vec (LL) - Trained using the lemmatized legal opinion text corpus.
• Word2Vec (LLR) - Trained using the lemmatized legal opinion text corpus and then enhanced for lexical similarity.

The publicly available Word2Vec model which has been trained by Google using the Google News corpus was also considered for the evaluations. From this point onwards, the Google News Word2Vec model will be denoted by Word2Vec (G).

Sense2Vec Models

Word2Vec provides only one representation for a given word. However, Sense2Vec provides multiple vector representations for a single word. In other words, the noun form and the verb form of a word will be provided with the same representation by Word2Vec, but a Sense2Vec model will provide two different representations for the noun form and the verb form of a word. In order to train the Sense2Vec models, each word in the legal opinion text corpus available at SigmaLaw [22] was lemmatized. Then, the POS tag related to each of the lemmatized words was appended to each considered word. Spacy¹ was used to obtain the POS tags of the words. Using the modified corpus, three Sense2Vec models were trained² (a sketch of this corpus preparation is shown below).
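The following is a minimal sketch of this corpus preparation and training, assuming spaCy's small English model and gensim's Word2Vec; the toy corpus, the token format, and the lowered min_count are illustrative, while the actual hyperparameters of the trained models are given in Table 4.2.

```python
import spacy
from gensim.models import Word2Vec

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def to_sense_tokens(text: str) -> list[str]:
    """Lemmatize each word and append its POS tag, e.g. 'returned' -> 'return|VERB'."""
    return [f"{tok.lemma_}|{tok.pos_}" for tok in nlp(text) if tok.is_alpha]

# corpus: an iterable of raw sentences from the legal opinion texts (toy example here).
corpus = ["Lee has demonstrated that he was prejudiced by erroneous advice."]
sentences = [to_sense_tokens(s) for s in corpus]

# Skip-gram with negative sampling, mirroring the SG-10 configuration of Table 4.2
# (min_count lowered to 1 only so this toy corpus produces a vocabulary).
model = Word2Vec(sentences, vector_size=128, window=10, min_count=1,
                 sg=1, negative=5, epochs=10)
```

Because the POS tag is part of the token, the noun and verb uses of the same surface form end up with distinct vectors, which is the core idea behind Sense2Vec.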
Table 4.2 illustrates the key parameters that were used in the training of the Sense2Vec models.

Table 4.2: Sense2Vec Parameter Configurations

Parameter | SG-2 | CBOW-10 | SG-10
Model | Skip-gram | CBOW | Skip-gram
Size (Dimensionality) | 128 | 128 | 128
Min. Count | 5 | 10 | 10
Context Window Size | 5 | 10 | 10
Training Algorithm | Negative Sampling | Hierarchical Softmax | Negative Sampling
Number of Iterations | 2 | 10 | 10

Moreover, the publicly available Reddit Vectors 1.1.0 Sense2Vec model was considered in our experiments. From this point onwards, the Reddit Vectors 1.1.0 model is denoted as Sense2Vec(R).

¹ https://spacy.io/
² The Sense2Vec and BERT models developed in this study are available at https://osf.io/s8dj6/

BERT

BERT [11] is a popular language modelling technique that is being used for many NLP tasks. Unlike the word embeddings provided by Word2Vec/Sense2Vec, the representation that BERT provides for a word can vary based on the surrounding context of the word. The publicly available pre-trained BERT model ('bert-base-uncased') was used in our experiments. Moreover, we made use of the implementation mechanisms provided by the Transformers³ library. The pretrained BERT model ('bert-base-uncased'), which is publicly available, was trained using a very large Wikipedia corpus and a book corpus. In order to post train the BERT model, a corpus was created using the criminal court cases available in the SigmaLaw dataset. Following the instructions for BERT training as provided in the BERT implementation repository by Google Research⁴, the legal corpus was modified to suit the post training of the BERT model. From this point, the BERT implementation by Google Research will be denoted as BERT(G). Only the sentences with more than 4 tokens were considered for the training corpus. This step was followed to overcome the issues that can occur when splitting the sentences. The text dataset prepared for post training of BERT consists of 90,851 sentences. Then, the prepared text dataset was used to post train the 'bert-base-uncased' model. In the training phase, BERT is designed to learn two tasks. The first task is masked language modelling, in which the task is to predict the tokens which are masked. The second task is next sentence prediction. In the post training phase of the 'bert-base-uncased' model using the legal text data, the performances of the post trained model after one training iteration and after 500 training iterations were observed. The observations are included in Table 4.3. The observations demonstrate that the accuracy of the BERT model on the legal text data is low after the first iteration. However, there is a significant improvement in accuracy when the model is further trained for 500 training iterations. From this point onwards, the BERT model which has been post trained using legal text data will be denoted by BERT(L). A sketch of such post training is shown below.

³ https://github.com/huggingface/transformers
⁴ https://github.com/google-research/bert

Table 4.3: Post training of BERT using criminal court case corpus

No. of Training Steps | Masked LM Accuracy | Masked LM Loss | Next Sentence Accuracy | Next Sentence Loss
1 | 0.55 | 2.71 | 0.60 | 2.47
500 | 0.70 | 1.42 | 0.95 | 0.14
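The study post trained BERT with Google Research's original implementation, which optimizes both masked language modelling and next sentence prediction. As an illustration only, the sketch below shows continued masked language model pretraining with the Hugging Face Transformers library, assuming a hypothetical plain text file legal_corpus.txt with one sentence per line; it omits the next sentence prediction objective.

```python
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# legal_corpus.txt: hypothetical file with one legal opinion sentence per line.
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly masks 15% of the tokens, the standard BERT MLM setting.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-legal", max_steps=500),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # afterwards, persist with trainer.save_model()
```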
4.5.2 Evaluation of the distributional word representation models

The steps related to the evaluations of the models that are based on Word2Vec or Sense2Vec are shown below.
• The cosine similarity of the two vectors (the vector representations/embeddings of the two verbs in the considered verb pair) is calculated. Given a verb pair, let U be the embedding of the source verb and V be the embedding of the target verb. The cosine similarity between U and V will be in the range of -1 to 1.
• The cosine similarity between U and V is linearly scaled to be in the range of 0 to 1. Let sv be the value obtained after scaling. Then, sv = (cos(U, V) + 1) / 2. The sv value was considered as the similarity score between the two verbs corresponding to U and V.
• After obtaining the similarity score between two verbs using a word embedding model, it is checked whether the similarity score is greater than or equal to a predefined threshold value. If so, the verb pair is considered to be classified (by the considered model) as having two verbs with a similar meaning. Otherwise, the verb pair is considered to be classified as having two verbs with dissimilar meanings.
• When evaluating Word2Vec models or Sense2Vec models that were trained using a lemmatized legal opinion text corpus, the lemmatized forms of the verbs were considered.
• Then the classifications obtained using each model were compared with the ground truth, which is the human annotations. In LeCoVe, each pair of verbs was annotated by three human annotators. The class (Similar/Dissimilar) of a verb pair is determined based on the majority agreement, i.e. the class agreed by at least two human annotators.
• Each of the considered models was evaluated while varying the value of the pre-defined threshold that is used to classify a verb pair as similar or dissimilar.
• The evaluations were carried out in relation to the Similar class, because the intention of this experiment was to evaluate the capability of the considered models to identify the verb pairs with similar meanings.

Table 4.4 provides the results obtained from the evaluations. Precision and Recall values were calculated as follows. Let C be the set of verb pairs classified by the system as having verbs with similar meanings, and let D be the set of verb pairs having verbs with similar meanings according to the human annotations. Then,

Precision = |C ∩ D| / |C|    (4.1)
Recall = |C ∩ D| / |D|    (4.2)

Table 4.4: Recall (R) and F-Score (F) received for different thresholds of considered Word2Vec/Sense2Vec models

Model | 0.60 (R/F) | 0.65 (R/F) | 0.70 (R/F) | 0.75 (R/F) | 0.80 (R/F) | 0.85 (R/F) | 0.90 (R/F)
Word2Vec(G) | 0.85/0.62 | 0.75/0.70 | 0.64/0.71 | 0.52/0.66 | 0.45/0.60 | 0.33/0.49 | 0.29/0.45
Word2Vec(LR) | 0.80/0.65 | 0.74/0.70 | 0.68/0.72 | 0.60/0.70 | 0.53/0.66 | 0.41/0.57 | 0.33/0.49
Word2Vec(LL) | 0.80/0.51 | 0.72/0.57 | 0.62/0.62 | 0.55/0.66 | 0.52/0.67 | 0.51/0.67 | 0.51/0.66
Word2Vec(LLR) | 0.79/0.65 | 0.74/0.70 | 0.66/0.72 | 0.62/0.73 | 0.60/0.72 | 0.55/0.70 | 0.54/0.69
Sense2Vec(R) | 1.00/0.46 | 1.00/0.47 | 0.98/0.50 | 0.86/0.54 | 0.71/0.62 | 0.58/0.64 | 0.52/0.67
Sense2Vec(SG-2) | 0.92/0.55 | 0.83/0.58 | 0.78/0.64 | 0.70/0.67 | 0.65/0.71 | 0.62/0.72 | 0.56/0.70
Sense2Vec(CBOW-10) | 0.94/0.53 | 0.89/0.63 | 0.81/0.69 | 0.68/0.69 | 0.64/0.72 | 0.61/0.72 | 0.59/0.72
Sense2Vec(SG-10) | 0.82/0.62 | 0.75/0.66 | 0.71/0.69 | 0.65/0.74 | 0.61/0.73 | 0.56/0.70 | 0.55/0.69

A sketch of this threshold based evaluation is given below.
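The following is a minimal sketch of the scaled cosine similarity computation and the precision/recall calculation using NumPy; the variable names and the example threshold are illustrative.

```python
import numpy as np

def scaled_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity linearly scaled from [-1, 1] to [0, 1] (the sv value)."""
    cos = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return (cos + 1) / 2

def evaluate(pairs, gold, threshold=0.75):
    """pairs: {pair_id: (U, V)}; gold: set of pair_ids labelled Similar."""
    predicted = {pid for pid, (u, v) in pairs.items()
                 if scaled_similarity(u, v) >= threshold}
    tp = len(predicted & gold)                              # |C ∩ D|
    precision = tp / len(predicted) if predicted else 0.0   # equation 4.1
    recall = tp / len(gold) if gold else 0.0                # equation 4.2
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```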
4.5.3 Deriving Embeddings for Words using BERT

The representations that can be obtained from language modelling techniques like BERT are also called contextual word embeddings, as the provided representations consider the context around the considered word. As a result, unlike with the traditional word embedding models like Word2Vec/Sense2Vec (static vector representations), the representation provided by BERT for the same word can change based on the context (dynamic vector representation). Also, unlike when using Word2Vec/Sense2Vec, when we try to obtain the representation of a verb using BERT, it is necessary to provide the sentence (context) in which the verb resides in order to obtain the embedding of that verb.

For our experiments with BERT, we use the 'bert-base-uncased' model. It consists of 12 hidden layers. In a single layer, each token in a sentence is represented by 768 hidden units. Promising results have been shown in previous studies when a contextual word embedding is obtained by averaging the representations provided for the considered word by the last 4 hidden layers. Therefore, the same methodology was followed in this work to obtain the contextual word embeddings. However, due to the tokenization mechanism of BERT, contextual embeddings for some words could not be obtained directly. Substantiate can be considered as one of those verbs where the tokenization causes issues in obtaining the embedding: substantiate is tokenized into the sub-tokens sub, ##stan, ##tia, ##te. To address this issue, we first identify each sub-token of a word and get the corresponding vector representation (embedding) for each sub-token. Then, we take the mean of the sub-token embeddings as the contextual vector representation for the word. However, there can be situations where a sub-token of a word is the lemma of that particular word. In such cases, the contextual embedding of that sub-token is directly taken as the contextual embedding of the verb. The results obtained following this sub-token embedding based approach are shown under the BERT(G) Improved and BERT(L) Improved models in Table 4.5.

Following the above mentioned approach, the contextual embeddings for each verb in the verb pairs of LeCoVe were obtained from both BERT models, BERT(G) and BERT(L). Then, both models were evaluated considering the cosine similarity of the verb embeddings, following a similar approach to the one described in Section 4.5.2. However, unlike in the approach described in Section 4.5.2, we considered the cosine similarity values as they are (without performing linear scaling on the cosine similarity values). The results obtained from the experiments are shown in Table 4.5.

Table 4.5: Recall (R) and F-Score (F) received for different thresholds of BERT based approaches

Model | 0.50 (R/F) | 0.525 (R/F) | 0.55 (R/F) | 0.60 (R/F) | 0.65 (R/F) | 0.70 (R/F) | 0.75 (R/F)
BERT(G) | 0.80/0.65 | 0.77/0.69 | 0.72/0.68 | 0.65/0.71 | 0.57/0.68 | 0.45/0.60 | 0.34/0.50
BERT(G) Improved | 0.85/0.66 | 0.81/0.70 | 0.75/0.69 | 0.67/0.72 | 0.59/0.69 | 0.46/0.61 | 0.38/0.54
BERT(L) | 0.72/0.71 | 0.69/0.71 | 0.65/0.70 | 0.58/0.69 | 0.50/0.64 | 0.50/0.64 | 0.38/0.54
BERT(L) Improved | 0.75/0.72 | 0.72/0.73 | 0.66/0.71 | 0.60/0.70 | 0.51/0.65 | 0.39/0.54 | 0.27/0.43

A sketch of the sub-token based embedding extraction is given below.
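The following is a minimal sketch of deriving a contextual verb embedding, assuming the Hugging Face Transformers library: it averages the last four hidden layers and then averages over the verb's sub-token positions, as described above. The naive sub-token position lookup is a simplification for illustration.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def verb_embedding(sentence: str, verb: str) -> torch.Tensor:
    """Contextual embedding of `verb` in `sentence` (a 768-dim vector)."""
    encoding = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**encoding).hidden_states  # tuple of 13 layers
    # Average the last 4 hidden layers: shape (seq_len, 768).
    token_vectors = torch.stack(hidden_states[-4:]).mean(dim=0)[0]
    # Naive lookup of the verb's sub-token positions; assumes they occur once.
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
    sub_tokens = tokenizer.tokenize(verb)
    positions = [i for i, t in enumerate(tokens) if t in sub_tokens]
    # Mean over the sub-token embeddings gives the verb's representation.
    return token_vectors[positions].mean(dim=0)

emb = verb_embedding("Lee moved to quash the indictment.", "moved")
```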
4.5.4 Evaluating models based on most similar words

As another way of evaluating Word2Vec/Sense2Vec models on their usefulness in identifying verbs with similar meanings, the topmost words predicted as most similar to a given verb can be considered. A verb pair consists of two verbs: the source verb (which is taken from the source sentence) and the target verb (which is taken from the target sentence). When evaluating a considered model using this approach, the following aspects related to the source verb and the target verb are considered.
• most similar words source: the list of the first k words predicted as the most similar words to the lemmatized form of the source verb.
• most similar words target: the list of the first k words predicted as the most similar words to the lemmatized form of the target verb.
• Condition 1: the lemmatized form of the target verb is in most similar words source.
• Condition 2: the lemmatized form of the source verb is in most similar words target.

If Condition 1 or Condition 2 is true, the two verbs (source verb and target verb) are classified as having a similar meaning. Furthermore, if the lemmatized forms of the two verbs are exactly the same, such verb pairs were also considered as having a similar meaning. The approach described here was evaluated using the Word2Vec and Sense2Vec models in relation to different k values, as shown in Table 4.6. Similar to Section 4.5.2, equation 4.1 and equation 4.2 have been used to measure precision and recall respectively, where C is the set of verb pairs classified by this approach as having verbs with similar meanings and D is the set of verb pairs having verbs with similar meanings according to the human annotations. A sketch of this most-similar-words check is given below.

Table 4.6: Precision (P), Recall (R) and F-Measure (F) received by considering k most similar words predicted by models

Model | k=5 (P/R/F) | k=10 (P/R/F) | k=15 (P/R/F) | k=20 (P/R/F)
Word2Vec (LR) | 0.83/0.60/0.69 | 0.78/0.62/0.69 | 0.77/0.67/0.71 | 0.73/0.68/0.70
Word2Vec (LL) | 0.85/0.55/0.67 | 0.76/0.60/0.67 | 0.71/0.61/0.66 | 0.67/0.61/0.63
Word2Vec (LLR) | 0.87/0.63/0.73 | 0.83/0.66/0.74 | 0.76/0.67/0.71 | 0.75/0.68/0.72
Sense2Vec (SG-2) | 0.93/0.59/0.73 | 0.88/0.61/0.72 | 0.84/0.64/0.73 | 0.84/0.64/0.73
Sense2Vec (CBOW-10) | 0.82/0.63/0.71 | 0.75/0.65/0.70 | 0.73/0.68/0.70 | 0.71/0.69/0.70
Sense2Vec (SG-10) | 0.92/0.61/0.74 | 0.87/0.64/0.73 | 0.85/0.64/0.73 | 0.82/0.65/0.73
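A minimal sketch of the most-similar-words check with gensim KeyedVectors; kv is assumed to be one of the trained models, with both lemmas present in its vocabulary.

```python
def similar_by_topk(kv, source_lemma: str, target_lemma: str, k: int = 10) -> bool:
    """Classify a verb pair as Similar using the k most similar words."""
    if source_lemma == target_lemma:        # identical lemmas count as similar
        return True
    top_source = {w for w, _ in kv.most_similar(source_lemma, topn=k)}
    top_target = {w for w, _ in kv.most_similar(target_lemma, topn=k)}
    # Condition 1 or Condition 2 from Section 4.5.4.
    return target_lemma in top_source or source_lemma in top_target
```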
4.5.5 Evaluating BERT models based on most similar words

In BERT, the model is trained to correctly predict the tokens in a token sequence which have been masked or corrupted. This training mechanism itself can be exploited to identify verbs which convey similar meanings. However, when following this approach, it is necessary to consider the sentence which was used to extract the considered verb. Let moved and returned, as in Sentences 2.1 and 2.2 of Example 2, be the verb pair which needs to be classified either as two verbs with similar meaning or two verbs with dissimilar meaning. Here, Sentence 2.1 is the target sentence and Sentence 2.2 is the source sentence, making moved the target verb and returned the source verb. A BERT based approach can be used to classify the verb pair using the following procedure. First, the verb moved in the target sentence is replaced with the token [MASK], thus corrupting the target sentence. Then, the corrupted sentence can be input into the pretrained BERT model⁵, and the first k tokens that are predicted by the model as the tokens which should replace the [MASK] token can be considered. If returned (the other verb in the verb pair) is in those first k predictions, it can be considered that returned also has a significant similarity to moved in the considered context. The same procedure can be followed with the other sentence as well. Here, a suitable value for k has to be decided empirically. Our experiments considering different k values are provided in Table 4.7.

⁵ For our experiments we have used https://github.com/huggingface/pytorch-transformers

For each value of k, we have considered four approaches when determining whether a verb pair is similar or not. These approaches are developed considering different conditions:
• Condition 1 (C1): the source verb is in the first k predictions when the corrupted target sentence is input to the model.
• Condition 2 (C2): the target verb is in the first k predictions when the corrupted source sentence is input to the model.
• Condition 3 (C3): the lemma of the source verb is in the first k predictions for the corrupted target sentence.
• Condition 4 (C4): the lemma of the target verb is in the first k predictions for the corrupted source sentence.

We have evaluated different approaches based on these four conditions, as shown in Table 4.7. In the table, C1 AND C2 indicates that if both C1 and C2 are satisfied, then the verb pair is considered to be classified by the system as having two verbs with similar meaning. The other approaches mentioned in the table can be interpreted in the same manner. Precision and Recall for each k value were also calculated as described in equations 4.1 and 4.2. A sketch of the masked-prediction check is given below.

Table 4.7: Precision (P), Recall (R) and F-Measure (F) received from different approaches based on BERT

Approach | k=10 (P/R/F) | k=25 (P/R/F) | k=50 (P/R/F)
C1 AND C2 | 0.90/0.10/0.18 | 0.83/0.18/0.29 | 0.82/0.24/0.37
C1 OR C2 | 0.68/0.29/0.41 | 0.62/0.39/0.48 | 0.53/0.48/0.50
(C1 OR C3) AND (C2 OR C4) | 0.88/0.10/0.18 | 0.80/0.18/0.30 | 0.79/0.27/0.40
(C1 OR C3) OR (C2 OR C4) | 0.63/0.41/0.50 | 0.59/0.55/0.57 | 0.49/0.65/0.55
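A minimal sketch of the masked-prediction check, assuming the Hugging Face fill-mask pipeline with the 'bert-base-uncased' model; the k value and the example sentences are illustrative.

```python
from transformers import pipeline

# fill-mask returns the most probable replacements for the [MASK] token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def masked_topk(sentence: str, verb: str, k: int = 25) -> list[str]:
    """Corrupt `sentence` by masking `verb`, return the top-k predicted tokens."""
    corrupted = sentence.replace(verb, fill_mask.tokenizer.mask_token, 1)
    return [p["token_str"] for p in fill_mask(corrupted, top_k=k)]

target = "Michael moved to United Kingdom."
source = "Michael returned to Thailand."
# Condition C1: source verb among the predictions for the corrupted target sentence.
c1 = "returned" in masked_topk(target, "moved")
# Condition C2: target verb among the predictions for the corrupted source sentence.
c2 = "moved" in masked_topk(source, "returned")
print(c1 or c2)  # the 'C1 OR C2' decision rule
```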
The Sense2Vec (CBOW-10) model has the highest F-Scores when the threshold is above 0.8, whereas Sense2Vec (SG-10) and Word2Vec (LLR) outperform the other models when the threshold values are between 0.4 and 0.8. This can be considered a good indication that techniques such as ensemble modelling, where several models are combined, could be used to achieve better performance.

When interpreting the results of the BERT based approaches shown in Table 4.5, the following observations can be made.

• For all four BERT based approaches, the highest F-Scores are in the range of 0.71-0.73. These results are comparable with the results obtained using the Word2Vec/Sense2Vec models.
• The results also demonstrate that the improvements introduced in this study for deriving contextual word embeddings by considering the embeddings of subtokens have enhanced the performance of both the BERT (G) model and the BERT (L) model.
• The BERT (L) Improved approach, where the BERT model was post-trained using a legal domain corpus and then further improved by considering the embeddings of subtokens when obtaining the representation of a considered word, has outperformed all other BERT based approaches in identifying verbs with similar meanings in legal opinion texts.

4.6 Discussion

In summary, this chapter described the preparation of LeCoVe, a legal context based verb similarity dataset, and the motives behind developing it. LeCoVe has been made publicly available to the research community. As described in this chapter, we have also evaluated the performance of several word representation and language modelling techniques in the legal domain using LeCoVe. In addition to evaluating existing Word2Vec models developed for the legal domain, we have developed new legal domain specific Sense2Vec models for these evaluations. Additionally, we post-trained the 'bert-base-uncased' BERT model using a legal opinion text corpus, and using the BERT models we demonstrated that LeCoVe can be used to leverage the contextual word representations provided by language modelling techniques such as BERT. In other words, as LeCoVe provides the context associated with each verb rather than only the verb itself, it enables the evaluation of language representation models that consider the sequential context associated with a text.

Chapter 5
Developing a Legal Sentiment Annotator in a Low Resource Setting

5.1 Task Definition

This chapter explains how a sentiment analysis dataset from another domain (the source domain) can be effectively utilized to develop a sentiment annotator for the legal domain (the target domain) while minimizing negative transfer.

5.2 Methodology

5.2.1 Detecting words that can cause negative transfer

Minimizing the resources required to develop a reliable sentiment annotator for the legal domain is our main objective. In that regard, our intention is to utilize resources from a source domain (a domain with an adequate amount of labeled resources) to perform sentiment analysis in the legal domain (a low resource target domain). The Stanford Sentiment Treebank (SST-5) [10] was considered as the source dataset; it consists of Rotten Tomatoes movie reviews annotated according to their sentiments. The words available in the source dataset can be assigned to three main categories, as shown below.
• Domain Generic words - Words that behave in a similar manner in both domains (the movie review domain and the legal domain).

• Domain Specific words - Words that behave differently in the two domains. In other words, the most frequently used sense of such a word in the source domain may differ from its most frequent sense in the target domain. Thus, these words have the potential to cause negative transfer, and a word belonging to this set may have different sentiment polarities across the two domains.

• Under Represented words - Words that occur frequently in the target dataset (legal domain), but occur very infrequently in, or are entirely absent from, the source dataset.

Manually identifying the domain specific words, domain generic words, and under represented words by going through every word in the legal opinion text corpus is not feasible given the limited human resources. The sequence of steps that was followed in order to minimize the manual annotations is described below.

• The stop words in the considered legal opinion text corpus were removed. The Van stop list [38] was used to identify stop words.
• The word frequency, i.e. the number of occurrences of a word within the corpus, was calculated for each word in the legal opinion text corpus.
• The words were then sorted in descending order of word frequency to obtain the sorted set D. Let $k = \min\{\, j \in \mathbb{Z}^{+} \mid \sum_{i=1}^{j} w_i \geq 0.95 \cdot \sum_{i=1}^{n} w_i \,\}$, where $w_i$ is the frequency of the $i$-th word of D and $n$ is the cardinality of D. The first k words of D were then chosen as the set of words S that will be considered when identifying words that can cause negative transfer.

Next, the sentiment of each word in S was annotated using the Stanford Sentiment Annotator (RNTN_m). Based on the sentiments annotated by the Stanford Sentiment Annotator, the words were distributed into three sets P_m, N_m, and O_m:

• P_m - The set of words annotated as Very Positive or Positive; |P_m| = 336.
• N_m - The set of words annotated as Very Negative or Negative; |N_m| = 253.
• O_m - The set of words annotated as having a Neutral sentiment; |O_m| = 4992.

As |O_m| = 4992, it is difficult to manually identify the words in O_m that have different sentiments across the two domains. A heuristic approach to identify the words in O_m whose sentiments deviate across the domains was developed to overcome this challenge. Moreover, in our algorithmic approach, described by Algorithm 1, Algorithm 2, and Algorithm 3, words with deviated sentiments are identified while automatically assigning each word a legal sentiment. Note that Algorithm 1, Algorithm 2, and Algorithm 3 are three parts of the same algorithm. Though it is feasible to manually annotate all the words in P_m and N_m, we have developed our algorithmic approach to automatically identify words that can have deviated sentiments in P_m and N_m as well (Algorithm 3). Such an automated heuristic approach is useful because it minimizes the number of required manual annotations; approaches of this kind are also useful for the automatic generation of domain specific sentiment lexicons with minimal human intervention. We have derived the following two key pieces of information from the word embedding models in order to facilitate the method we developed to distinguish domain specific words from domain generic words.

• Cosine_domain(u, v) - The cosine similarity between the embeddings of two words u and v.
• mostSimilar_domain(w) - The most similar word to a given word w, as output by the considered word embedding model.

Domain specific word embeddings have been utilized within our approach to distinguish domain specific words from domain generic words. The Word2Vec model publicly available with the SigmaLaw dataset [22], which was trained using a United States legal opinion text corpus, was selected as the legal domain specific word embedding model. The SST-5 dataset does not contain an adequate amount of text to serve as a corpus for training an effective word embedding model; therefore, we selected the IMDB movie review corpus [39] to train the movie review domain specific Word2Vec embedding model. From this point onwards, the following notations will be used:

• Cosine_legal will be denoted by Cosine_l
• Cosine_movie-reviews will be denoted by Cosine_m
• mostSimilar_legal(w) will be denoted by l(w)
• mostSimilar_movie-reviews(w) will be denoted by m(w)

First, for a given word w, we obtain l(w) and m(w). As Word2Vec [40] embeddings are based on distributional similarity, it can be assumed that the most similar word output by a domain specific embedding model for a particular word relates to the domain specific sense of that word. For example, convicted is obtained as l(charged). The word convicted is associated with the sense of accusation, which is the most frequent sense of charged in the legal domain. However, for m(charged), sympathizing is obtained as the output; sympathizing is associated with the sense of being filled with excitement or emotion, which is the most frequent sense of charged in movie reviews. After obtaining the most similar words for a given word w, we define a value domainSimilarity(w) such that domainSimilarity(w) = Cosine_l(l(w), m(w)). As the legal embedding model is used to obtain the cosine similarity, a higher domainSimilarity(w) value suggests that the legal sense and the movie-review sense of the word w have similar meanings in the legal domain, while a lower domainSimilarity(w) suggests that the meanings of the two senses are less similar to each other. For example, the value obtained for domainSimilarity(charged) was 0.06, while domainSimilarity(convicted) was 0.53 (convicted has a similar sense across the two domains).

The next step is to identify a threshold on domainSimilarity(w) to heuristically distinguish whether a word w is domain generic or not. In that regard, we made use of LeCoVe. As described in Chapter 4, our approach to identifying verbs with similar meanings can be briefly described as follows: a threshold t based on cosine similarity is defined, and for two given verbs v_i, v_j, if Cosine_l(v_i, v_j) ≥ t, the two verbs are considered to have a similar meaning. We identified that the same approach can be used to determine whether l(w) and m(w) have a similar meaning, but a suitable word representation model and a cosine similarity threshold first had to be identified. Based on the experimental results given in Chapter 4, the Word2Vec (LLR) model was selected because it supports non-lemmatized tokens and has outperformed, or been on par with, the other models that support non-lemmatized tokens when it comes to capturing the legal sense of words.
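To make the computation concrete, the following is a minimal sketch of domainSimilarity(w) using the gensim library. The model file names are hypothetical placeholders for the legal (SigmaLaw) and IMDB movie review Word2Vec models, and out-of-vocabulary handling is omitted.

```python
# Sketch of the domainSimilarity(w) computation, assuming two pre-trained
# Word2Vec models saved in gensim's KeyedVectors format. The file names
# below are placeholders, not the actual released artifacts.
from gensim.models import KeyedVectors

legal_kv = KeyedVectors.load("legal_word2vec.kv")   # SigmaLaw-style model
movie_kv = KeyedVectors.load("imdb_word2vec.kv")    # IMDB review model

def domain_similarity(w: str) -> float:
    """Cosine_l(l(w), m(w)): compare the two domain senses in the legal space."""
    l_w = legal_kv.most_similar(w, topn=1)[0][0]   # most similar word, legal domain
    m_w = movie_kv.most_similar(w, topn=1)[0][0]   # most similar word, movie domain
    # Note: m_w may be missing from the legal vocabulary; a real
    # implementation needs out-of-vocabulary handling here.
    return float(legal_kv.similarity(l_w, m_w))

# A word is treated as domain generic when domainSimilarity(w) is at or above
# the threshold derived in the following paragraphs (cosine similarity 0.2).
for word in ["charged", "convicted"]:
    score = domain_similarity(word)
    label = "domain generic" if score >= 0.2 else "domain specific"
    print(f"{word}: {score:.2f} -> {label}")
```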
A cosine similarity value of 0.2 was selected as the threshold for identifying domain generic words based on the domainSimilarity(w) score. (Table 4.4 shows similarity values after linearly scaling the cosine similarities to the [0, 1] range; the similarity value of 0.6 in Table 4.4 therefore corresponds to a cosine similarity of 0.2. From Table 4.4, it can be seen that the precision is greater than 0.5 when the threshold similarity value is 0.6 (cosine similarity of 0.2), and that the precision drops below 0.5 when the similarity value is 0.55 (cosine similarity of 0.1).) In other words, if domainSimilarity(w) is greater than or equal to 0.2, the word w is considered domain generic and the attribute domainGeneric(w) is set to true; otherwise, the attribute domainSpecific(w) is set to true. Though we used the aforementioned approach to determine the threshold, it is a heuristic, domain specific value that can be decided using different experimental techniques when applying this methodology to another domain.

Even if a word behaves in a similar manner across the two domains, it can still be assigned a wrong (neutral) sentiment due to under representation. However, it is important to identify words with sentiment polarities (positive or negative), as descriptions with positive or negative sentiments tend to contain more specific information that is useful in legal analysis. As a means of identifying the sentiment polarities of under represented words, we made use of the AFINN [9] sentiment lexicon (denoted as the set A from this point onwards), which consists of 3352 words annotated according to their sentiment polarity (positive, neutral, negative) and sentiment strength in the domain of Twitter discussions. If the frequency of a word w is less than 3 in the source dataset, underRepresented(w) is set to true. Assigning the AFINN sentiment to an under represented or domain specific word w can create a positive impact if the most frequently used sense of w in the Twitter discussion domain is aligned more closely with its sense in the legal domain than with its sense in the movie review domain. In order to heuristically determine this, we defined an attribute named afinnSimilarity such that afinnSimilarity(w) = Cosine_t(w, l(w)) − Cosine_t(w, m(w)), where w is a given word and Cosine_t is the cosine similarity obtained using a publicly available Word2Vec model [41] trained on tweets. If Cosine_t(w, l(w)) > Cosine_t(w, m(w)), it can be assumed that the sense of the word w in Twitter discussions is closer to its sense in the legal domain than to its sense in the movie review domain. Thus, if afinnSimilarity(w) > 0 and w ∈ A, the attribute afinnAssignable(w) is set to true.
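The two attributes just defined can be sketched as follows. The model file name, the AFINN word set, and the frequency table are assumptions of this sketch; l_w and m_w stand for l(w) and m(w) obtained from the domain specific models above.

```python
# Sketch of the underRepresented(w) and afinnAssignable(w) attributes.
# `twitter_kv` is a Word2Vec model trained on tweets (as in [41]); the file
# name, the AFINN word set, and the frequency table are placeholders.
from collections import Counter
from gensim.models import KeyedVectors

twitter_kv = KeyedVectors.load("twitter_word2vec.kv")  # placeholder path
afinn_words = set()  # in practice: the 3352 words of the AFINN lexicon

def under_represented(w: str, source_freq: Counter) -> bool:
    """True when w occurs fewer than 3 times in the source (SST-5) dataset."""
    return source_freq[w] < 3

def afinn_similarity(w: str, l_w: str, m_w: str) -> float:
    """Cosine_t(w, l(w)) - Cosine_t(w, m(w)), measured in the Twitter space."""
    return float(twitter_kv.similarity(w, l_w) - twitter_kv.similarity(w, m_w))

def afinn_assignable(w: str, l_w: str, m_w: str) -> bool:
    """AFINN's sentiment is adoptable when the Twitter sense of w leans legal."""
    return w in afinn_words and afinn_similarity(w, l_w, m_w) > 0
```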
Algorithm 1, Algorithm 2, and Algorithm 3 are three parts of one major algorithmic approach, presented separately for readability. Therefore, the functions and attributes defined in Algorithm 1 apply globally to Algorithm 2 and Algorithm 3 as well. The states of the attributes after executing Algorithm 1 are transferred to Algorithm 2 and Algorithm 3; similarly, the states of the attributes after executing Algorithm 2 are transferred to Algorithm 3.

Algorithm 1 Functions

    procedure assignSentiment_o(w, sentiment)
        if sentiment == N then D_on ← D_on ∪ {w}, O_i ← O_i − {w}
        else if sentiment == P then D_op ← D_op ∪ {w}, O_i ← O_i − {w}
        end if
    end procedure

    procedure assignSentiment_n(w, sentiment)
        if sentiment == N then D_nn ← D_nn ∪ {w}
        else if sentiment == P then D_np ← D_np ∪ {w}, N_i ← N_i − {w}
        else if sentiment == O then D_no ← D_no ∪ {w}, N_i ← N_i − {w}
        end if
    end procedure

    procedure assignSentiment_p(w, sentiment)
        if sentiment == N then D_pn ← D_pn ∪ {w}, P_i ← P_i − {w}
        else if sentiment == P then D_pp ← D_pp ∪ {w}
        else if sentiment == O then D_po ← D_po ∪ {w}, P_i ← P_i − {w}
        end if
    end procedure

Algorithm 2 Assigning the Legal Sentiment

    P_i = P_m, N_i = N_m, O_i = O_m, D_on = {}, D_op = {}
    D_nn, D_np, D_no, D_pp, D_pn, D_po = {}
    n = 0, p = 0
    while 1 + |D_on| > n or 1 + |D_op| > p do
        n = 1 + |D_on|, p = 1 + |D_op|
        for each word w in O_i do
            l = mostSimilar_l(w)
            if underRepresented(w) and afinnAssignable(w) then assignSentiment_o(w, afinn(w))
            else if domainSpecific(w) and afinnAssignable(w) then assignSentiment_o(w, afinn(w))
            else if domainGeneric(l) and l ∈ N_m ∪ D_on then
                if notAntonym(w, l) then assignSentiment_o(w, N) end if
            else if domainGeneric(l) and l ∈ P_m ∪ D_op then
                if notAntonym(w, l) then assignSentiment_o(w, P) end if
            end if
        end for
    end while

Algorithm 3 Assigning the Legal Sentiment (continued)

    n = 0, p = 0
    while 1 + |D_nn| > n or 1 + |D_np| > p do
        n = 1 + |D_nn|, p = 1 + |D_np|
        Q = N_i ∪ D_on ∪ D_nn, R = P_m ∪ D_op ∪ D_np
        for each word w in N_i do
            l = mostSimilar_l(w)
            if domainGeneric(w) then assignSentiment_n(w, N)
            else if domainSpecific(w) and afinn(w) == N then assignSentiment_n(w, N)
            else if domainSpecific(w) and notAntonym(w, l) then
                if l ∈ Q then assignSentiment_n(w, N)
                else if domainGeneric(l) and l ∈ R then assignSentiment_n(w, P)
                end if
            end if
        end for
    end while
    for each word w in N_i do
        assignSentiment_n(w, O)
    end for
    n = 0, p = 0
    while 1 + |D_pp| > p or 1 + |D_pn| > n do
        p = 1 + |D_pp|, n = 1 + |D_pn|
        Q = N_m ∪ D_on ∪ D_pn, R = P_i ∪ D_op ∪ D_pp
        for each word w in P_i do
            l = mostSimilar_l(w)
            if domainGeneric(w) then assignSentiment_p(w, P)
            else if domainSpecific(w) and afinn(w) == P then assignSentiment_p(w, P)
            else if domainSpecific(w) and notAntonym(w, l) then
                if l ∈ R then assignSentiment_p(w, P)
                else if domainGeneric(l) and l ∈ Q then assignSentiment_p(w, N)
                end if
            end if
        end for
    end while
    for each word w in P_i do
        assignSentiment_p(w, O)
    end for
    P_l = D_op ∪ D_np ∪ D_pp, N_l = D_on ∪ D_nn ∪ D_pn

In the algorithms, P, N, and O denote positive, negative, and neutral sentiments respectively, and afinn(w) is the AFINN sentiment categorization of a given word w. It can be observed that the sentiment of l(w) is also considered when determining the correct sentiment of a word. For a word in O_m, the sentiment of l(w) will be assigned if l(w) is domain generic (Algorithm 2); this step serves as another way of identifying words with sentiment polarities (positive or negative). The sentiments of domain generic words in P_m or N_m are not changed under any condition. (A runnable sketch of the main loop of Algorithm 2 is given below.)
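As referenced above, the following is a minimal runnable sketch of the fixed-point loop of Algorithm 2. The helper predicates (domainGeneric, domainSpecific, underRepresented, afinnAssignable, notAntonym), the AFINN lookup, and the most-similar-word function are passed in as callables; their implementations follow the heuristics described earlier and are assumptions of this sketch.

```python
# Minimal runnable sketch of the fixed-point loop of Algorithm 2. O_m, P_m,
# and N_m are sets of words; all helper functions are supplied by the caller.
def run_algorithm_2(O_m, P_m, N_m, most_similar_l, domain_generic,
                    domain_specific, under_represented, afinn_assignable,
                    afinn, not_antonym):
    O_i, D_on, D_op = set(O_m), set(), set()

    def assign_sentiment_o(w, sentiment):
        # Algorithm 1: move w from O_i into D_on (negative) or D_op (positive).
        if sentiment == "N":
            D_on.add(w); O_i.discard(w)
        elif sentiment == "P":
            D_op.add(w); O_i.discard(w)

    n = p = 0
    # Repeat full passes until neither D_on nor D_op grows any further.
    while 1 + len(D_on) > n or 1 + len(D_op) > p:
        n, p = 1 + len(D_on), 1 + len(D_op)
        for w in list(O_i):
            l = most_similar_l(w)
            if under_represented(w) and afinn_assignable(w):
                assign_sentiment_o(w, afinn(w))
            elif domain_specific(w) and afinn_assignable(w):
                assign_sentiment_o(w, afinn(w))
            elif domain_generic(l) and l in (N_m | D_on):
                if not_antonym(w, l):
                    assign_sentiment_o(w, "N")
            elif domain_generic(l) and l in (P_m | D_op):
                if not_antonym(w, l):
                    assign_sentiment_o(w, "P")
    return D_on, D_op
```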
For a domain specific word w in P_m or N_m, if l(w) has an opposite sentiment polarity to that of w, the sentiment of l(w) will be assigned to w only if l(w) is domain generic. All the domain specific words in P_m or N_m that do not satisfy any of the conditions required to assign a positive or negative polarity (Algorithm 3) will be assigned a neutral sentiment. This step is taken because such domain specific words have a relatively higher probability of having opposite sentiment polarities in the legal domain, and are thus capable of transferring wrong information to the classification models [5]. Assigning a neutral sentiment reduces the impact of the negative transfer that can be caused by such words (a neutral sentiment is better than the opposite sentiment polarity). Furthermore, it should be noted that an antonym of a particular word w can be given as l(w) by the embedding model due to semantic drift. To tackle this challenge, WordNet [42] was used to check whether a given word w and l(w) are antonyms; if they are not antonyms, the notAntonym(w, l(w)) attribute is set to true. After running Algorithm 1, Algorithm 2, and Algorithm 3 with P_m, O_m, N_m as the inputs, the word sets D_on and D_op were obtained, consisting of the words the overall algorithm picked from O_m as having negative and positive sentiments respectively. D_on and D_op, together with P_m and N_m, were given to a legal expert in order to annotate the words in these sets based on their sentiments. |D_on| = 220 and |D_op| = 116, thus reducing the required number of annotations to 925 (925 = |W|, where W = D_op ∪ D_on ∪ P_m ∪ N_m). After the annotation process, three word sets N_a, O_a, P_a were obtained, containing the words annotated as having negative, neutral, and positive sentiments respectively. Then the word sets D_n, D_o, D_p were created such that D_n = {w ∈ W | w ∈ N_a ∧ w ∉ N_m}, D_p = {w ∈ W | w ∈ P_a ∧ w ∉ P_m}, and D_o = {w ∈ W | w ∈ O_a ∧ w ∉ O_m}. P_l contains the set of words identified by the overall algorithm as having a positive sentiment, and N_l contains the words identified as having a negative sentiment (without human intervention).

5.2.2 Fine Tuning the Recursive Neural Tensor Network Model

As an approach to developing a sentiment classifier for legal opinion texts, RNTN_m (the Stanford Sentiment Annotator) [10] was fine tuned following a methodology similar to that proposed by [4]. In the proposed methodology [4], there is no need to further train the RNTN_m model or to modify the neural tensor layer of the model; instead, the approach is purely based on replacing word vectors. In this approach, if a word v in a word sequence S has a deviated sentiment s_d in the legal domain when compared with its sentiment s_m as output by RNTN_m, the vector corresponding to v is replaced by the vector of a word u, where u is a word from a list of predefined words that has the sentiment s_d as output by RNTN_m. When choosing u from the list of predefined words, the PoS tag of v in the word sequence S is considered in order to preserve the syntactic properties of the language. For example, if we consider the phrase Sam is charged for a crime, as charged is a word that has a deviated sentiment, the vector corresponding to charged will be substituted by the vector of hated (hated is the word that matches the PoS of charged from the predefined word list corresponding to the negative class) [4]. (A sketch of this replacement step is given below.)
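As referenced above, the following is a minimal sketch of the PoS-aware vector replacement step. The embedding lookup table, the deviated-word map, the substitute word list, and the use of NLTK's PoS tagger are all assumptions of this sketch; they stand in for the RNTN model's internal word-vector table and the predefined word lists of [4].

```python
# Minimal sketch of the PoS-aware word-vector replacement described above.
# `embedding_table` stands in for the RNTN model's word-vector lookup, and
# `deviated_words` maps a word to its legal-domain sentiment ("P"/"N"/"O");
# both are assumptions of this sketch, as is the use of NLTK for PoS tagging.
import nltk

# Predefined word list for the negative class, keyed by PoS tag (after [4];
# the entries here are illustrative).
NEGATIVE_BY_POS = {"JJ": "bad", "VBN": "hated", "VBD": "hated", "NN": "failure"}

def replace_deviated_vectors(tokens, embedding_table, deviated_words):
    """Swap the vector of each deviated-sentiment word for the vector of a
    same-PoS word that already carries the legal sentiment in the model."""
    for word, pos in nltk.pos_tag(tokens):
        if deviated_words.get(word) == "N":
            substitute = NEGATIVE_BY_POS.get(pos)
            if substitute is not None:
                # The model itself is untouched; only the lookup row changes.
                embedding_table[word] = embedding_table[substitute]
    return embedding_table

tokens = "Sam is charged for a crime".split()
# e.g. replace_deviated_vectors(tokens, embedding_table, {"charged": "N"})
```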
When extending the approach proposed in [4] to three class sentiment classification, a predefined word list for the positive class was developed by mapping a set of selected words that have a positive sentiment in RNTN_m to each PoS tag. The mapping can be represented as a dictionary R, where R = {JJ: beautiful, JJR: better, JJS: best, NN: masterpiece, NNS: masterpieces, RB: beautifully, RBR: beautifully, RBS: beautifully, VB: reward, VBZ: appreciates, VBP: reward, VBD: won, VBN: won, VBG: pleasing}. For the negative class and the neutral class, the PoS-word mappings provided by [4] for the negative and non-negative classes were used respectively. Furthermore, instead of annotating each word in the selected vocabulary to identify words with deviated sentiments, we used the word sets D_n, D_o, D_p that were derived using the approaches described in Section 5.2.1. From this point onwards, the fine tuned RNTN model developed in this study is denoted as RNTN_l.

5.2.3 BERT based Approach for Legal Sentiment Analysis

An approach based on BERT_large embeddings [12] has achieved state of the art results for the sentiment classification of sentences in the SST-5 dataset. In order to adapt the same approach for our task, the following steps were followed. First, sentences with their sentiment labels were extracted from the SST-5 training set, which consists of 8544 sentences labelled for 5 class sentiment classification. As our focus is on 3 class classification, the sentiment labels in the SST training set were converted for 3 class sentiment classification by mapping the very positive and positive labels to positive, and the very negative and negative labels to negative. Next, following a methodology similar to that described in [12], canonicalization, tokenization, and special token addition were performed as the preprocessing steps. Then, the classification model was designed following the same model architecture described in [12], which consists of dropout regularization and a softmax classification layer on top of the pretrained BERT layer. Similarly to [12], the uncased BERT_large model was used as the pretrained model and, during the training phase, dropout with a probability of 0.1 was applied as a measure of preventing overfitting. Cross entropy loss was used as the cost function and stochastic gradient descent was used as the optimizer, with a batch size of 8. Then, the model was trained using the SST-5 training sentences. As information related to the number of training epochs could not be found in [12], we experimented with 2 and 3 epochs and calculated the accuracies on a test set of 500 legal sentences (Section 5.3.2). When trained for 2 epochs, the accuracy was 57%; for 3 epochs it reduced to 52%, possibly due to overfitting on the source data. Therefore, 2 was chosen as the number of training epochs. This model is denoted as BERT_m in the following sections.

In order to fine tune the BERT based approach to legal sentiment classification, the following steps were followed. First, we selected the sentences in the SST training data that contain words identified as having deviated sentiments (words in D_o ∪ D_p ∪ D_n). If the sentiment label of a sentence S that contains a deviated sentiment word w is different from the sentiment label assigned to w by the legal expert, then S is removed from the original SST training dataset as a measure of reducing negative transfer.
For example, if there is a sentence S containing the word charged and the sentiment of S is positive or neutral (the sentiment of charged is negative in the legal domain), then S will be removed from the training set. After removing such sentences, the training set was reduced to 6318 instances; this new training set is denoted by D from this point forward. Next, for each word w in D_n or D_p, we randomly selected 2 sentences containing w from the legal opinion text corpus. Then, the sentiments of the selected sentences were manually annotated by a legal expert. As |D_n| = 206 and |D_p| = 82, only 576 new annotations were needed (|D_o| = 230, but the words in D_o were not considered for this approach as they have a neutral legal sentiment). These 576 sentences from legal opinion texts were then combined with the sentences in D, creating a new training set L that consists of 6894 instances. The above mentioned steps were followed to remove negative transfer from the source dataset and to fine tune the dataset to the legal domain. Then, L was used to train a BERT based model using the same architecture, hyperparameters, and number of training epochs that were used to train BERT_m. The model obtained after this training process is denoted as BERT_l.

5.3 Experiments and Results

5.3.1 Identifying words with deviated sentiments across the source and target domains

In order to evaluate the effectiveness of the proposed algorithmic approach in identifying the legal sentiment of a word, we compared the positive word list (P_l) and the negative word list (N_l) identified by the algorithm with P_m and N_m respectively, as shown in Table 5.1. The way in which P_l and N_l were obtained is described in Algorithm 2 and Algorithm 3.

Table 5.1: Evaluating the word lists generated from Algorithm 1 and Algorithm 2

| Polarity | N_m (count) | N_l (count) | P_m (count) | P_l (count) | N_m (%) | N_l (%) | P_m (%) | P_l (%) |
|---|---|---|---|---|---|---|---|---|
| Negative | 154 | 317 | 17 | 20 | 61% | 80% | 5% | 7% |
| Neutral | 96 | 73 | 180 | 89 | 38% | 19% | 54% | 41% |
| Positive | 3 | 4 | 139 | 181 | 1% | 1% | 41% | 62% |
| Total | 253 | 394 | 336 | 290 | 100% | 100% | 100% | 100% |

It can be observed that the precision of identifying words with negative sentiments is 80% for the algorithmic approach, which is a 19% improvement over RNTN_m [10]. Furthermore, the number of correctly identified negative words has increased from 154 to 317. Though the precision of identifying words with positive sentiment is only 62%, there is an improvement of 21% when compared with RNTN_m. The precision of identifying words with positive sentiment is relatively low because most words that have a positive sentiment in generic language usage have a neutral sentiment in the legal domain. A more detailed analysis of the neutral class could not be performed due to the large number of words in O_m. Considering these results, it can be seen that the proposed algorithm shows promising results when it comes to determining the legal domain specific sentiment of a word.

Table 5.2: Precision (P), Recall (R) and F-Measure (F) obtained from the considered models

| Model | Negative (P/R/F) | Neutral (P/R/F) | Positive (P/R/F) | Accuracy |
|---|---|---|---|---|
| RNTN_m | 0.51/0.68/0.58 | 0.44/0.52/0.48 | 0.48/0.10/0.16 | 0.48 |
| RNTN_l (Improved) | 0.55/0.70/0.62 | 0.54/0.51/0.52 | 0.73/0.44/0.55 | 0.57 |
| BERT_m | 0.68/0.73/0.70 | 0.47/0.68/0.56 | 0.57/0.13/0.21 | 0.57 |
| BERT_l (Improved) | 0.72/0.79/0.75 | 0.58/0.55/0.57 | 0.70/0.62/0.66 | 0.67 |
Additionally, it implies that the proposed algorithmic approach is successful in identifying words that have different sentiments across the two domains. This approach can also be extended easily to other domains, as domain specific word embedding models can be trained using an unlabelled corpus. Furthermore, the proposed algorithmic approach has the potential to be used in the automatic generation of domain specific sentiment lexicons.

5.3.2 Sentiment Classification

To evaluate the effectiveness of the proposed approach to developing sentiment annotators for the legal domain, a test set was prepared. The test set contains 500 sentences obtained from legal opinion text documents, each annotated according to its sentiment in the legal domain. The following stepwise procedure was used to create this test set.

• From the corpus of legal opinion texts, 500 sentences were extracted randomly. During the sentence extraction process, it was ensured that this test set does not contain any of the sentences selected from the legal opinion texts to train the model BERT_l.
• Then, each of the 500 sentences in the test set was annotated based on its legal sentiment. The annotations were performed by a legal expert (a graduate of the Faculty of Law, University of Colombo).
• After the annotation process, there were 211 sentences annotated as having a negative legal sentiment, 121 sentences annotated as having a positive legal sentiment, and 168 sentences annotated as having a neutral sentiment.

For this experiment, RNTN_m was taken as the baseline for the Recursive Neural Tensor Network based approaches and BERT_m was taken as the baseline for the BERT based approaches. The annotations by the legal expert were taken as the ground truth. The results obtained for the models developed within this study (RNTN_l, BERT_l) as well as for the baseline models are shown in Table 5.2, where Precision, Recall and F-Score are denoted as P, R and F respectively. Based on the results shown in Table 5.2, the following observations can be made.

• The accuracy of the RNTN_m model is 0.48, while the RNTN_l model has achieved an accuracy of 0.57. In other words, our proposed approach to fine tune RNTN and similar models by replacing word embeddings has yielded an accuracy improvement of 9%.
• The accuracy of the BERT_m model is 0.57, while the BERT_l model has achieved an accuracy of 0.67. This accuracy improvement demonstrates the effectiveness of our approach to fine tune a source dataset for a target domain while using a minimum number of human annotations.
• The accuracy values obtained for BERT_m and RNTN_l are the same (0.57). It should be noted that BERT_m is trained purely on a dataset developed for the movie review domain, without fine tuning the dataset for the legal domain. This demonstrates that BERT based approaches are more effective than RNTN based approaches when it comes to sentiment analysis. However, it can also be observed that the recall and F-Score values achieved by the BERT_m model for the positive sentiment class are significantly low. In contrast, the RNTN_l model has demonstrated relatively consistent performance over all three classes (positive, neutral, and negative).
• The BERT_l model outperforms all other models considered in our experiments.
The BERT_l model was trained using the modified dataset obtained after fine tuning the SST-5 dataset for the legal domain using the approach proposed in Section 5.2.3. The BERT_l model has achieved an overall accuracy of 0.67, which is a 10% accuracy improvement over both the BERT_m model and the RNTN_l model.

• The state of the art accuracy reported in [12] for five class sentiment classification on the SST-5 dataset is 55.5%. Utilizing our proposed approaches, we have been able to achieve 67% accuracy for three class legal sentiment classification, which can be considered a satisfactory accuracy value given the complex nature of the language used in the legal domain. It should be noted that in order to achieve these results, only 576 newly annotated sentences were added to the training dataset.

The above observations clearly indicate that the transfer learning approach based on dataset fine tuning, as described in Section 5.2.3, is an effective mechanism for developing sentiment annotators for a low resource domain. As this mechanism is based on dataset fine tuning, it can be used with any state of the art machine learning or deep learning technique for sentiment analysis.

5.4 Discussion

In this research task, the main aim was to develop a reliable sentiment annotator to analyze the sentiments of textual information in legal opinion texts while minimizing the resource requirements. In other words, this chapter describes how a legal sentiment annotator can be developed in a low resource setting. In that regard, we made use of domain adaptation techniques, where annotated sentiment analysis datasets from the movie review domain were adapted to the legal domain. Within this work, we have demonstrated several techniques that can be used to mitigate the issues caused by negative transfer. Coming up with an algorithmic approach to automatically identify words whose sentiment in the target domain differs from their sentiment in the source domain can be considered a key contribution of this work in this direction. After identifying such words with deviated sentiments across the two domains, the algorithmic approach also provides a mechanism to assign the target domain sentiment to the identified words. The datasets prepared within this study to train and evaluate the sentiment analysis models have been made publicly available¹.

¹https://osf.io/zwhm8/

Chapter 6
CONCLUSION AND FUTURE WORK

As demonstrated in Chapter 5, the approaches developed within this work to build sentiment annotators for the legal domain in low resource settings have shown promising results, with significant improvements over existing works. The results further demonstrate that domain specific word embeddings can be effectively utilized to minimize the drawbacks caused by negative transfer when adapting a dataset from one domain to another. Also, the evaluations described in Chapter 5 demonstrate that the algorithmic approach proposed in this work to automatically identify words with deviated sentiments across the source and target domains, and to assign the appropriate target domain sentiment, is successful and effective.
Additionally, as detailed in Chapter 4, within this work we have also developed LeCoVe, a contextual verb similarity dataset for the legal domain, to overcome several drawbacks present in existing evaluation resources of this kind. Using LeCoVe, we have evaluated the effectiveness of the word representations of several word embedding and language modelling techniques when applied to the legal opinion text domain. During the process, we also created new legal domain specific Sense2Vec models and post-trained a BERT model using a corpus of legal opinion texts. Finally, the datasets developed to train and evaluate the sentiment analysis models in the legal domain, the verb similarity dataset LeCoVe, the legal domain specific Sense2Vec models, and the post-trained BERT model have been made publicly available to the research community.

As future work, we believe that the approaches developed within this study for sentence level and phrase level sentiment analysis of legal opinion texts can be extended for party based sentiment analysis in legal opinion texts. Also, the techniques described within this work to minimize negative transfer are developed in a way that allows them to be easily adapted to other domain adaptation tasks in Natural Language Processing. Moreover, LeCoVe can also be used to evaluate the effectiveness of other word representation and language modelling techniques such as ELMO and XLNet, which can be seen as another direction of future work related to this research.

References

[1] United States Supreme Court. Lee v. US. In Supreme Court, volume 137, page 1958. Supreme Court, 2018.

[2] Gathika Ratnayaka, Thejan Rupasinghe, Nisansa de Silva, Viraj Salaka Gamage, Menuka Warushavithana, and Amal Shehan Perera. Shift-of-perspective identification within legal cases. arXiv preprint arXiv:1906.02430, 2019.

[3] Yi-Hung Liu and Yen-Liang Chen. A two-phase sentiment analysis approach for judgement prediction. Journal of Information Science, 44(5):594–607, 2018.

[4] Viraj Gamage, Menuka Warushavithana, Nisansa de Silva, Amal Shehan Perera, Gathika Ratnayaka, and Thejan Rupasinghe. Fast approach to build an automatic sentiment annotator for legal domain using transfer learning. arXiv preprint arXiv:1810.01912, 2018.

[5] Raksha Sharma, Pushpak Bhattacharyya, Sandipan Dandapat, and Himanshu Sharad Bhatt. Identifying transferable information across domains for cross-domain sentiment classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 968–978, 2018.

[6] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558, 2010.

[7] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204, 2010.

[8] Margaret M Bradley and Peter J Lang. Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical report C-1, The Center for Research in Psychophysiology, 1999.

[9] Finn Årup Nielsen. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903, 2011.
[10] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.

[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[12] Manish Munikar, Sushil Shakya, and Aakash Shrestha. Fine-grained sentiment classification using BERT. In 2019 Artificial Intelligence for Transforming Business and Society (AITB), volume 1, pages 1–5. IEEE, 2019.

[13] Jack G Conrad and Frank Schilder. Opinion mining in legal blogs. In Proceedings of the 11th International Conference on Artificial Intelligence and Law, pages 231–236, 2007.

[14] Tomas Mikolov et al. word2vec. URL https://code.google.com/p/word2vec, 2013.

[15] Jeffrey Pennington et al. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.

[16] Andrew Trask et al. sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings. arXiv preprint arXiv:1511.06388, 2015.

[17] Matthew E Peters et al. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

[18] Zhilin Yang et al. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.

[19] Christopher Schröder and Andreas Niekler. A survey of active learning for text classification using deep neural networks. arXiv preprint arXiv:2008.07267, 2020.

[20] Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. A survey of data augmentation approaches for NLP. arXiv preprint arXiv:2105.03075, 2021.

[21] Vern R Walker et al. Semantic types for computational legal reasoning: propositional connectives and sentence roles in the veterans' claims dataset. In ICAIL, pages 217–226. ACM, 2017.

[22] Keet Sugathadasa, Buddhi Ayesha, Nisansa de Silva, Amal Shehan Perera, Vindula Jayawardana, Dimuthu Lakmal, and Madhavi Perera. Synergistic union of word2vec and lexicon for domain specific semantic similarity. In 2017 IEEE International Conference on Industrial and Information Systems (ICIIS), pages 1–6. IEEE, 2017.

[23] Vindula Jayawardana et al. Semi-supervised instance population of an ontology using word vector embedding. In ICTer, pages 1–7. IEEE, 2017.

[24] Vindula Jayawardana et al. Word vector embeddings and domain specific semantic based semi-supervised ontology instance population. ICTer, 10(1):1, 2017.

[25] Vindula Jayawardana et al. Deriving a representative vector for ontology classes with instance word vector embeddings. In INTECH, pages 79–84. IEEE, 2017.

[26] Keet Sugathadasa et al. Legal document retrieval using document vector embeddings and deep learning. In Science and Information Conference, pages 160–175. Springer, 2018.

[27] Felix Hill et al. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, 2015.

[28] Daniela Gerz et al. SimVerb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869, 2016.

[29] Eric H Huang et al. Improving word representations via global context and multiple word prototypes. In ACL, pages 873–882. Association for Computational Linguistics, 2012.

[30] Ray S Jackendoff.
Semantic interpretation in generative grammar. 1972.

[31] Beth Levin. English verb classes and alternations: A preliminary investigation. University of Chicago Press, 1993.

[32] Kevin D Ashley and Vern R Walker. Toward constructing evidence-based legal arguments using legal decision documents and machine learning. In ICAIL, pages 176–180. ACM, 2013.

[33] Christopher Manning et al. The Stanford CoreNLP natural language processing toolkit. In ACL, pages 55–60, 2014.

[34] Kristina Toutanova et al. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL-HLT, pages 173–180. Association for Computational Linguistics, 2003.

[35] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In ACL, pages 133–138. Association for Computational Linguistics, 1994.

[36] Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.

[37] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.

[38] C Van Rijsbergen. Information retrieval: theory and practice. In Proceedings of the Joint IBM/University of Newcastle upon Tyne Seminar on Data Base Systems, pages 1–14, 1979.

[39] Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, 2011.

[40] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[41] Fréderic Godin. Improving and Interpreting Neural Networks for Word-Level Prediction Tasks in Natural Language Processing. PhD thesis, Ghent University, Belgium, 2019.

[42] Christiane Fellbaum. WordNet. The Encyclopedia of Applied Linguistics, 2012.