|Category||Society, Legal, Computer Science|
|Keywords||NLP, Data Science|
Legal language in Israel is a unique blend of several languages (Aramaic, Latin, etc.), idioms, professional terms, acronyms, and unconventional word usage. Legal texts tend to be complex and contain many special entities, such as references, citations, and abbreviations. This uniqueness makes legal texts hard to read and process automatically. For example, the standard legal statement below is practically impossible to parse for a language model trained on everyday Hebrew texts:
In addition, collecting the legal texts is itself a challenge: they are spread across various databases, come in different file types and structures, and the systems that hold them differ from one another. The information is also saturated with metadata, which further complicates any automated solution. Providing a solution requires a breakthrough in legal language monitoring.
The project consists of two major stages:
The first phase will focus on collecting the documents and their metadata, while simultaneously organizing the information into a database that will allow the training of language models. The database will include over five million public judgments that are technically available but not currently accessible for language model training. We intend to extract information from about 20 databases, including the Judicial Network, the Judiciary Database, the Israeli Law Book, Law Memoranda and Draft Legislation, Judgments of the Supervisors of Land Registration, and Decisions of the Custody and Court. At the end of this phase, one or more transformer-based legal language models will be developed as an extension of the Legal HeBERT model, as detailed below.
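To make the idea of a single training-ready database more concrete, here is a minimal sketch of how records from differently structured sources could be mapped onto one shared schema and written out as JSON Lines. Everything in it is an assumption made for illustration: the `LegalDocument` fields, the raw keys `id`, `type`, and `body`, and the source name are hypothetical and do not describe the project's actual data model.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class LegalDocument:
    """Hypothetical shared schema for documents from any source database."""
    doc_id: str
    source: str
    doc_type: str
    text: str
    metadata: dict = field(default_factory=dict)


def normalize(raw: dict, source: str) -> LegalDocument:
    """Map one raw, source-specific record onto the shared schema."""
    # Collapse whitespace runs that are typical of PDF/HTML extraction.
    text = " ".join(str(raw.get("body", "")).split())
    # Keep every unrecognized key as metadata instead of discarding it.
    known = {"id", "type", "body"}
    metadata = {k: v for k, v in raw.items() if k not in known}
    return LegalDocument(
        doc_id=f"{source}:{raw['id']}",  # prefix avoids ID clashes across sources
        source=source,
        doc_type=raw.get("type", "unknown"),
        text=text,
        metadata=metadata,
    )


def to_jsonl(docs) -> str:
    """Serialize documents as JSON Lines, keeping Hebrew characters readable."""
    return "\n".join(json.dumps(asdict(d), ensure_ascii=False) for d in docs)
```

Prefixing the identifier with the source name is one simple way to keep IDs unique once roughly 20 databases are merged. A one-record-per-line format is a common choice for feeding large corpora into language model training pipelines.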
The second phase will focus on labeling entities within their frameworks, for the purpose of performing Named Entity Recognition (NER) tasks in a legal context. With this information we will identify names of commercial companies and associations; names of the people involved in the process (judges, lawyers, prosecutors, defendants, and witnesses); and citations of laws, regulations, and judgments, as well as finances, dates, addresses, and specialties.
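As an illustration of what labeled entities might look like as training data, the sketch below converts character-offset annotations into token-level BIO tags, a common encoding for entity labeling tasks. It assumes whitespace tokenization and hypothetical label names (`PER`, `LAW`); the project's actual annotation scheme and tokenizer are not specified here. Whitespace splitting is also likely too coarse for Hebrew, where prefixes attach directly to words.

```python
import re


def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans into BIO tags.

    `end` is exclusive. Tokens are whitespace-delimited, and a token is
    tagged only if it lies entirely inside an annotated span.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"  # default: outside any entity
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                # The first token of a span gets B-, later tokens get I-.
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags
```

For example, with the invented English sentence `"Judge Cohen cited the Contracts Law"`, a `PER` span over "Cohen", and a `LAW` span over "Contracts Law", the function returns the tags `O, B-PER, O, O, B-LAW, I-LAW`.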
The data corpus will be made available as open source for use in commercial and academic research and in further product development.