
  • Is GDPR relevant to newswire text corpora?

    In an industry-oriented research project funded by the European Regional Development Fund, our research lab is creating a multilayer text corpus of the Latvian language. The text corpus is being annotated at several layers of syntactic and semantic analysis: treebank, named entities, coreferences, frame semantics, etc. The project proposal was assessed and approved (receiving top scores) by EU experts.

    According to the terms of the funding program, the results of the project must have the potential to be commercialized (so-called knowledge and technology transfer). To meet this requirement, our plan (as stated in the project proposal) is to distribute the language resource ("data set") under a dual licence:

    1. CC BY-NC-SA 4.0 for non-commercial use;
    2. individual licence agreements for commercial use (with the same terms and for the same symbolic fee for all commercial users).

    The full data set (a work-in-progress version) is already publicly available on GitHub (https://github.com/LUMII-AILab/FullStack); we expect commercial users to apply for a commercial licence in good faith. We are also planning to distribute the data set via the ELRA and LDC catalogues, as well as META-SHARE, CLARIN, and ELRC (if relevant).

    The first potential customer, a language technology company from the USA, has contacted us, and they are willing to sign a commercial licence agreement. We have prepared a draft licence agreement (I have attached it just in case); however, its approval has stalled on my side.

    Since the text corpus is partially based on public newswire texts (~60%), my administration has consulted some local GDPR experts, because the newswire text units (mostly isolated paragraphs taken at random from random articles) contain mentions of various persons. These local experts have concluded that it is most likely illegal (w.r.t. GDPR Article 6 "Lawfulness of processing") to distribute such language resources for commercial use, so my administration says "no".

    No common-sense arguments have been helpful so far:

    • Neither other research groups nor commercial companies will re-distribute the language resource together with their prototypes and products. They will derive neural, statistical, or rule-based language models from the original language resource. The derived language models will be their intellectual property and will not contain any content of the original language resource that is subject to IPR or GDPR.
    • I have consulted researchers from other European universities who are working on similar language resources. None of them has faced such an obstacle. However, they suggested a reasonable approach: there has to be an option for the personal data subjects to have their data removed (or anonymised), should they request it. Note that we cannot mass-anonymise the text corpus, because it will also be used for training automatic named entity recognizers.
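The per-request option mentioned in the second point could be sketched roughly as follows. This is only an illustration under stated assumptions: the `TextUnit` and `Mention` structures, the `"PERSON"` label, and the `pseudonymise` helper are hypothetical and do not reflect the actual format of our corpus. It also assumes non-overlapping mentions and exact surface-form matching, which is a simplification for an inflected language such as Latvian.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mention:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # hypothetical entity label, e.g. "PERSON"


@dataclass
class TextUnit:
    text: str
    mentions: List[Mention] = field(default_factory=list)


def pseudonymise(unit: TextUnit, name: str, placeholder: str = "[PERSON]") -> TextUnit:
    """Return a copy of `unit` in which every PERSON mention whose surface
    form equals `name` is replaced by `placeholder`.

    All mention spans are kept (with recomputed offsets), so the unit
    remains usable as named entity training data.
    Assumption: mentions do not overlap.
    """
    parts: List[str] = []
    new_mentions: List[Mention] = []
    cursor = 0   # position in the original text
    length = 0   # length of the rewritten text so far

    for m in sorted(unit.mentions, key=lambda m: m.start):
        # Copy the untouched text between the previous mention and this one.
        gap = unit.text[cursor:m.start]
        parts.append(gap)
        length += len(gap)

        surface = unit.text[m.start:m.end]
        if m.label == "PERSON" and surface == name:
            surface = placeholder

        parts.append(surface)
        new_mentions.append(Mention(length, length + len(surface), m.label))
        length += len(surface)
        cursor = m.end

    parts.append(unit.text[cursor:])
    return TextUnit("".join(parts), new_mentions)
```

Only the mentions of the requesting person are replaced; all other text and annotations stay intact, which is why this differs from mass-anonymisation.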

    I'm also wondering whether the term "processing (of personal data)" applies to text corpora at all. In developing the corpus, we are not intentionally collecting data and facts about particular persons. Neither we nor the users of this data set will use it as a source of personal data (apart from the general machine learning task of named entity recognition).

    Is it indeed the case that GDPR does not allow us to distribute this language resource for commercial use, i.e., for deriving language models for commercial applications?

    Best regards,

    Normunds Gruzitis
    Head of Artificial Intelligence Lab
    Institute of Mathematics and Computer Science
    University of Latvia

     

    Normunds Gruzitis at Aug 08, 2018 Legal