Iruda developer exposed "Science of Love" conversations containing real names on GitHub

Science of Love

It has been confirmed that ScatterLab, the developer of the artificial intelligence chatbot Iruda, published roughly 100 conversation records extracted from its Science of Love service, without de-identification, in an open-source project repository that anyone could access; the repository has recently been deleted. The data obtained from Science of Love was used for purposes beyond what users had sufficiently consented to, and personal information was exposed as well.

According to industry sources on the 13th, ScatterLab released its "Sentence Generation Module (KG-CVAE-based)" project as open source on the code-hosting service GitHub in 2019.

A sentence generation module lets a computer process natural language and generate answers to questions. AI chatbots such as Iruda are representative applications.
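To make the concept concrete, the sketch below shows the basic loop of such a module: condition a pretrained language model on a user utterance and return the continuation as the reply. This is an illustration only, using the Hugging Face transformers library with GPT-2 as a stand-in model; it is not ScatterLab's KG-CVAE implementation.

```python
# Minimal sketch of a sentence-generation module (illustrative only):
# a pretrained language model continues the user's utterance, and the
# continuation is returned as the chatbot's reply.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in model

def generate_reply(utterance: str, max_new_tokens: int = 30) -> str:
    # Generate a continuation of the prompt, then strip the prompt off.
    result = generator(utterance, max_new_tokens=max_new_tokens,
                       num_return_sequences=1)
    return result[0]["generated_text"][len(utterance):].strip()

print(generate_reply("What did you eat for lunch?"))
```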

ScatterLab released the project so that anyone could use it, reproducing the training from the related research and extending it to languages such as Korean.

The problem is that conversation data extracted from Science of Love, bundled with the project as its training dataset, was published without any de-identification measures. The issue was first raised by a researcher in the TensorFlow Korea community on Facebook. The project repository has since been removed from GitHub.

In the project, ScatterLab stated that the source of the dataset was Science of Love.

In the project's README, ScatterLab described the source of the dataset: "For Korean, conversation data extracted from Science of Love was used."

Using conversation data obtained from Science of Love for the project without sufficient user consent is a problem in itself, but the matter is far more serious because personal information in those conversations was exposed as-is.

When this publication directly examined part of the dataset, it confirmed that real names were exposed verbatim, in sentences such as "Eat the fried rice~ XX, quickly."

Opening the dataset confirmed that real names were included.

The researcher who raised the issue pointed out that "across the 100 records in the dataset, unfiltered real names appeared 20 times," and stressed that "not only real names but also place names and disease information were found."

ScatterLab had already been criticized for failing to properly de-identify personal information while using Science of Love user data to train its chatbot. The issue came to light when the chatbot brought up what appeared to be real names and real addresses in conversations with users.

When that problem was raised, ScatterLab said, "Because it is difficult for a person to inspect 100 million sentences individually, they were mechanically filtered by an algorithm, and some instances were missed."
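That answer points to a known weakness of purely mechanical filtering: a rule-based masker only catches the patterns it anticipates. The sketch below, using hypothetical patterns that are not ScatterLab's actual filter, shows how a regex-based de-identifier can mask names attached to an honorific suffix and phone numbers, while a bare first name in the middle of a sentence slips through, the failure mode the researcher described.

```python
# Illustrative sketch of naive rule-based de-identification
# (hypothetical patterns; not ScatterLab's actual pipeline).
import re

# Mask 2-3 hangul syllables when followed by an honorific suffix.
NAME_PATTERN = re.compile(r"[가-힣]{2,3}(?=님|씨)")
# Mask Korean mobile phone numbers.
PHONE_PATTERN = re.compile(r"01[016789]-?\d{3,4}-?\d{4}")

def deidentify(sentence: str) -> str:
    sentence = PHONE_PATTERN.sub("[PHONE]", sentence)
    return NAME_PATTERN.sub("[NAME]", sentence)

# A name with an honorific suffix is caught...
print(deidentify("민수씨 전화번호는 010-1234-5678이야"))
# ...but a bare first name passes through unmasked, the failure
# mode observed in the leaked dataset.
print(deidentify("볶음밥 먹어~ 민수 빨리"))
```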

That excuse hardly applies here, however: this project's dataset contained only about 100 records, yet ScatterLab still failed to de-identify the personal information in them. It is fair to say the company has handled personal information protection with a rather lax attitude.

When asked about the matter, ScatterLab said it was preparing an official statement.




