AI Model Developer Accused of Illegally Scraping Data for Training in China

The information displayed in the AIM should not be reported as representing the official views of the OECD or of its member countries.

Chinese AI writing platform Benshen Zuowen accused former partner Xueersi of illegally scraping 2.58 million pieces of data to train its MathGPT and 'AI Writing Assistant.' Xueersi allegedly admitted to the scraping, but denied using the data in its products. The dispute centers on data rights and potential intellectual property violations. [AI generated]

Why's our monitor labelling this an incident or hazard?

The article describes how the AI model developer used 'crawler' technology to illegally access and cache data from a partner's app without authorization, which was then used to train or enhance their AI model. This unauthorized data scraping is a breach of intellectual property rights and data security obligations, constituting harm under the framework's category of violations of intellectual property rights and applicable law. The involvement of the AI system's development in this unlawful act directly links it to harm, qualifying this as an AI Incident rather than a hazard or complementary information. [AI generated]

AI principles

Privacy & data governance, Transparency & explainability, Accountability, Respect of human rights

Industries

Education and training, Consumer services

Affected stakeholders

Business

Harm types

Economic/Property, Reputational, Human or fundamental rights

Severity

AI incident

Business function:

Research and development, Citizen/customer service

AI system task:

Content generation, Interaction support/chatbots, Reasoning with knowledge structures/planning

Articles about this incident or hazard

Benshen Zuowen says Xueersi's AI large model stole its data; Xueersi responds that all data calls complied with the contract

2023-06-16

chinaz.com

Why's our monitor labelling this an incident or hazard?

The event involves AI systems (AI models and AI assistants) and concerns data usage for AI training. However, the dispute is about alleged unauthorized data access and use, which is currently contested and under legal review. No harm has been confirmed, whether a proven intellectual property violation or AI misuse. Therefore, this is not an AI Incident. It also does not describe a plausible future harm scenario beyond the dispute, so it is not an AI Hazard. The main content is about the ongoing legal and reputational dispute, which is complementary information about AI system use and governance issues.

Hard to guard against your own side! Benshen Zuowen blasts Xueersi's AI large model for using "crawlers" to steal data

2023-06-13

驱动之家

Why's our monitor labelling this an incident or hazard?

The article describes how the AI model developer used 'crawler' technology to illegally access and cache data from a partner's app without authorization, which was then used to train or enhance their AI model. This unauthorized data scraping is a breach of intellectual property rights and data security obligations, constituting harm under the framework's category of violations of intellectual property rights and applicable law. The involvement of the AI system's development in this unlawful act directly links it to harm, qualifying this as an AI Incident rather than a hazard or complementary information.

The first large-model infringement case: Xueersi may be sued for stealing data

2023-06-13

凤凰网（凤凰新媒体）

Why's our monitor labelling this an incident or hazard?

The article explicitly mentions that the AI system (MathGPT and AI assistant) was trained using data scraped without permission from a data owner, violating copyright and contractual terms. The harm is realized as the data owner is pursuing legal action for infringement, demanding apology, deletion of data, and compensation. The AI system's development and use directly led to this harm, fulfilling the criteria for an AI Incident under violations of intellectual property rights.

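Six years of work scraped more than 2 million times, yet only 1 yuan claimed? AI large model accused of "stealing" data; Xueersi's latest response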

2023-06-15

凤凰网（凤凰新媒体）

Why's our monitor labelling this an incident or hazard?

The event involves an AI system (MathGPT) whose development and use allegedly involved unauthorized data scraping from a partner's platform, leading to a violation of data rights and intellectual property. The scraping was admitted by the accused party's algorithm team, and legal action is underway. This constitutes a violation of applicable law protecting intellectual property and data rights, fulfilling the criteria for an AI Incident. Although the monetary claim is symbolic, the harm to rights and data protection is clear and realized. The AI system's role is pivotal as the data was scraped specifically for AI model development. Therefore, this is classified as an AI Incident.

The first case of stolen AI large-model data? Benshen Zuowen app says Xueersi scraped its data

2023-06-15

Techweb

Why's our monitor labelling this an incident or hazard?

The event describes a conflict over alleged unauthorized data scraping related to AI model training data, which could plausibly lead to intellectual property rights violations (a form of harm under AI Incident definition). However, since the allegations are not confirmed and no actual harm or breach has been demonstrated or proven, this situation represents a potential risk rather than a realized incident. The main focus is on the dispute and legal actions rather than confirmed harm or malfunction. Therefore, it fits best as an AI Hazard, reflecting a plausible future risk of harm due to unauthorized data use in AI development.

The first case of stolen AI large-model data? Details revealed

2023-06-17

app.myzaker.com

Why's our monitor labelling this an incident or hazard?

The event explicitly involves AI systems (AI-assisted writing software and AI large models) and concerns unauthorized data scraping used for AI training, which infringes on data rights and contractual obligations. This constitutes a violation of intellectual property rights and data security laws, fitting the definition of an AI Incident under category (c) violations of human rights or breach of obligations under applicable law protecting intellectual property rights. The harm is realized, not just potential, as the data was accessed and allegedly used without authorization, causing direct harm to the data owner. The dispute and legal implications further confirm the incident nature rather than a mere hazard or complementary information.

Benshen Zuowen says Xueersi's AI large model stole its data

2023-06-13

每日经济新闻

Why's our monitor labelling this an incident or hazard?

The event involves the use of AI systems, specifically an AI large model (MathGPT) and its new product, the 'AI Writing Assistant' (作文AI助手). The allegation is that the AI model was trained using data obtained illegally via web scraping, which constitutes a violation of intellectual property rights. Since the AI system's development involved unauthorized data use, this is a breach of obligations under applicable law protecting intellectual property rights, thus constituting an AI Incident.

Benshen Zuowen denounces Xueersi's AI large model, saying it used "crawler" technology to steal data

2023-06-13

金融界网

Why's our monitor labelling this an incident or hazard?

The event explicitly involves an AI system (AI large model) whose development used data obtained through unauthorized scraping, violating contractual and legal data rights. This constitutes a breach of intellectual property rights and data privacy, which is a recognized harm under the AI Incident definition. The involvement of the AI system in the harm is direct, as the data scraping was done to develop or improve the AI model. The event is not merely a potential risk or a complementary update but a realized harm involving AI system use and development.

The first case of stolen AI large-model data? Partner says it was "backstabbed" by Xueersi

2023-06-14

金融界网

Why's our monitor labelling this an incident or hazard?

The event involves AI systems (large language models for writing assistance) and concerns the unauthorized use of a large dataset of essays, which is alleged to have been stolen and used without permission. This constitutes a violation of intellectual property rights, a recognized harm under the AI Incident definition. The dispute is active, with claims of data theft and misuse directly linked to AI model training and deployment. The presence of realized harm (data theft and unauthorized use) and the involvement of AI systems in the development of AI-assisted writing tools justify classification as an AI Incident rather than a hazard or complementary information.

Benshen Zuowen says Xueersi's AI large model stole its data; Xueersi responds that it did not use any of its data

2023-06-14

163.com

Why's our monitor labelling this an incident or hazard?

The event involves AI systems in the context of large AI models and data usage for training. The allegation of unauthorized data scraping for AI training could constitute a violation of intellectual property rights if proven, which would be an AI Incident. However, since the accused party denies the use of the data and no confirmed harm or breach has been established, this remains a claim and dispute without confirmed realized harm. Therefore, this event is best classified as Complementary Information, as it provides context on a dispute related to AI data usage and intellectual property but does not confirm an AI Incident or Hazard at this stage.

2023-06-16

wap.stockstar.com

Why's our monitor labelling this an incident or hazard?

The event involves an AI system (AI large language models) and the unauthorized scraping and use of data for AI training, which is a direct violation of data rights and contractual obligations. This constitutes a breach of intellectual property rights and data security laws, which fits the definition of an AI Incident under violations of human rights or breach of obligations under applicable law intended to protect intellectual property rights. The harm is realized as the data owner claims infringement and damage to their data rights and commercial interests. The event is not speculative or potential harm but an actual incident with legal and commercial consequences. Therefore, it is classified as an AI Incident.

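Recently, Benshen Zuowen accused its former partner Xueersi of "stealing data" to train its own AI products, an accusation Xueersi then publicly denied. Benshen Zuowen calls it China's "first case of stolen AI large-model data."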

2023-06-16

证券之星

Why's our monitor labelling this an incident or hazard?

The event involves an AI system (AI large model) trained on data allegedly obtained without proper authorization, which could plausibly lead to violations of intellectual property rights and other legal harms if proven true. However, the article states that it is currently unclear whether any laws were broken, and no actual harm has been confirmed. The dispute and regulatory concerns highlight potential future risks related to AI training data legality and compliance. Therefore, the event fits the definition of an AI Hazard, as it plausibly could lead to an AI Incident but has not yet done so.

Xueersi accused of "stealing data" to train AI, exposing the "hidden corners" of large models

2023-06-16

21jingji.com

Why's our monitor labelling this an incident or hazard?

The article involves an AI system in the form of a large AI model trained on data allegedly obtained without proper authorization. However, the dispute is currently a legal and contractual disagreement over data use and intellectual property rights, with no confirmed illegal activity or harm caused by the AI system's use or malfunction. There is no indication that the AI system has directly or indirectly caused injury, rights violations, or other harms. The article also discusses regulatory frameworks and challenges related to AI training data compliance, which is complementary information about the AI ecosystem and governance. Therefore, this event is best classified as Complementary Information rather than an AI Incident or AI Hazard.

Compliance Tech: The first case of stolen AI large-model data? Xueersi and Benshen Zuowen start a public feud

2023-06-16

21jingji.com

Why's our monitor labelling this an incident or hazard?

The event involves AI systems (AI large models and AI-assisted writing software) and concerns the alleged unauthorized scraping and use of data for AI training, which is a misuse of AI system development or use. The dispute is about contract breach and data rights infringement, which are legal and ethical issues related to AI data sourcing. Although the article discusses potential violations of data security laws and intellectual property rights, it does not describe any actual harm or incident that has occurred yet. The event is a legal dispute and a warning about potential AI-related harms from unauthorized data use, fitting the definition of an AI Hazard rather than an AI Incident. It is not merely complementary information because the main focus is on the alleged unauthorized data scraping and its implications, not on responses or ecosystem context. It is not unrelated because AI systems and their data use are central to the event.

Six years of work scraped more than 2 million times, yet only 1 yuan claimed? AI large model accused of "stealing" data; Xueersi's latest response

2023-06-15

cnBeta.COM

Why's our monitor labelling this an incident or hazard?

The event describes an AI system (MathGPT) whose development involved unauthorized scraping of a competitor's data, violating data protection laws and intellectual property rights. This is a direct harm to the data owner and a breach of legal obligations protecting intellectual property and data rights. The AI system's role is pivotal as the scraped data is used to train the AI model. Therefore, this qualifies as an AI Incident due to the realized violation of rights and harm to the data owner. The event is not merely a potential risk or complementary information but a concrete incident involving AI misuse.