Examining the Data Utilization Issue Between DeepSeek and ChatGPT

Published: 2025-01-31

As AI technology develops, data utilization needs to be viewed not just as a conflict between specific companies, but as a broader concern about "AI data usage in general."

Recently, suspicions that “DeepSeek may have stolen information from ChatGPT” have become a topic of discussion. However, the word “stolen” may be misleading. “Steal” implies obtaining data without permission and using it improperly, but a calm look from technical and contractual perspectives raises a different set of issues. This article examines the relationship between DeepSeek and OpenAI (ChatGPT’s developer), the ethical and legal issues around data usage, and how AI as a whole should approach data collection.

Did DeepSeek Really “Steal” Information from ChatGPT?

First, let’s clarify the expression “steal”:

Possibility that DeepSeek Used OpenAI’s API
According to some reports, DeepSeek may have used OpenAI’s API. If so, the issue is whether that use complied with OpenAI’s terms of service.
Possibility of Using Data Obtained Through API for Training
OpenAI’s terms restrict using API outputs to train competing models. If DeepSeek obtained large amounts of data through the API and used it for model training, that could be a breach of those terms. However, whether this would amount to “plagiarism” or “illegal activity” requires legal judgment.
The Problem with the Word “Steal”
The word “steal” generally means obtaining data without permission and using it improperly. If DeepSeek accessed OpenAI’s API under a contract, the access itself was authorized, and the open question is whether the subsequent use complied with the terms. That would make the problem a possible breach of contract rather than theft, so “stolen” may not be the appropriate word.
ChatGPT Itself Also Uses Information from the World
While DeepSeek’s data usage is being scrutinized, it’s important to note that ChatGPT (and other AI models) similarly relies on vast amounts of data from around the world for training. ChatGPT’s training data is generally based on publicly available data on the internet, with attention paid to copyrighted content.

ChatGPT’s Data Collection Methods and Ethical Issues

ChatGPT’s Data Collection Methods
OpenAI’s models obtain training data through the following methods:

Public Datasets
- Data widely accessible on the internet (books, papers, Wikipedia, etc.), filtered with consideration for copyright.
Licensed Data
- Some data is used with commercially obtained licenses.
Human Annotation Data
- Data created by OpenAI researchers and annotators to improve model quality.

However, discussions continue regarding copyrighted content, and some media and creators have expressed concerns that “content was used for training without permission.”

Ethical Issues in AI Data Usage
In AI development, transparency in data collection and usage is a very important issue. Even if DeepSeek was improperly using OpenAI’s data, since ChatGPT itself utilizes information from the internet, it may not be fair to criticize DeepSeek one-sidedly.
Technological Evolution and Delayed Legal Framework
AI technology is developing very rapidly, and current laws and regulations are not keeping pace. As competition between companies intensifies, clearer rules are needed regarding which data can be legitimately used and which would be considered misuse.

Should We Really Be Concerned About “AI Data Usage in General”?

The issue between DeepSeek and ChatGPT should be viewed not as a conflict between specific companies, but as a problem of “how AI uses data” as a whole. Here are the challenges:

Transparency of AI Model Training Data
AI development companies need to clearly indicate what data they are using and how. Increased transparency leads to trust.
Establishing Fair Rules
Industry-wide rule-making is needed regarding which data can be legally used for training and under what conditions companies can utilize other companies’ technologies.
Balance Between Open Source and Closed Models
Competition between closed commercial models like OpenAI’s and open-source AI is healthy. However, if environments become excessively closed, technological progress may be hindered.
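
As a rough illustration of the transparency and rule-making points above, a training-data manifest could record where each source comes from and whether its terms allow training use. The following is only a minimal sketch: the SourceRecord fields, the category names, and the example entries are all hypothetical and do not describe any company's actual practice.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One entry in a hypothetical training-data manifest."""
    name: str               # human-readable source name (illustrative)
    category: str           # e.g. "public", "licensed", "human_annotation", "api_output"
    training_allowed: bool  # whether the source's terms permit training use


def summarize(manifest):
    """Count sources per category, e.g. for a public transparency report."""
    return dict(Counter(record.category for record in manifest))


def flag_restricted(manifest):
    """Return the names of sources whose terms do not permit training use."""
    return [record.name for record in manifest if not record.training_allowed]


# Hypothetical example data, not real sources.
manifest = [
    SourceRecord("encyclopedia dump", "public", True),
    SourceRecord("news archive", "licensed", True),
    SourceRecord("annotator dialogues", "human_annotation", True),
    SourceRecord("third-party model outputs", "api_output", False),
]

print(summarize(manifest))
print(flag_restricted(manifest))
```

A manifest like this does not settle the legal questions, but it makes them answerable: a reviewer can see at a glance which categories of data were used and which entries need a closer look at their terms.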

Conclusion

DeepSeek’s use of OpenAI’s API cannot in itself be called “stealing”; the real issue is compliance with contracts and terms of service. ChatGPT also learns from information on the internet, so ethical questions about data usage are a common challenge across the AI industry. What matters is not the conflict between specific companies but transparency in AI data usage and fair rule-making. As AI becomes more widely used, society will need clear rules for data usage that balance technological innovation with ethical considerations.