ByteDance, the Chinese company behind TikTok, appears to be accelerating its collection of data from the Internet to train its generative artificial intelligence models. Since April, the company has deployed a web crawling bot called Bytespider, according to a study by Kasada, a firm specializing in bot management, that was reviewed by Fortune. The bot is one of the most aggressive on the Internet, scraping at a rate far above that of other major tech companies such as Google, Meta, Amazon, OpenAI, and Anthropic.
According to Sam Crowther, CEO of Kasada, Bytespider scrapes data at a rate 25 times higher than GPTBot, OpenAI’s scraper bot, and 3,000 times higher than ClaudeBot, the bot used by Anthropic. Over the last six weeks, Bytespider’s scraping activity has shown significant spikes, suggesting that ByteDance is redoubling its efforts to catch up in the generative AI race.
The Kasada study found that Bytespider does not respect robots.txt, an exclusion standard that tells bots which pages of a website they should not scrape. The aggressive scraping comes at a complicated time for ByteDance, as TikTok could be banned in the United States. In April, U.S. President Joe Biden signed a law that requires the company, on national security grounds, to either sell the app or shut it down.
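For readers unfamiliar with robots.txt, the following is a minimal sketch of how a compliant crawler would consult it, using Python's standard `urllib.robotparser` module. The rules, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and chosen purely for illustration; they do not come from the Kasada study or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only):
# Bytespider is barred from the whole site, every other bot only from /private/.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Bytespider", "GPTBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/notes"):
            print(agent, url, is_allowed(agent, url))
```

The check runs entirely on the crawler's side. Nothing in the standard forces a bot to perform it, which is why a crawler can simply ignore the file.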

Data collection on the Internet is not new, but the rise of generative AI has sparked controversy, especially over copyright infringement. Tech companies use bots to copy data and train their models. This worries and angers artists and content creators worldwide, who see large tech companies using their works without permission and without giving them anything in return.
ByteDance is rumored to be developing a new AI model that could be integrated into TikTok’s search function. The search function has been updated in recent months so that users can look up the most popular keywords in real time, which could help advertisers improve the visibility of their ads.