How OpenAI’s bot crushed this seven-person company’s website ‘like a DDoS attack’


On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s ecommerce site was down. It looked like some kind of distributed denial-of-service attack.

He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site.

“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.”

OpenAI was sending “tens of thousands” of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions.

“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.

“Their crawlers were crushing our site,” he said. “It was basically a DDoS attack.”

Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models.

It sells the 3D object files, as well as photos – everything from hands to hair, skin, and full bodies – to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics.

Tomchuk’s team, based in Ukraine but also licensed in the U.S. out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags specifically telling OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, which have their own tags, according to its information page on its crawlers.)

Robots.txt, otherwise known as the Robots Exclusion Protocol, was created to tell search engine sites what not to crawl as they index the web. OpenAI says on its informational page that it honors such files when configured with its own set of do-not-crawl tags, though it also warns that it can take its bots up to 24 hours to recognize an updated robots.txt file.
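As a sketch, a robots.txt file that tells all three of OpenAI’s crawlers to stay away from an entire site (using the user-agent names from OpenAI’s crawler documentation) could look something like this:

```
# Block OpenAI's crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```

The file must be served at the root of the domain (e.g. `/robots.txt`), and compliance is entirely up to the crawler.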

As Tomchuk experienced, if a site isn’t properly using robots.txt, OpenAI and others take that to mean they can scrape to their hearts’ content. It’s not an opt-in system.

To add insult to injury, not only was Triplegangers knocked offline by OpenAI’s bot during US business hours, but Tomchuk also expects a jacked-up AWS bill thanks to all of the CPU and download activity from the bot.

Robots.txt also isn’t a failsafe. AI companies voluntarily comply with it. Another AI startup, Perplexity, rather famously got called out last summer by a Wired investigation when some evidence implied Perplexity wasn’t honoring it.

Each of these is a product, with a product page that includes several more photos. Used by permission. Image credit: Triplegangers

Can’t know for sure what was taken

By Wednesday, after days of OpenAI’s bot returning, Triplegangers had a properly configured robots.txt file in place, and also a Cloudflare account set up to block its GPTBot and several other bots he discovered, like Barkrowler (an SEO crawler) and Bytespider (TikTok’s crawler). Tomchuk is also hopeful he’s blocked crawlers from other AI model companies. By Thursday morning, the site hadn’t crashed, he said.

But Tomchuk still has no reasonable way to find out exactly what OpenAI successfully took or to get that material removed. He’s found no way to contact OpenAI and ask. OpenAI did not respond to TechCrunch’s request for comment. And OpenAI has so far failed to deliver its long-promised opt-out tool, as TechCrunch recently reported.

This is an especially tricky issue for Triplegangers. “We’re in a business where the rights are kind of a serious issue, because we scan actual people,” he said. With laws like Europe’s GDPR, “they cannot just take a photo of anyone on the web and use it.”

Triplegangers’ site was also an especially delicious find for AI crawlers. Multibillion-dollar-valued startups, like Scale AI, have been created where humans painstakingly tag images to train AI. Triplegangers’ site contains photos tagged in detail: ethnicity, age, tattoos vs. scars, all body types, and so on.

The irony is that the OpenAI bot’s greediness is what alerted Triplegangers to how exposed it was. Had it scraped more gently, Tomchuk never would have known, he said.

“It’s scary because there seems to be a loophole that these companies are using to crawl data by saying ‘you can opt out if you update your robots.txt with our tags,’” says Tomchuk, but that puts the onus on the business owner to understand how to block them.

Triplegangers’ server logs showed how ruthlessly an OpenAI bot was accessing the site, from hundreds of IP addresses. Used by permission.

He wants other small online businesses to know that the only way to discover if an AI bot is taking a site’s copyrighted assets is to actively look. He’s certainly not alone in being terrorized by them. Owners of other websites recently told Business Insider how OpenAI bots crashed their sites and ran up their AWS bills.

The problem grew in magnitude in 2024. New research from digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024 – that is, traffic that doesn’t come from a real user.

Still, “most sites remain clueless that they were scraped by these bots,” warns Tomchuk. “We now have to daily monitor log activity to spot these bots.”
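There’s no standard tool for that monitoring. As a minimal sketch, a script like the following could scan a combined-format web server access log (the path and the list of bot names here are assumptions, not anything Triplegangers described) and count hits from known AI crawler user agents:

```python
from collections import Counter

# User-agent substrings of known AI crawlers (extend as needed)
AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bytespider", "Barkrowler"]

def count_bot_hits(log_lines):
    """Count requests per AI-crawler user agent and per source IP."""
    hits = Counter()
    ips = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
                # First field of a combined-format log line is the client IP
                ips[line.split(" ", 1)[0]] += 1
                break
    return hits, ips

# Two illustrative log lines; in real use, read e.g. /var/log/nginx/access.log
sample = [
    '20.171.206.4 - - [11/Jan/2025:10:00:00 +0000] "GET /p/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '66.249.66.1 - - [11/Jan/2025:10:00:01 +0000] "GET /p/2 HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]
hits, ips = count_bot_hits(sample)
print(hits)  # only the GPTBot line matches
```

Running something like this daily, as Tomchuk describes, at least surfaces which crawlers are hitting the site and from how many IPs.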

If you think about it, the whole model operates a bit like a mafia shakedown: the AI bots will take what they want unless you have protection.

“They should be asking permission, not just scraping data,” Tomchuk says.
