In case you were worried that the current iteration of generative AIs is too nice and empathetic, scientists have got you covered – a new language model has been trained on the worst part of the internet: the Dark Web.
Given perhaps the funniest name yet, DarkBERT (yes, that’s actually its name) is a language model trained exclusively on Dark Web data so that it could be compared with its vanilla counterparts. The team behind it – reporting their findings in a preprint paper that has yet to undergo peer review – wanted to understand whether using the Dark Web as a dataset would give an AI better context on the language used there, making it more valuable to people wishing to trawl the Dark Web for research and to law enforcement fighting cybercrime.
It also did an extensive trawl of a place most humans don’t really want to go and indexed its various domains, so thanks for taking one for the team, DarkBERT.
The Dark Web is an area of the internet that Google and other search engines do not index, keeping the vast majority of people from ever going there. It is only accessible using specialized software such as the Tor browser, and as such it has gained quite the reputation for what goes on there. Urban legends tell of torture rooms, contract killers, and all sorts of horrific crimes, but the truth is that most of it is scams and other schemes to steal your data, stripped of the browser security protections we all take for granted. Still, the Dark Web is reportedly used by cybercrime networks to communicate anonymously, making it an extremely important target for law enforcement.
A team from South Korea set up a crawler to trawl through the Dark Web using Tor and return the raw text it found, then used that data to build a model that could make better sense of the language used there. Once done, they compared its performance against existing models, including RoBERTa and BERT.
The findings presented in the preprint show that DarkBERT outperformed the other models across all the datasets tested, though the margins were often slim. As all the models are built on a similar framework, broadly similar performance is to be expected, but DarkBERT’s edge showed on Dark Web text specifically.
So, what will DarkBERT be used for? Hopefully it won’t be given the nuclear launch codes, but the team expects it to be a powerful tool for scanning the Dark Web for cybersecurity threats, as well as for keeping tabs on forums to identify illicit activity.
Let’s just hope this doesn’t give OpenAI any ideas.
The preprint is available on arXiv.