In today’s data-rich world, companies, governments and individuals want to analyze anything and everything they can get their hands on – and the World Wide Web has loads of information. At present, the most easily indexed material from the web is text. But as much as 89 to 96 percent of the content on the internet is actually something else: images, video and audio, which together span thousands of different kinds of nontextual data types.
Further, the vast majority of online content isn’t available in a form that’s easily indexed by electronic archiving systems like Google’s. Rather, it requires a user to log in, or it is provided dynamically by a program running when a user visits the page. If we’re going to catalog online human knowledge, we need to be sure we can get to and recognize all of it, and that we can do so automatically.
How can we teach computers to recognize, index and search all the different types of material that’s available online? Thanks to federal efforts in the global fight against human trafficking and weapons dealing, my research forms the basis for a new tool that can help with this effort.
Understanding what’s deep
The “deep web” and the “dark web” are often discussed in the context of scary news or films like “Deep Web,” in which young and intelligent criminals are getting away with illicit activities such as drug dealing and human trafficking – or even worse. But what do these terms mean?
The “deep web” has existed ever since businesses and organizations, including universities, put large databases online in ways people could not directly view. Rather than allowing anyone to get students’ phone numbers and email addresses, for example, many universities require people to log in as members of the campus community before searching online directories for contact information. Online services such as Dropbox and Gmail are publicly accessible and part of the World Wide Web – but indexing a user’s files and emails on these sites does require an individual login, which our project does not get involved with.
The “surface web” is the online world we can see: shopping sites, businesses’ information pages, news organizations and so on. The “deep web” is closely related but less visible, both to human users and, in some ways more importantly, to search engines exploring the web to catalog it. I tend to describe the “deep web” as those parts of the public internet that:
- Require a user to first fill out a login form, or
- Present images, video and other information in ways that aren’t typically indexed properly by search services.
The “dark web,” by contrast, consists of pages, some of which may also have “deep web” elements, that are hosted by web servers using the anonymous web protocol called Tor. Originally developed by U.S. Defense Department researchers to secure sensitive information, Tor was released to the public in 2004.
Like many secure systems, such as the WhatsApp messaging app, Tor was originally built for good purposes, but it has also been used by criminals hiding behind the system’s anonymity. Some people run Tor sites handling illicit activity, such as drug, weapons and human trafficking, and even murder for hire.
The U.S. government has been interested in trying to find ways to use modern information technology and computer science to combat these criminal activities. In 2014, the Defense Advanced Research Projects Agency (more commonly known as DARPA), a part of the Defense Department, launched a program called Memex to fight human trafficking with these tools.
Specifically, Memex aimed to create a search index that would help law enforcement identify human trafficking operations online, in particular by mining the deep and dark web. One of the key systems used by the project’s teams of scholars, government workers and industry experts was one I helped develop, called Apache Tika.