Searching Deep And Dark: Building A Google For The Less Visible Parts Of The Web

A geographical map depicting hotbeds of dark web activity related to illegal products. Larger circles indicate more activity. Christian Mattmann, CC BY-SA

Danielle Andrew 09 Jan 2017, 16:50

The ‘digital Babel fish’

Tika is often referred to as the “digital Babel fish,” a play on a creature called the “Babel fish” in the “Hitchhiker’s Guide to the Galaxy” book series. Once inserted into a person’s ear, the Babel fish allowed her to understand any language spoken. Tika lets users understand any file and the information contained within it.

When Tika examines a file, it automatically identifies what kind of file it is – such as a photo, video or audio. It does this with a curated taxonomy of information about files: their name, their extension, a sort of “digital fingerprint.” When it encounters a file whose name ends in “.MP4,” for example, Tika assumes it’s a video file stored in the MPEG-4 format. By directly analyzing the data in the file, Tika can confirm or refute that assumption – all video, audio, image and other files must begin with specific codes saying what format their data is stored in.
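To make the idea concrete, here is a minimal sketch in Python of checking a file's name against its leading bytes. It is an illustration of the general technique, not Tika's own code: the tiny SIGNATURES and EXTENSIONS tables and the function names are assumptions made for this example, and the "ftyp" check is a simplification, since that marker is shared by several related video formats.

```python
import os

# A handful of well-known file signatures: (byte offset, marker, media type).
# Illustrative only; a real detector knows far more formats than this.
SIGNATURES = [
    (0, b"%PDF-", "application/pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (4, b"ftyp", "video/mp4"),  # simplification: other ISO media formats share this
]

# What the file name alone suggests.
EXTENSIONS = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".mp4": "video/mp4",
}


def guess_from_name(filename):
    """Return the media type implied by the extension, or None."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext)


def detect_from_bytes(data):
    """Return the media type implied by the leading bytes, or None."""
    for offset, magic, mime in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return mime
    return None


def identify(filename, data):
    """Compare the name-based assumption with what the bytes actually say."""
    assumed = guess_from_name(filename)
    detected = detect_from_bytes(data)
    return {
        "assumed": assumed,
        "detected": detected,
        "confirmed": assumed is not None and assumed == detected,
    }
```

Calling identify("clip.MP4", data) on a file that actually begins with a PDF header would report the assumption as refuted, which is the confirm-or-refute step described above.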

Once a file’s type is identified, Tika uses specific tools to extract its content, such as Apache PDFBox for PDF files or Tesseract for capturing text from images. In addition to content, other forensic information, or “metadata,” is captured, including the file’s creation date, who edited it last, and what language the file is authored in.
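The "right tool for each file type" idea can be sketched as a lookup table of extractors. The Python below is a toy version under stated assumptions: the EXTRACTORS registry, the parse function and the two stand-in extractors (plain text and HTML, using only Python's standard library) are invented for illustration and are not Tika's API. A real system would plug in tools such as PDFBox or Tesseract in the same way.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_plain(data):
    """Stand-in extractor for plain text files."""
    return data.decode("utf-8", errors="replace")


def extract_html(data):
    """Stand-in extractor for HTML: keep the text, drop the markup."""
    collector = _TextCollector()
    collector.feed(data.decode("utf-8", errors="replace"))
    collector.close()
    # Collapse runs of whitespace so the output reads as one line of text.
    return " ".join(" ".join(collector.parts).split())


# Hypothetical registry: media type -> function that extracts its text.
EXTRACTORS = {
    "text/plain": extract_plain,
    "text/html": extract_html,
}


def parse(data, mime_type):
    """Pick the extractor for this type and return content plus basic metadata."""
    extractor = EXTRACTORS.get(mime_type)
    if extractor is None:
        raise ValueError(f"no extractor registered for {mime_type}")
    return {
        "content": extractor(data),
        "metadata": {"Content-Type": mime_type, "Content-Length": len(data)},
    }
```

The metadata here is limited to what can be read off the raw bytes. Details like a creation date or last editor would come from format-specific extractors, which this sketch does not attempt.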

From there, Tika uses advanced techniques like Named Entity Recognition (NER) to further analyze the text. NER identifies proper nouns and sentence structure, and then fits this information to databases of people, places and things, identifying not just whom the text is talking about, but where, and why they are doing it. This technique helped Tika to automatically identify offshore shell corporations (the things); where they were located; and who (people) was storing their money in them as part of the Panama Papers scandal that exposed financial corruption among global political, societal and technical leaders.
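A heavily simplified version of that matching step can be sketched in a few lines of Python. This is not how Tika or any production NER system works internally: it simply treats runs of capitalized words as candidate proper nouns and looks them up in a small, made-up GAZETTEER of people, places and things. Every name in that table is hypothetical.

```python
import re

# Hypothetical lookup table of known entities; all entries are invented.
GAZETTEER = {
    "Jane Doe": "PERSON",
    "Acme Holdings": "ORGANIZATION",
    "Panama": "PLACE",
}

# Candidate proper nouns: one or more consecutive capitalized words.
CANDIDATE = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def find_entities(text):
    """Return (phrase, label) pairs for candidates found in the gazetteer."""
    entities = []
    for match in CANDIDATE.finditer(text):
        phrase = match.group(0)
        label = GAZETTEER.get(phrase)
        if label is not None:
            entities.append((phrase, label))
    return entities
```

Given the sentence "Jane Doe moved funds to Acme Holdings in Panama.", the sketch labels a person, an organization and a place. It will miss anything not in its table and can be thrown off by a capitalized word at the start of a sentence, which is why real systems rely on trained statistical models rather than a bare lookup.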

Tika extracting information from images of weapons curated from the deep and dark web. Stolen weapons are classified automatically for further follow-up.

Identifying illegal activity

Improvements to Tika during the Memex project made it even better at handling multimedia and other content found on the deep and dark web. Now Tika can process and identify images with common human trafficking themes. For example, it can automatically process and analyze text in images – a victim alias or an indication about how to contact them – and certain types of image properties – such as camera lighting. In some images and videos, Tika can identify the people, places and things that appear.

Additional software can help Tika find automatic weapons and identify a weapon’s serial number. That can help to track down whether it is stolen or not.

Employing Tika to monitor the deep and dark web continuously could help identify human- and weapons-trafficking situations shortly after the photos are posted online. That could stop a crime from occurring and save lives.

Memex is not yet powerful enough to handle all of the content that’s out there, nor to comprehensively assist law enforcement, contribute to humanitarian efforts to stop human trafficking or even interact with commercial search engines.

It will take more work, but we’re making it easier to achieve those goals. Tika and related software packages are part of an open source software library available on DARPA’s Open Catalog to anyone – in law enforcement, the intelligence community or the public at large – who wants to shine a light into the deep and the dark.


Christian Mattmann, Director, Information Retrieval and Data Science Group and Adjunct Associate Professor, USC and Principal Data Scientist, NASA

This article was originally published on The Conversation. Read the original article.
