Cyber Intelligence

MEDUSA: A Suite for Scalable Web Mining and Cyber Intelligence

The MEDUSA Cyber Intelligence suite is a sophisticated, modular, highly configurable and scalable Web mining and intelligence platform. It leverages Artificial Intelligence and Big Data technologies to deliver real-time insights to non-IT domain experts, satisfying the multi-disciplinary needs of end-user organizations that require advanced Web crawling, processing and analytics services.

Scalable Web Crawling

  • highly configurable crawling engine that facilitates full crawl control: configuring one or more seed URLs, the overall crawl size and depth, the location of servers, the URL patterns, as well as the targeted content type and language
  • automated expansion of crawling to all variations of the seed URLs using other available “Top Level Domains” (TLDs)
  • automated expansion of crawling to all transformations of the seed URLs derived using hacking alphabets, such as Leet (1337)
  • configuration of the politeness (aggressiveness) policy of crawling, so as to remain stealthy and avoid detection
  • configuration of the revisit policy for each target website so as to capture its dynamic nature (including any creations, updates or deletions), enabling the continuous monitoring of the target
  • utilization of custom headers and/or cookies during crawling to impersonate real users or agents
  • anonymous crawling of the Dark Web via the Tor network, with the transparent usage of a fully integrated Tor proxy
  • capturing and fetching all objects and requests of the crawled website, including HTML, XML, CSS, JavaScript, binaries, images, videos
  • supporting logged-in, authenticated crawling of websites, marketplaces and forums, using the integrated scripting engine for authenticating to specific free and open source forums, such as myBB, phpBB, Simple Machines Forum, etc.
  • support for crawling social networks that expose APIs (e.g. Twitter), capturing rich data including users’ profiles, posts and multimedia content, and extracting entities such as posts’ hashtags and URLs
  • support for parallel crawling of thousands of target websites
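The seed-URL expansion across alternate TLDs and hacking alphabets can be sketched as follows. MEDUSA's actual substitution rules and TLD list are not published, so the `LEET_MAP` table and `EXTRA_TLDS` below are purely illustrative:

```python
from itertools import product

# Hypothetical Leet substitution table and TLD list (illustrative only).
LEET_MAP = {"a": ["a", "4"], "e": ["e", "3"], "i": ["i", "1"], "o": ["o", "0"]}
EXTRA_TLDS = [".com", ".net", ".org"]

def leet_variants(label: str) -> set[str]:
    """Return every Leet (1337) transformation of a domain label."""
    choices = [LEET_MAP.get(ch, [ch]) for ch in label]
    return {"".join(combo) for combo in product(*choices)}

def expand_seed(domain: str) -> set[str]:
    """Expand a seed domain across Leet variants and alternate TLDs."""
    label, _, _tld = domain.partition(".")
    return {variant + tld
            for variant in leet_variants(label)
            for tld in EXTRA_TLDS}
```

For example, `expand_seed("site.com")` yields candidates such as `s1t3.net`, which the crawler can then probe for typosquatted or mirrored targets.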

Multi-level Processing Pipeline

  • complete HTML code and text extraction from any webpage
  • metadata extraction (like EXIF tag structure) from a range of binary resources and file formats like PDF documents, image files, sound files, office documents, and many others
  • face detection for identifying human faces in digital images (and videos) utilizing pre-trained models
  • face recognition against a predefined set of “known” human faces, using a fully integrated deep neural network for enabling clustering, similarity detection and classification tasks
  • real-time multiple objects detection and recognition in digital images and videos, adopting neural networks technologies
  • nudity classification and detection of offensive / adult images
  • keyword and regular expression (REGEX) spotting, for extracting email addresses, telephone numbers, IP addresses, Bitcoin addresses, named geo locations and customizable dictionaries of risk terms
  • automated generation of knowledge graphs, representing relations and interactions among users, for specific forums (e.g. myBB, phpBB and Simple Machines forums)

Evidence Collection

  • automatic capturing of the browser’s Document Object Model (DOM) during crawling
  • electronic sealing and timestamping of the captured artefacts from the target website
  • offline browsing of the already crawled websites
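Electronic sealing can be sketched as binding each captured artefact to a content hash and a timestamp. This is a minimal illustration using a local clock; a real evidence-grade deployment would obtain the timestamp from a qualified timestamping authority (e.g. per RFC 3161):

```python
import hashlib
from datetime import datetime, timezone

# Illustrative sealing sketch: local UTC clock stands in for a
# trusted timestamping authority.
def seal_artifact(data: bytes) -> dict:
    """Bind a captured artefact to a SHA-256 digest and a UTC timestamp."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "sealed_at": datetime.now(timezone.utc).isoformat(),
    }
```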

Intuitive Analytics

  • semantically-enriched indexing, faceting and categorization of all data fetched from the crawled websites, allowing free-text search, keyword search, entity classification/correlation-based search, phrase search, complex search, geospatial search, term boosting, spell correction, auto-completion, etc.
  • automated query expansion, using an unsupervised neural network model that identifies words that occur in similar contexts and/or are also similar in meaning, enabling the natural representation of analogies with “human-like” semantic awareness
  • graphical query designer that allows the creation of complex queries in an easy, user-friendly way
  • supporting the creation, reuse and extension of query templates, for improving the efficiency and effectiveness of complex and/or repetitive operations
  • automated real-time diff analysis, spotting additions, modifications and deletions between two consecutive visits (crawls) of the same target website
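The diff analysis between two consecutive crawls can be modelled as a comparison of URL-to-content-hash mappings. This is a simplified sketch; the actual pipeline would derive the hashes from the sealed artefacts:

```python
def diff_crawls(prev: dict[str, str], curr: dict[str, str]) -> dict[str, list[str]]:
    """Classify pages as added, modified or deleted between two crawls.

    Each crawl is modelled as a mapping of URL -> content hash
    (illustrative simplification).
    """
    return {
        "added": sorted(set(curr) - set(prev)),
        "deleted": sorted(set(prev) - set(curr)),
        "modified": sorted(url for url in set(prev) & set(curr)
                           if prev[url] != curr[url]),
    }
```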

Built-In AI Support and Trained Models

  • incorporating a neural network for concept extraction, allowing multiple objects detection and recognition, in real-time, in digital images and videos
  • incorporating a neural network for face detection, in digital images and videos, that supports facial landmark detection, head pose estimation, facial action unit recognition, facial features extraction and eye-gaze estimation
  • incorporating a convolutional neural network for nudity assessment, automatically identifying that an image is not suitable/safe for work (NSFW) – including offensive and adult images
  • ability to train the aforementioned models with a custom media base, enhancing the face and object recognition capabilities of the suite
  • ability to define custom keywords and regular expressions for enriching the knowledge and information extraction capabilities from the crawled text

Integrated Network Tools

  • ping, checking host connectivity and reporting packet loss and latency
  • whois, finding out who owns a domain, when that domain expires, its registered name servers, contact details, etc.
  • dig, querying Domain Name System (DNS) servers
  • traceroute, displaying the route (path) and measuring transit delays of packets across an IP network
  • nmap, scanning networks to determine which hosts are alive in a network
  • nslookup, querying a DNS server for DNS data
  • reverse lookup, providing the domain name associated with a particular IP address (reverse DNS lookup)
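The integrated tools wrap the stock command-line utilities. A minimal sketch of such a wrapper is shown below; the flags are common defaults (e.g. `nmap -sn` for a host-discovery scan) and may need adjusting per platform:

```python
import subprocess

# Hedged sketch: composes argv lists for the stock CLI tools the suite
# integrates. Flags shown are common defaults, not MEDUSA's actual ones.
def build_command(tool: str, target: str) -> list[str]:
    """Compose the argv for a supported network tool."""
    commands = {
        "ping": ["ping", "-c", "4", target],   # 4 probes; reports loss/latency
        "whois": ["whois", target],
        "dig": ["dig", target],
        "traceroute": ["traceroute", target],
        "nmap": ["nmap", "-sn", target],       # host-discovery (ping) scan
        "nslookup": ["nslookup", target],
    }
    return commands[tool]

def run_tool(tool: str, target: str) -> str:
    """Execute the tool and capture its textual report."""
    result = subprocess.run(build_command(tool, target),
                            capture_output=True, text=True)
    return result.stdout
```

Passing argv as a list (rather than a shell string) avoids shell-injection risks when targets come from user input.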

Multi-modal Alerting

  • ability to configure triggering events, e.g. start/stop of a crawling task, detection of a new object/face/person, or detection of a new keyword
  • push notifications through email and SMS service

Reporting

  • filtering crawling results based on content type, media classification, geolocation, related cases, etc.
  • visual representation of crawling results (spider diagrams)
  • reconstructing social graphs and user activity for specific forums (e.g. myBB, phpBB and Simple Machines forums)
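Social-graph reconstruction from forum posts can be sketched as follows. The post field names (`author`, `reply_to`) are assumptions for illustration; the real pipeline derives interactions from the parsed forum structure:

```python
from collections import defaultdict

# Minimal sketch: an edge is added whenever one user replies to another.
# Field names are illustrative assumptions.
def build_social_graph(posts: list[dict]) -> dict[str, set[str]]:
    """Map each author to the set of users they interacted with."""
    graph = defaultdict(set)
    for post in posts:
        author, replied_to = post["author"], post.get("reply_to")
        if replied_to and replied_to != author:
            graph[author].add(replied_to)
    return dict(graph)
```

The resulting adjacency map feeds directly into the spider-diagram visualizations mentioned above.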

Interoperability with Third-party Tools

  • exposure of a well-defined application programming interface (API) for submitting crawling requests
  • integration with third-party legacy systems adopting the publish/subscribe (pub/sub) pattern
  • exportable results in multiple structured formats (XML, JSON, CSV, binary)
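Multi-format export can be sketched with a small serializer. The record schema shown is illustrative, and only the JSON and CSV branches are included here:

```python
import csv
import io
import json

# Illustrative multi-format exporter (JSON and CSV branches only).
def export_results(records: list[dict], fmt: str) -> str:
    """Serialize crawl results to the requested structured format."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```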