DScope

Architecture

Extended Tour of DScope

The following video is an extended version of the talk on DScope at USENIX Security 2023:

Architecture Overview

DScope Architecture

DScope is a distributed, interactive Internet telescope that can be deployed to public cloud providers. To achieve this, DScope consists of components that provision servers, collect and respond to inbound network traffic, aggregate the resulting data, and produce analyses and useful data products.

Provisioning

The main idea behind DScope's deployment is that, instead of measuring just a few IP addresses for a long period of time, we can achieve more representative coverage by measuring many IPs, each for a brief period of time. We do this by continually starting, running, and terminating Amazon EC2 instances in every region and availability zone globally. At any given time, roughly 300 servers are provisioned globally, with plans to increase this footprint over time. Each server runs for 10 minutes and then terminates. Within each availability zone, an AWS spot fleet replaces terminated instances with fresh ones. While DScope deploys across a variety of both ARM and x86 instance types, the most common is the ARM t4g.nano instance.
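As a rough illustration of this provisioning loop, the sketch below uses boto3 to request a small spot fleet in a single availability zone. The AMI ID, IAM role ARN, region, and target capacity are placeholders for illustration, not the values used in the actual deployment.

    # Sketch: maintain a small fleet of t4g.nano spot instances in one
    # availability zone. All identifiers below are hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.request_spot_fleet(
        SpotFleetRequestConfig={
            "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",  # placeholder
            "TargetCapacity": 4,      # instances to keep alive in this zone
            "Type": "maintain",       # replace instances as they terminate
            "LaunchSpecifications": [
                {
                    "ImageId": "ami-0123456789abcdef0",  # placeholder collector AMI
                    "InstanceType": "t4g.nano",
                    "Placement": {"AvailabilityZone": "us-east-1a"},
                }
            ],
        }
    )
    print(response["SpotFleetRequestId"])

Because each instance self-terminates after its 10-minute window, a fleet of type "maintain" causes AWS to launch replacements automatically, keeping the zone's target capacity filled.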

Collecting Traffic

To interactively collect network traffic across all TCP services, DScope leverages network address translation (NAT) within the Linux kernel to route all incoming TCP traffic to a single service. The DScope service then accepts these connections, collects the original connection information from the kernel, and can interact based on the nature of the connection. While DScope is capable of arbitrary interaction, including inferring client protocols and hosting (fake) application-layer services, our current deployment completes transport-layer (TCP) handshakes and emulates an unresponsive application-layer service. As a result, DScope receives initial application-layer traffic sent by clients. In some cases, like Telnet connections, this means DScope receives nothing. In others, such as HTTP, DScope receives the entire request.
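The sketch below shows one common way to implement this pattern on Linux, not necessarily the exact code DScope runs: an iptables NAT rule redirects every inbound TCP port to a single listener, and the listener recovers the original destination with the SO_ORIGINAL_DST socket option before waiting silently for the client's first application-layer bytes.

    # Redirect all inbound TCP traffic to one local port (run once, as root):
    #   iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports 8080
    #
    # Minimal listener that recovers the original destination after NAT.
    import socket
    import struct

    SO_ORIGINAL_DST = 80  # from <linux/netfilter_ipv4.h>

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 8080))
    srv.listen(128)

    while True:
        conn, peer = srv.accept()
        # sockaddr_in: 2-byte family, 2-byte port (network order), 4-byte address
        raw = conn.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
        port, addr = struct.unpack("!2xH4s8x", raw)
        print(f"{peer[0]} -> {socket.inet_ntoa(addr)}:{port}")
        # Complete the TCP handshake but stay silent at the application layer,
        # then record whatever the client chooses to send first.
        conn.settimeout(10)
        try:
            data = conn.recv(4096)
        except socket.timeout:
            data = b""
        conn.close()

In this arrangement the listener never speaks first, which matches the unresponsive application-layer behavior described above: protocols where the client speaks first (e.g., HTTP) yield full requests, while server-speaks-first protocols (e.g., Telnet) yield nothing.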

Each DScope instance logs a packet capture (pcap) of all interactions during its 10-minute window. In addition, when requests contain references to domain names, those domain names are resolved for historical reference. These resolved names are useful for contextualizing the configuration that actually caused a request, such as in the case of latent configuration. Each instance stores its pcaps and DNS metadata to an S3 bucket, and tens of thousands of these pcaps are produced each day.
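As an illustration of the DNS side of this step, the sketch below pulls Host headers out of a captured HTTP payload and records their current resolutions so they can be stored alongside the pcap. The function name and output format are ours, chosen for illustration.

    # Sketch: resolve any Host header seen in a captured HTTP request so the
    # A records can be stored next to the pcap. Names are illustrative.
    import json
    import re
    import socket
    import time

    HOST_RE = re.compile(rb"^Host:\s*([A-Za-z0-9.\-]+)", re.MULTILINE | re.IGNORECASE)

    def resolve_hosts(payload: bytes) -> list:
        records = []
        for match in HOST_RE.finditer(payload):
            name = match.group(1).decode()
            try:
                _, _, addrs = socket.gethostbyname_ex(name)
            except socket.gaierror:
                addrs = []
            records.append({"name": name, "addresses": addrs, "resolved_at": time.time()})
        return records

    # Example: metadata that would be written next to the pcap.
    print(json.dumps(resolve_hosts(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")))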

Aggregation

Each day, all of the pcaps and metadata collected by DScope are aggregated into a single file for later analysis. These files store metadata about all servers allocated on a given day and the individual network sessions recorded by each. Although aggregated into a single file, each server's collected traffic is kept as a separate network stream. This separation later allows large-scale analysis without the memory overhead of tracking all TCP sessions simultaneously.
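A minimal sketch of this kind of daily aggregation is shown below: it bundles one day's per-instance pcaps from S3 into a single archive while keeping each server's capture as its own member. The bucket name and key layout are assumptions for illustration, not DScope's actual storage layout.

    # Sketch: bundle one day's per-instance pcaps into a single archive,
    # keeping each server's traffic as a separate member. Bucket and prefix
    # names are hypothetical.
    import io
    import tarfile
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "dscope-captures"   # hypothetical bucket
    PREFIX = "2023-09-01/"       # one key prefix per collection day

    with tarfile.open("2023-09-01.tar", "w") as archive:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                info = tarfile.TarInfo(name=obj["Key"])
                info.size = len(body)
                archive.addfile(info, io.BytesIO(body))

Keeping each server's traffic as a separate member means later analysis can stream through the archive one capture at a time instead of reconstructing every TCP session at once.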

Analysis

In the original paper, we performed a variety of one-off analyses to evaluate the efficacy of DScope, optimize deployments, and reach initial conclusions about traffic phenomena. While those analyses covered a single week, one of the key strengths of vantage points like DScope is the ability to watch trends over time. For this, we have deployed an automated analysis pipeline that produces a variety of useful data products from the aggregated DScope data.

Each night, after DScope aggregates the traffic from the previous day, the automated analysis pipeline runs. The pipeline consists of a variety of steps that take input data (e.g., pcaps) and produce outputs. A dependency manager (similar to GNU make) then determines the order in which to run these steps to produce the various data products published on this site. Visualizations are also updated daily as data is collected. The analysis pipeline also pulls third-party data sources (e.g., NVD CVE data and Cisco Snort IDS rules) and propagates updates to the analyses that depend on them.
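The sketch below shows the general idea of make-style, dependency-driven ordering using Python's standard-library graphlib; the step names and dependency edges are invented for illustration and do not reflect the pipeline's actual task graph or tooling.

    # Sketch: run pipeline steps in dependency order, make-style.
    # Step names and dependencies are invented for illustration.
    from graphlib import TopologicalSorter

    # Each step maps to the set of steps it depends on.
    steps = {
        "aggregate_pcaps": set(),
        "fetch_cve_feed": set(),
        "fetch_snort_rules": set(),
        "classify_traffic": {"aggregate_pcaps", "fetch_snort_rules"},
        "vulnerability_report": {"classify_traffic", "fetch_cve_feed"},
        "update_visualizations": {"classify_traffic"},
    }

    def run(step: str) -> None:
        print(f"running {step}")  # stand-in for the real work

    for step in TopologicalSorter(steps).static_order():
        run(step)

Expressing the pipeline as a dependency graph also makes it easy to propagate third-party updates: when a feed such as the CVE data changes, only the steps downstream of it need to be rerun.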