Cephalopod autopsies? Nope, today’s article is about conducting forensics on a Squid web proxy/cache. Just as complicated, but less smelly.
Chances are pretty good that you’re reading this page through a web proxy right now, especially if you’re in an enterprise environment. Web proxying and caching have become increasingly popular, for both filtering traffic and speeding up requests. Even consumer ISPs have latched onto the idea (sometimes using similar techniques to insert ads into pages as they are downloaded). That means your web surfing history is probably being recorded in a proxy log somewhere.
Web proxy and cache servers are untapped gold mines for forensic analysts. They often record the web browsing history for an entire organization, all rolled up into one directory. Web caching servers also contain copies of pages themselves, for a limited time.
This is great for forensic analysts (and not so hot from a privacy perspective). Investigators can examine web browsing histories for everyone in an organization all at once. Moreover, it’s possible to reconstruct web pages from the cache. Right now, investigators often simply visit web sites in order to see what they are. This has some serious drawbacks: first, there is no guarantee you’re seeing what the end user saw earlier; and second, your surfing now appears in the server’s activity logs. If the owner of the server is an attacker or suspect, you may well have just tipped them off. It’s much better to first examine the web cache to see what you can find stored locally.
To learn more, I installed Squid, a popular web proxy/cache server, on my lab network and dissected it. There are a number of tools out there that will reconstruct client browsing history based on the access logs. I really liked squidview (which has a Kismet-style interface) and sarg (which generates clickable HTML reports).
What I didn’t find was public information or tools for reconstructing pages from the web cache. It’s definitely possible. The proxy cache, by its very nature, stores the pages you view on its local hard drive and may later serve those pages to you or someone else. The precise pages it stores and the length of time they are retained vary depending on the specific server configuration and usage.
As a forensic analyst, I wanted to recover those cached pages. I figured, if Squid could do it, so could I.
By changing Squid’s configuration to “offline” mode, you can use wget to extract some pages directly from the local cache. This is handy because it reconstructs the pages automatically, if they exist. However, I wanted to see what information was stored directly in the cache, and access associated headers and metadata.
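As a rough sketch, the offline-mode trick looks like this (the config path, port, and URL are assumptions from a default Ubuntu install; adjust to your environment):

```shell
# In squid.conf (often /etc/squid/squid.conf), tell Squid to prefer its
# cache over the network:
#
#   offline_mode on
#
# Reload the configuration, then mirror through the proxy with wget; pages
# already in the cache are served locally rather than re-fetched:
squid -k reconfigure
http_proxy=http://127.0.0.1:3128 wget --recursive --page-requisites http://example.com/
```

Note that this reaches back through Squid itself, so you get reassembled pages but not the raw cache metadata.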
Squid’s access log is straightforward: it’s essentially a text file which contains a list of client IP addresses and pages accessed. If you correlate these with DHCP and central authentication logs, you can potentially match web surfing activity to a particular network card or user.
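The one wrinkle is that Squid's native log format starts each line with a Unix timestamp, which is easier to correlate with other logs once converted. A small helper illustrates the idea (hypothetical; assumes GNU date and the native access.log layout, with the timestamp in field 1, client IP in field 3, and URL in field 7):

```shell
#!/bin/bash
# Print "date time client-ip url" for each line of a Squid access.log
# given on stdin or as file arguments.
squid_log_summary() {
    awk '{
        # Field 1 is a Unix timestamp with milliseconds; convert via GNU date
        cmd = "date -u -d @" int($1) " \"+%F %T\""
        cmd | getline ts; close(cmd)
        print ts, $3, $7
    }' "$@"
}
```

Usage: `squid_log_summary < /var/log/squid/access.log`, then join against DHCP or authentication logs on the IP and timestamp.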
The cache directory is far more mysterious. If you simply list the directory contents, here is what you will see:
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F swap.state
Daunting. That swap.state file is Squid’s database, which contains a record of every item in the cache. It’s a binary file. If you delete it while Squid isn’t running, Squid will actually re-create it the next time it starts up. (This is helpful if you’re trying to manually edit the Squid cache in order to create lab exercises for, oh, a new class on network forensics.)
Finally, each of those eight-character files contains – yes! – the pages actually cached by Squid. Here is an example. When you surf to a web page, Squid adds some metadata to the top, which includes the full URI and its MD5 hash. Squid then stores this, along with the full HTTP reply (headers and body), as a file in one of these subdirectories. If the page is requested later, Squid can look it up in swap.state and serve it from disk.
Now let’s extract some content directly from the cache.
Let’s say we’re analyzing web traffic associated with 192.168.1.26. We come across the following entry in Squid’s access.log:
1239739309.653 377 192.168.1.26 TCP_MISS/200 30348 GET http://finickypenguin.files.wordpress.com/2007/10/1161451564593.jpg – DIRECT/18.104.22.168 image/jpeg
Interesting… What is this image? Let’s see if it’s in the cache.
We could analyze swap.state, but I created my own table of the URIs stored in Squid, along with their corresponding cache files. This was for two reasons: first, I didn’t have to rely on the accuracy of Squid’s database; and second, I’m a lazy bum and it’s pretty easy to do using a simple Bash script. The URI is stored near the beginning of each cached page, just after the MD5sum of the URI. If you grep for strings beginning with “http” in the first few lines of each cache file, you’ll find it.
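My script was along these lines (a sketch, not the original; the cache path is an assumption from a default install, and the printable-string scan stands in for `strings`):

```shell
#!/bin/bash
# Build a tab-separated table of cache-file -> URI for a Squid cache
# directory. Relies on the URI appearing as a printable "http..." string
# near the start of each cached object, just after the MD5 of the URI.
list_cached_uris() {
    local cache_dir=$1
    find "$cache_dir" -type f ! -name swap.state | while read -r f; do
        # Scan only the first 512 bytes: replace non-printable bytes with
        # newlines, then grab the first line beginning with "http"
        uri=$(head -c 512 "$f" | tr -c '[:print:]' '\n' | grep -m1 '^http')
        [ -n "$uri" ] && printf '%s\t%s\n' "$f" "$uri"
    done
}
```

Usage: `list_cached_uris /var/spool/squid > uri_table.txt`, which gives you a grep-able index independent of swap.state.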
Here’s that file we were looking for:
Now let’s open up that cache file. Running strings on it, we see the following metadata and header info:
Lots of juicy info there. To extract the image itself, let’s open this up in a hex editor. I like to use “bless” on Ubuntu. JPEG images begin with “FFD8,” so extracting this content is fairly easy. Highlight everything before the magic number, click “Cut” and save as 0000036A-edited.jpg.
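The same cut can be scripted instead of done by hand in a hex editor. This sketch (a hypothetical helper, not part of Squid) finds the first occurrence of the JPEG magic number and keeps everything from there to the end of the file:

```shell
#!/bin/bash
# Carve a JPEG out of a Squid cache file: locate the first FF D8 magic
# number and write the rest of the file from that offset onward.
carve_jpeg() {
    local in=$1 out=$2 off
    # grep -abo reports the 0-based byte offset of each match, even on
    # binary input; take the first one
    off=$(LC_ALL=C grep -abo -m1 $'\xff\xd8' "$in" | cut -d: -f1)
    [ -n "$off" ] || { echo "no JPEG header found in $in" >&2; return 1; }
    # tail -c +N is 1-indexed, hence the +1
    tail -c +"$((off + 1))" "$in" > "$out"
}
```

Matching on the longer signature FF D8 FF would cut down on false positives, but the two-byte magic matches what we're looking for here.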
A quick check with “file” confirms that we got it right:
$ file 0000036A-edited.jpg
0000036A-edited.jpg: JPEG image data, JFIF standard 1.01
Now let’s open it up:
Looks pretty suspicious to me…
|PGP-signed text: 2009-04-18 (current)|