Julian Fell

Serverless Supercookies

Browser privacy has been getting a lot of attention lately. It’s getting harder and harder to navigate the web without leaking personal data to the gremlins lurking in the dark corners of your favourite website.

In some ways Safari leads the way in protecting user privacy: it disables third-party cookies by default, while in Chrome and Firefox that protection has to be explicitly enabled. The piece of this complex machinery I’m interested in is the browser sandboxing all storage within each domain, so that you can be identified when returning to a single site, but your identity can’t be linked to your activity elsewhere. This presents an issue for the people who want to track your online behaviour and build up a database of the sites you have visited in order to guess your demographics.

The ability to get around this little roadblock is extremely valuable to advertisers and analytics vendors, as it offers a competitive edge in a saturated market. Naturally, when the straightforward methods of identifying users are blocked, they get their developers to turn to more creative ones.

Fingerprinting is an option, but it is well known to be ineffective on homogeneous devices like iPhones on 3G connections. Imagine 30 people with the latest iPhone on the latest iOS, all on the same train to work in the morning. Fingerprinting is essentially useless for telling them apart, thanks to the minimal customisation available.

Okay, so the conventional browser storage methods are sandboxed and the devices themselves aren’t individual enough to be reliably differentiated. Where to from here? As you may have guessed from the title, the answer involves hacks built on an HTTPS-related protocol. Because why not.

A bit of background: the protocol we will be exploiting is HTTP Strict Transport Security (HSTS). The idea of this protocol is that a server can send back a header instructing the browser to access its domain via https the next time it is visited, even when the user explicitly requests the http version. In that case, the browser performs an internal 307 redirect, because it remembers having received this header from the domain in the past.
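For reference, the header itself is a single line in the server’s https response; the max-age value is the number of seconds the browser should remember the rule (31536000 is one year):

```
HTTP/1.1 200 OK
Strict-Transport-Security: max-age=31536000
```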

On the surface this sounds great. It helps people avoid accidentally visiting sites that hold sensitive information over an unencrypted connection. Terrific. The sticky part is that because this information is cached in the browser, it can be repurposed.

There are multiple ways to take advantage of this cache for user identification (see here for a more in-depth discussion), but the naive approach is to set up lots of domains, each with an endpoint that simply returns an empty response with the HSTS header set and doesn’t respond at all over http. For this example, let’s assume we own the domain sneaky-hsts.com and have set up 0.sneaky-hsts.com/api, 1.sneaky-hsts.com/api, …, 7.sneaky-hsts.com/api with this behaviour.

On the client we generate a hexadecimal user ID (let’s use 4F). With other methods we would just store this in a third-party cookie, or in local storage behind a cross-domain iframe, so we could pluck it out later. In this case we convert it to binary (01001111) and let each bit represent one of our 8 domains. If the nth bit is a 1, we send a request to https://n.sneaky-hsts.com/api; if it’s a 0, we don’t.

The next time this device loads our script, it can attempt to load the http version of each of our domains. The domains that were contacted earlier will be accessed via https (because of our HSTS cache) and the others will fail to connect, as our servers won’t respond to http requests. Based on which requests succeed, we can now reconstruct our user ID: failed requests are zeros and successful requests are ones.

Doing it for reals

Phew, that was a super quick rundown of the mechanics of this method for identifying users across multiple domains. Let’s get to a real-life implementation using AWS Lambda, the serverless infrastructure of the future!

The first step is to set up a wildcard SSL certificate for *.sneaky-hsts.com using Certificate Manager. Next, we need a single Lambda function that returns a max-age value for our HSTS header.
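A minimal sketch of what that function might look like, assuming the Node.js runtime (the handler and field names here are my own):

```javascript
// Hypothetical Lambda handler. It doesn't set any headers itself; it
// just returns a maxAge field that API Gateway will later map onto
// the Strict-Transport-Security response header.
// In a real deployment this would be exported as exports.handler.
function handler(event, context, callback) {
  // One year, a commonly recommended HSTS max-age.
  callback(null, { maxAge: 'max-age=31536000' });
}
```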

The key that is returned isn’t special in itself, but we can easily map it to the response headers through API Gateway, which routes HTTP requests to the Lambda function. It needs to be configured to map max-age to the correct header and to set lenient CORS headers.
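As a sketch, the integration response header mappings would pull the field out of the Lambda’s response body and set a permissive CORS origin, along these lines (the maxAge field name matches the hypothetical Lambda sketch above; static values are quoted, per API Gateway convention):

```
Strict-Transport-Security: integration.response.body.maxAge
Access-Control-Allow-Origin:  '*'
```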

Annoyingly, API Gateway doesn’t work with wildcard custom domains (but still allows you to enter them into the console) so I had to configure a custom domain for every. individual. domain. Go on, get clicking (note the 0, 1 and 2 subdomains in the screenshot).

Finally, Route53 can route each subdomain to the corresponding CloudFront distribution created by API Gateway. The fruits of our configuration labour should now look something like this, with our poor little friend being tracked out in front.

Essentially, CloudFront is pretending to be lots of domains so we can store lots of bits in the browser’s HSTS cache (1 bit per domain). Now that we have the infrastructure to store our user IDs, all we need is client code exposing a cross-domain ID store to our javascript tag.

First, we need a few helper functions to generate IDs and convert between hex and binary representations. Note that I’m using 4-bit IDs for brevity, but this approach can easily be extended to more bits.
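Something along these lines would do the trick (a sketch with function names of my own choosing; a 4-bit ID is a single hex digit):

```javascript
// Generate a random 4-bit ID as a single hex digit, e.g. 'a'.
function generateId() {
  return Math.floor(Math.random() * 16).toString(16);
}

// 'f' -> [1, 1, 1, 1] — one array entry per domain/bit.
function hexToBits(hex) {
  return parseInt(hex, 16)
    .toString(2)
    .padStart(4, '0')
    .split('')
    .map(Number);
}

// [0, 1, 0, 0] -> '4' — the inverse of hexToBits.
function bitsToHex(bits) {
  return parseInt(bits.join(''), 2).toString(16);
}
```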

Next, we can define some functions for setting and checking bits against the servers we have set up.
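A sketch of those two operations, assuming the numbered-subdomain scheme from earlier (again, the names are my own): setting a bit primes the HSTS cache over https, and checking a bit probes over plain http, which only succeeds if the browser internally upgrades the request.

```javascript
// Build the probe URL for the nth bit, e.g. bitUrl(3, 'https')
// -> 'https://3.sneaky-hsts.com/api'.
function bitUrl(index, scheme) {
  return scheme + '://' + index + '.sneaky-hsts.com/api';
}

// Store a 1: an https request caches the HSTS header for this subdomain.
function setBit(index) {
  return fetch(bitUrl(index, 'https'), { mode: 'no-cors' });
}

// Read a bit: the server never answers plain http, so this request only
// succeeds when the HSTS cache rewrites it to https — i.e. a stored 1.
function checkBit(index) {
  return fetch(bitUrl(index, 'http'), { mode: 'no-cors' })
    .then(function () { return 1; })
    .catch(function () { return 0; });
}
```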

Finally, we can define some (very primitive) functions for setting and getting IDs purely through the HSTS cache. These could be written more elegantly, but they clearly show how it all works.
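A rough sketch, assuming helper functions with the (hypothetical) names hexToBits, bitsToHex, setBit and checkBit, behaving as described above, are in scope:

```javascript
// Write a hex ID into the HSTS cache: fire one priming https request
// for every 1 bit, and do nothing for the 0 bits.
function setId(hexId) {
  var requests = hexToBits(hexId)
    .map(function (bit, index) { return bit ? setBit(index) : null; })
    .filter(Boolean);
  return Promise.all(requests);
}

// Read the ID back: probe every bit position over http and reassemble
// the resulting 1s and 0s into a hex digit.
function getId() {
  return Promise.all([0, 1, 2, 3].map(function (i) { return checkBit(i); }))
    .then(bitsToHex);
}
```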

As we can only store a limited number of bits with this method, doing anything useful with this identification would involve saving the IDs server-side, along with whatever information you can glean from users’ activity across the web.

Not so fast…

It is important that the ethical implications of purposely circumventing user privacy settings aren’t lost in the haze of shiny new things. This is a particularly crafty way of getting the information advertisers want, and it uses bleeding-edge tech to accomplish it easily, but it is still an invasion of user privacy: it actively circumvents the security settings enforced by the user’s choice of browser. Privacy and security issues need to be made visible so that users are equipped to protect themselves from them, so hopefully reading this has made someone think a little harder about the part they play. User or developer, like it or not, this applies to us all.