Extending Privacy

Sam Macbeth’s blog

The Dat protocol enables peer-to-peer sharing of data and files across the web. Like similar technologies such as IPFS and BitTorrent, it allows clients to validate the data they receive, so one can know that the data has been replicated correctly. In contrast to those, however, Dat also supports modification of the resources at a specific address, with fast updates propagated to peers. Other useful properties include private discovery, which allows data shared privately on the network to remain private.

These features have led to a movement to use it as a new protocol for the web, with the Beaker browser pushing innovation around what this new peer-to-peer web could look like. The advantages of using Dat for the web are manifold:

  • Offline-first: Every site you load is implicitly kept locally, allowing it to be navigated when offline. Similarly, changes to sites (both local and remote) propagate when connectivity is available, but the site works the same either way.
  • Transparent and censorship-resistant: Sites are the same for every user: the site owner cannot change site content based on your device or location, as is common on the current web. As sites are entirely published in Dat and there is no server-side code, all the code running the site can be seen with 'view source'.
  • Self-archiving: Dat versions all mutations of sites, so as long as at least one peer keeps a copy of this data, the history of the site remains accessible and viewable. This can also keep content online after the original publisher stops serving it.
  • Enables self-publishing: As servers are no longer required, anyone can publish a site with Dat, with no hosting or devops needed. Publishing to the P2P web requires no payment, no technical expertise, and no platform lock-in.
  • Resilient: Apps and sites stay up as long as people are using them, even if the original developers have stopped hosting.

The Beaker browser already demonstrates all of these features, but as an Electron-based app it lacks some of the security features, depth of configuration and extensibility of a fully-fledged browser. For this reason I wanted to explore how we could bring these features to Firefox, enabling access to the Dat web for the low cost of installing a browser extension. (Also, as I work for a company building a fork of Firefox, I have a vested interest in getting this working in the browser I develop.)

This article is split into two parts. The first part describes dat-fox, a Firefox extension that provides the best Dat support possible given the current limitations of the WebExtensions APIs. The second describes the process and challenges of creating dat-webext, a Firefox extension which uses experimental APIs from the libdweb project to build full Dat protocol support into a WebExtension, and which is currently bundled with the Cliqz Browser nightly build.

There were three main challenges to building Dat support in an extension:

  1. Running Dat in an extension context. Dat is currently only implemented for Node.js (though a Rust implementation is on the way), and uses APIs such as net and dgram which have no analogues in the web stack. This means that we need to find a way to run this implementation in a WebExtension, or find alternative ways of communicating with other peers to fetch the content of Dat sites.
  2. Adding new protocols to the browser, so that it can understand an address starting with dat://. This then has to be wired up to the Dat implementation to return the correct content for that URL, so that it can be rendered in the browser.
  3. Adding new web APIs for Dat. In Beaker, a new API, DatArchive, was proposed, which allows pages to programmatically read the contents of Dat sites. For sites where the user is the owner and has write permissions, this API also allows writes. This API is innovative because it enables self-mutating sites, and it has spawned various 'Dat apps' which behave like many modern web apps, yet have no server; a small example follows below.
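
To give a flavour of what this enables, a page script using the DatArchive API might read and write an archive roughly like this (a sketch; the URL is a placeholder):

async function demo() {
  // Load an archive by its dat:// URL
  const archive = new DatArchive('dat://example.com');
  // Read a file from the archive
  const html = await archive.readFile('/index.html', 'utf8');
  // Writes succeed only if the user owns the archive
  await archive.writeFile('/hello.txt', 'hello p2p web');
}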

First Attempt: Dat-fox

Early last year, inspired by the whitelisting of p2p protocols for use with the WebExtensions protocol handlers API, I started dat-fox to build dat:// support in a WebExtensions-compatible extension. Unfortunately, the current APIs are severely limiting, meaning that all three of the above challenges could only be partially solved.

WebExtensions allow protocol handlers to be specified in their manifest; however, these function as simple redirects. To render content under these handlers, an HTTP server is still required, either on the web or running locally. As we also cannot run an HTTP server inside the extension, the APIs necessitate the use of an external process that serves the content for dat:// URLs.
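
For illustration, such a handler registration in the extension manifest looks roughly like the following (a sketch: the name and the redirect target page are assumptions, not dat-fox's actual values):

"protocol_handlers": [
  {
    "protocol": "dat",
    "name": "Dat",
    "uriTemplate": "/redirect.html#uri=%s"
  }
]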

Dat-fox implements a dat:// protocol handler which redirects to a local process, launched via the native messaging API. This separate process, written in Node.js, manages syncing with the Dat network, and acts as an HTTP gateway server so that the browser can load dat:// pages when redirected.

[Figure: dat-fox protocol handling via the local gateway]

A further challenge for dat-fox was ensuring that the origins of dat:// pages were correct, and that URLs looked correct when browsing. Each Dat archive served as a website expects to be on its own origin. For example, the page dat://example.com/test.html should have the origin example.com. This is important both for the browser's security model, so that localStorage is not shared between sites, and for calculating relative paths to files from a specific document. A naive implementation of the protocol handler might redirect the browser to http://localhost/example.com/test.html. However, this page would then have the incorrect origin localhost, which could break links in the rendered page.

We solved this issue by tricking the browser into loading pages like http://example.com/test.html via the gateway. Using a PAC file, which allows the browser to dynamically send traffic to different proxies, we can intercept the requests produced by these redirects and tell the browser to fetch them via the gateway as a proxy.
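
A minimal sketch of such a PAC script (the gateway port and the host list here are assumptions; in practice the list must be kept up to date with the Dat sites the extension knows about):

function FindProxyForURL(url, host) {
  // Hosts known to be Dat sites are fetched through the local gateway
  var datHosts = ["example.com"];
  if (datHosts.indexOf(host) !== -1) {
    return "PROXY localhost:3000";
  }
  // All other traffic is untouched
  return "DIRECT";
}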

Finally, to support the DatArchive API, this class needed to be added to the window object on Dat pages. While WebExtensions do allow code to be run in the page context via a content script, this code is sandboxed, meaning any modifications to window made from the content script are not visible to the page. Instead, we have to use the content script to first inject a script into the page which creates the DatArchive object. This script then communicates API calls to the content script via the postMessage API, which in turn relays them to the extension background. As Dat operations require the external Node.js process, these must then be further forwarded via native messaging, and the response returned back up the stack. Luckily there are libraries like spanan which make all this async messaging a bit easier to handle.
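
As a sketch, the content-script side of this relay might look something like the following (the injected file name and message shape are assumptions; the injected script would also need to be listed under web_accessible_resources):

// content script: inject a page script that defines window.DatArchive
var script = document.createElement('script');
script.src = browser.runtime.getURL('dat-archive.js');
document.documentElement.appendChild(script);

// relay API calls posted by the page script on to the extension background
window.addEventListener('message', function (event) {
  if (event.source === window && event.data && event.data.type === 'dat-api') {
    browser.runtime.sendMessage(event.data);
  }
});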

Conclusion

While dat-fox does enable browsing the Dat web, multiple limitations of the WebExtensions API mean that this support is second-class: users have to install a separate executable to bridge to Dat, and when visiting Dat sites the URL bar still displays http:// as the protocol.

To overcome these limitations we have to extend beyond what a standard WebExtension can do, using experimental APIs to bring fully-fledged Dat support. In the next post I'll describe how dat-webext bundles the Dat networking stack inside a WebExtension using libdweb's networking and protocol handler APIs.

Dat is one of several exciting distributed web technologies. It's simple to use, fast and well designed, and makes it easy to self-host content without any infrastructure. The Beaker Browser, moreover, is a great demonstration of how Dat could replace HTTP directly in the browser.

As this blog is just a bunch of static HTML files, it is a prime candidate to be hosted as a Dat archive. I decided to try out how easy it is to turn my site into a peer-to-peer site. Turns out it's pretty easy:

1. Build site

As my site is generated with Jekyll, we first need to create the built HTML version of the site that we want to host.

bundle exec jekyll build

This generates the site into the _site folder.

2. Create a directory for the dat archive

We want to keep the dat archive for the site separate from the git repository, because dat will add metadata in the .dat folder to track the history of the archive. If we used the _site folder directly as the archive, this metadata would get overwritten whenever we build the site. Instead, we copy the build output into the dat folder whenever we want to update the site.

mkdir -p /path/to/dats/sammacbeth.eu
cp -r /path/to/sammacbeth.github.io/_site/* /path/to/dats/sammacbeth.eu/

3. Create the dat archive

Using dat’s create command we can create a dat.json file which gives the site a name and description. This will also generate a dat:// URL for us, and initialise the archive with metadata in .dat. In my case I have the following in my dat.json:

{
  "title": "sammacbeth.eu",
  "description": "Sam Macbeth's Blog",
  "url": "dat://d116652eca93bc6608f1c09e5fb72b3f654aa3be2a3bca09bccfbe4131ff9e23"
}
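
For reference, this step amounts to running dat create inside the archive directory from step 2; the command prompts for the title and description interactively:

cd /path/to/dats/sammacbeth.eu
dat create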

4. Now share!

Now that your Dat is ready, you can share it on the p2p web:

dat share

Your site will now be available at the dat URL generated in step 3, in this case dat://d116…f9e23.

5. Bridge to HTTP

If you’re already running a site on the normal web, you can now mirror your dat version to your HTTP site. One simple way to do this is to clone your dat into the public html folder on your web server:

dat clone \
    dat://d116652eca93bc6608f1c09e5fb72b3f654aa3be2a3bca09bccfbe4131ff9e23 \
    /path/to/public_html

After cloning, you can also run sync to keep it up to date with changes you make on your local copy:

cd /path/to/public_html
dat sync

This also means that your webserver acts as another seeder for your archive, so you don’t have to keep seeding locally.

6. Make your P2P address discoverable

The final step improves the discoverability of your dat site by making visits from dat-enabled browsers (i.e. the Beaker Browser) aware of your dat version. An added bonus is that your dat site will then appear under your site’s hostname, rather than the full dat URL. In order to do this you have to:

  1. Serve your HTTP site over HTTPS.
  2. Create a /.well-known/dat file which points to your dat address (as described here).

In my case, https://sammacbeth.eu/.well-known/dat contains the following:

dat://d116652eca93bc6608f1c09e5fb72b3f654aa3be2a3bca09bccfbe4131ff9e23
TTL=3600

Note: in order for the .well-known folder to be included in your archive, you can add the --ignore-hidden=false option to the dat share command.
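
One way to create this file from the archive root (adjust the path to your own setup):

cd /path/to/dats/sammacbeth.eu
mkdir -p .well-known
printf 'dat://d116652eca93bc6608f1c09e5fb72b3f654aa3be2a3bca09bccfbe4131ff9e23\nTTL=3600\n' > .well-known/dat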

Now, when visiting your site over HTTPS, a p2p option will be available:

[Screenshot: the 'P2P Version Available' indicator in the browser]

And we now have a nice clean dat url too:

[Screenshot: the site loaded under a clean dat URL]

7. Updating the site

Now that everything is set up, you can update your site by simply copying new content into your local dat archive. The webserver will automatically pull in the changes and update the site on the web. If you don’t want to bother with the webserver part, you can also use a service like Hashbase to keep your dat-hosted site reachable from the web.
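
Putting it all together, an update round for this site looks something like the following (paths as in the earlier steps; if dat share is no longer running in the archive directory, re-run it there to import and announce the changes):

bundle exec jekyll build
cp -r /path/to/sammacbeth.github.io/_site/* /path/to/dats/sammacbeth.eu/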

I've written a couple of blog posts explaining how online tracking works, and how the system we've built at Cliqz protects against it. Read about them on the Cliqz blog.

My study on third-party trackers observed on German online-banking portals is now available as a whitepaper.

The study shows that several banks employ trackers on their login or logout pages, thus allowing third-party trackers to infer that the tracked user is a customer of a particular bank. This ties in with my ongoing work on the anti-tracking system available in the Cliqz Browser, which prevents the tracking presented in the study.

'Tracking the trackers', a paper I co-authored with colleagues at Cliqz, was presented today at the World Wide Web Conference.

The paper surveys user tracking on the web, and presents the state-of-the-art anti-tracking system we have developed at Cliqz. Slides of the talk are also available here.

A rough word count of a LaTeX document can be achieved using a combination of the detex and wc command line utilities (source).

This method has the advantage that it follows \input and \include commands in the target document, and can thus perform word counts on large, multi-source documents very quickly.

General usage is to run detex to strip all TeX markup from a document, then word count the resulting text, for example with the wc command line utility:

$ detex MacbethThesis.tex | wc -w
> 31412

Note that this method seems more accurate than the alternative of copying and pasting the contents of the output PDF file into a text editor and word counting that: that approach splits hyphenated words in two, and counts page numbers and other page furniture. As an example, the following converts the same document as above to text and word counts the result:

$ pdftotext MacbethThesis.pdf - | wc -w
> 66641

PDFJam is a collection of scripts which provide an interface to the pdfpages LaTeX package. There are tools for creating booklets and for appending PDF documents together, as shown below.
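
For example, two documents can be appended with the pdfjoin script from the same suite (the file names here are placeholders):

$ pdfjoin Chapter1.pdf Chapter2.pdf --outfile Thesis.pdf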

Creating Booklets

The pdfbook command reorganises an input document such that when printed it will be in booklet form (after being folded in half).

A simple example of creating an A5 booklet from an A4 input PDF:

$ pdfbook --a4paper Input.pdf