Extending Privacy

Sam Macbeth’s blog

With Dat you can easily publish a website without the hassle of servers and web hosting – just copy your HTML and CSS to a folder, run dat share, and your site is online. However, every time you want to update the content on your site there is some manual work involved in copying over new files and updating the archive. With many personal sites now built with static site generators such as Jekyll, this can get cumbersome. Systems like GitHub Pages are much more convenient – automatically publishing your site when you push changes to GitHub. This post shows how to get a GitHub Pages level of convenience, using Dat.

As I wrote previously, this site is published on both Dat and HTTPS as follows:

  1. The site is built using Jekyll, outputting a _site directory with HTML and CSS.
  2. The contents of _site are copied to the folder containing the current version of the site.
  3. Run dat sync to sync the changes to the Dat version of the site to the network.
  4. A seeder on my webserver pulls down the latest version, which causes the HTTPS site to update.

Running steps 1-3 manually is a bit tedious, so we can automate them. The entire process can run in a continuous deployment pipeline, so the site is updated with just a git push.

The core of this is a script that can update the website's Dat archive using only two pieces of input data: the Dat's public and private keys. The keys can be obtained with the handy dat keys command:

# get public key (also its address)
$ dat keys
dat://d11665...
# get private key (keep this secret)
$ dat keys export
[128 chars of hex]

Armed with these two bits of information, we can run the following script anywhere to update the site:

npm install -g dat
dat clone dat://$(public_key)
rsync -rpgov --checksum --delete \
    --exclude .dat --exclude dat.json \
    --exclude .well-known/dat \
    _site/ $(public_key)/
cd $(public_key)
echo $(private_key) | dat keys import
timeout --preserve-status 3m dat share || true

Going through this line by line:

  • npm install -g dat installs the Dat CLI
  • dat clone dat://$(public_key) clones the current version of the site
  • rsync -rpgov --checksum --delete --exclude .dat --exclude dat.json --exclude .well-known/dat _site/ $(public_key)/ copies files from the build directory, _site, to the dat archive we just cloned. We only copy files whose contents have changed, and we also delete files which were removed in the site build. We exclude dat.json and .well-known/dat from this delete because they exist only in the dat archive. We also exclude .dat so as not to delete the archive metadata.
  • echo $(private_key) | dat keys import imports the private key for this dat, granting us write access to the archive.
  • timeout --preserve-status 3m dat share || true runs dat share, which syncs the changes back to the network. We keep the process open for 3 minutes to ensure that the content is properly synced, and then return true so as to not throw an error when the timeout inevitably occurs.

As mentioned, we can run this script on a CI/CD system to automate publishing. We must, however, ensure that the private key is kept secret. Luckily most systems offer a mechanism for private variables to be securely stored and kept hidden from job logs.

There is a risk with this approach – namely that the final dat share operation may not sync a full copy of the changes to the network, or that the peer which receives them subsequently disappears from the network. In this case, the archive could enter a broken state, where a full copy of the data can no longer be found. In my case, I run a seed at all times on my server, so I consider this risk to be low.

I currently have this automated publishing running as a 'Release pipeline' on Azure Pipelines. It can be triggered manually, or automatically by builds of this site from the Build pipeline. This gives me the 'GitHub Pages' experience I was looking for, but with an added deployment to the P2P web!

In the previous post I outlined why we would like to be able to load websites hosted on the Dat network in Firefox, and described a first attempt to do that with the dat-fox WebExtension. In this part we will look at how dat-webext overcomes the limitations of WebExtensions to provide first-class support for the dat protocol in Firefox, and how the same method could potentially be used to run any P2P protocol implemented in node in Firefox.

Last time I mentioned three capabilities missing from the current WebExtensions APIs, which make Dat support difficult:

  1. APIs for low-level networking (TCP and UDP sockets) inside the WebExtension context.
  2. Extension-implemented protocol handlers.
  3. A way to make custom APIs, like DatArchive, available to pages on the custom protocol.

Libdweb

The first two are being directly addressed by Mozilla's libdweb project, which is prototyping APIs for TCP and UDP sockets, protocol handlers and more, usable from WebExtensions. The implementations are done as experimental APIs, which is how new WebExtension APIs can be tested and developed for Firefox. They are built using Firefox internal APIs (similar to the old legacy extension stack), and then expose a simple API to the extension.

// protocol handler
browser.protocol.registerProtocol('dweb', (request) => new Response(...))

The catch with using libdweb in an extension is that, as these are experimental APIs, their use is restricted. An extension using them can only be run in debugging mode (meaning it will be removed when the browser is closed), or must otherwise be shipped with the browser itself as a privileged 'system' addon. This makes shipping extensions that use these features to end-users difficult for now.

Webextify

The Dat stack is composed of two main components: Hyperdrive, which implements the Dat data structures and sync protocol, and discovery-swarm, which is the network stack used to discover peers to sync data with. The former can already run in the browser with the help of packagers like browserify that shim missing node libraries. As Hyperdrive does not do any networking, all node APIs it uses can be polyfilled by equivalent browser ones. Discovery-swarm, on the other hand, is at its core a networking library, which expects to be able to open TCP and UDP sockets in order to communicate with the network and peers. Therefore, we have two options to get the full stack running in an extension:

  1. Implement an equivalent of discovery-swarm using the libdweb APIs directly, or
  2. implement node's networking using libdweb APIs.

For dat-webext, I went with the latter, primarily because, thanks to other developers around the libdweb project, most of the work was already done: Gozala (the prime developer behind libdweb) implemented node's dgram module on top of the experimental API, and Substack did the same for net in a gist. Add to that a simple implementation of dns using the already existing browser.dns API, and we have all the shims needed to 'webextify' the entire dat-node implementation.
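
As an illustration, here is a minimal sketch of what such a node-style dns.lookup shim backed by browser.dns might look like. This is not the actual shim used by webextify, and it assumes the extension requests the "dns" permission; real shims cover more of the dns module surface.

// Sketch: node-style dns.lookup on top of browser.dns.resolve.
// Assumes the extension has the "dns" permission in its manifest.
function lookup(hostname, options, callback) {
  if (typeof options === 'function') {
    callback = options; // lookup(hostname, callback) form
  }
  browser.dns.resolve(hostname).then((record) => {
    const address = record.addresses[0];
    const family = address.indexOf(':') !== -1 ? 6 : 4;
    callback(null, address, family);
  }, (err) => callback(err));
}

module.exports = { lookup };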

Putting this together, we can now use discovery swarm directly in our extension code:

var swarm = require('discovery-swarm');
// do networking things

Then, using a browserify fork:

npm install @sammacbeth/webextify
webextify node_code.js > extension_code.js

Putting it together

Now that we have webextify for the Dat modules, and the protocol handler API for the dat protocol, we can write an extension that serves content for dat:// URLs with little extra effort, for example using Beaker's dat-node library:

const { createNode } = require('@sammacbeth/dat-node')
const node = createNode({ storage, dns })
browser.protocol.registerProtocol('dat', async (request) => {
    const url = new URL(request.url)
    const archive = await node.getArchive(request.url)
    const body = await archive.readFile(url.pathname)
    return new Response(body)
})

Storage of hyperdrive data (to allow offline access) is done using random-access-idb-mutable-file, which provides a fast, Firefox-compatible implementation of the generic random-access-storage API used by hyperdrive.

dat-webext glues together these different pieces to provide a protocol handler with much the same behaviour as in the Beaker browser, including:

  • Versioned Dat URLs: dat://my.site:99/.
  • web_root and fallback_page directives in dat.json.
  • Resolution of index.htm(l)? for URLs that point to folder roots.
  • Directory listing for paths with no index file – the path resolution is sketched below.
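
To illustrate the last two items, the resolution logic amounts to roughly the following sketch. It assumes the archive object exposes promise-based stat, readdir and readFile, as in the handler example above, and is not the actual dat-webext code.

// Sketch: resolve a request path within an archive to a response body.
async function resolvePath(archive, pathname) {
  const stat = await archive.stat(pathname);
  if (!stat.isDirectory()) {
    return archive.readFile(pathname);
  }
  // Folder root: look for an index file first.
  const prefix = pathname.endsWith('/') ? pathname : pathname + '/';
  for (const index of ['index.html', 'index.htm']) {
    try {
      return await archive.readFile(prefix + index);
    } catch (e) {
      // no such index file, try the next candidate
    }
  }
  // No index file: build a simple directory listing.
  const entries = await archive.readdir(pathname);
  return entries.map((name) => `<a href="${prefix}${name}">${name}</a>`).join('<br>');
}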

DatArchive

The last requirement is to create a DatArchive object that is present on the window global for dat:// pages. Here we initially hit an issue: the method of injecting this via content-script, as we did for dat-fox, doesn't work. As custom protocols are an experimental feature, it is not possible to register URLs of that protocol for content-script injection with the current APIs. However, as we are using experimental APIs now, we can write a new API to bypass this limitation!

In dat-webext we package an extra experimental API, called processScript. This API allows the extension to register a script to be injected into dat pages. The injection is done using privileged APIs, which means we can guarantee that it happens before any script evaluation on the actual page, so DatArchive is present even for inline page scripts – fixing a limitation of the injection method used by dat-fox. The API also exposes a messaging channel: postMessage calls in the page are delivered to the extension background script, and messages from the background are delivered as 'message' events in the page.
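
To give a feel for how the injected script can use this channel, here is a rough sketch of a page-side stub. The message shapes and method names here are made up for illustration and differ from the real dat-webext implementation.

// Sketch: page-side DatArchive stub that forwards calls over the messaging channel.
let nextId = 0;
const pending = new Map();

// Replies from the extension background arrive as 'message' events.
window.addEventListener('message', (event) => {
  const msg = event.data;
  if (!msg || msg.type !== 'dat-response' || !pending.has(msg.id)) {
    return;
  }
  const { resolve, reject } = pending.get(msg.id);
  pending.delete(msg.id);
  msg.error ? reject(msg.error) : resolve(msg.result);
});

function call(method, args) {
  return new Promise((resolve, reject) => {
    const id = nextId++;
    pending.set(id, { resolve, reject });
    window.postMessage({ type: 'dat-request', id, method, args }, '*');
  });
}

window.DatArchive = class DatArchive {
  constructor(url) { this.url = url; }
  readFile(path, opts) { return call('readFile', [this.url, path, opts]); }
  writeFile(path, data, opts) { return call('writeFile', [this.url, path, data, opts]); }
};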

Try it out!

You can test out dat-webext in Firefox Nightly or Developer Edition:

git clone https://github.com/cliqz-oss/dat-webext
cd dat-webext
npm install
npm run build
npm run start

[dat-webext demo]

Summary

Dat-webext allows the dat protocol to be integrated into Firefox, and makes the experience of loading dat:// URLs the same as for any other protocol the browser supports. As Dat syncing and networking now reside in the browser process, as opposed to a separate node process as in dat-fox, data from dat archives is properly stored inside the user profile directory. Resources are also better utilised, as an extra node runtime is not required – all code runs in Firefox's own SpiderMonkey engine.

The challenge with dat-webext is distribution: Firefox addon and security policies mean that it cannot be installed as a plain addon from the store. It also cannot be installed manually without adjusting the browser sandbox levels, which can incur a security risk.

What we can do is bundle the addon with a Firefox build. In this setup the extension is a 'system addon', which permits it to use experimental APIs. We did this with the Cliqz fork of Firefox and tested it on the beta channel there. However, there are further issues to solve with the application sandbox on Mac and Linux blocking the extension from creating TCP sockets. Due to this, we don't have the extension fully working on this channel yet, but we're close!

Firefox is not the only possible target for libdweb-based projects, though. Firefox is based on Gecko, and with the brilliant GeckoView project we can have Gecko without Firefox. This opens up lots of possibilities: for example, on Android the dat-webext extension can run inside a GeckoView and provide Dat capabilities to any app. More on that in a future post!

The libdweb APIs, and the shims for node APIs on top of them, are shaping up well to enable innovation around how the browser loads the pages it shows. As well as Dat, these APIs are being used to bring the WebTorrent and IPFS protocols to Firefox. With webextify we can theoretically compile any node program for the WebExtension platform, opening up a vast array of possibilities inside the browser.

The Dat protocol enables peer-to-peer sharing of data and files across the web. Like similar technologies such as IPFS and BitTorrent, it allows clients to validate the data they receive, so one can be sure the data is replicated correctly; in contrast to those technologies, however, Dat also supports modification of the resources at a specific address, with fast updates propagated to peers. Other useful properties include private discovery – allowing data shared privately on the network to remain so.

These features have led to a movement to use it as a new protocol for the web, with the Beaker browser pushing innovation around what this new peer-to-peer web could look like. The advantages of using Dat for the web are manifold:

  • Offline-first: Every site you load is implicitly kept locally, allowing it to be navigated when offline. Changes to sites (both local and remote) propagate when connectivity is available, and functionality remains the same whether or not you are connected.
  • Transparent and censorship resistant: Sites are the same for every user – the site owner cannot change site content based on your device or location, as is common on the current web. As sites are entirely published in Dat and there is no server-side code, all the code running the site can be seen with 'view source'.
  • Self-archiving: Dat versions all mutations of sites, so as long as at least one peer keeps a copy of this data, the history of the site will remain accessible and viewable. This can also keep content online after the original publisher stops serving their content.
  • Enables self-publishing: As servers are no longer required, anyone can publish a site with Dat – no hosting or devops needed. Publishing to the P2P web requires no payment, no technical expertise, and no platform lock-in.
  • Resilient: Apps and sites stay up as long as people are using them, even if the original developers have stopped hosting.

The Beaker browser already demonstrates all of these features, but, as an Electron-based app, it lacks some of the security features, depth of configuration and extensibility of a fully-fledged browser. For this reason I wanted to explore how we could bring these features to Firefox, which could enable access to the Dat web for the low cost of installing a browser extension. (Also, as I work for a company building a fork of Firefox, I have a vested interest in getting this working in the browser I develop.)

This article is split into two parts. The first part describes dat-fox, a Firefox extension that provides the best Dat support possible given the current limitations of the WebExtensions APIs. The second describes the process and challenges of creating dat-webext, a Firefox extension which uses experimental APIs from the libdweb project to build full Dat protocol support into a WebExtension, and which is currently bundled with the Cliqz Browser nightly build.

There were three main challenges to building Dat support in an extension:

  1. Running Dat in an extension context. Dat is currently only implemented for nodejs (though a Rust implementation is on the way), and uses APIs such as net and dgram which have no analogues in the web stack. This means we need to find a way to run this implementation in a WebExtension, or find alternative ways of communicating with other peers to get the content of Dat sites.
  2. Adding new protocols to the browser, so that it can understand an address starting with dat://. This then has to be wired up to the Dat implementation to return the correct content for that URL so it can be rendered in the browser.
  3. Adding new web APIs for Dat. In Beaker, a new API, DatArchive, was proposed, which allows pages to programmatically read the contents of Dat sites. For sites where the user is the owner and has write permissions, the API also allows writes. This API is innovative in that it enables self-mutating sites, and has spawned various 'Dat apps' which behave like many modern web apps, yet have no server (a short usage sketch follows below).
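
As a flavour of the API, a page served over dat:// in Beaker can read and write files in its own archive along these lines. The /counter.json file here is just an example, and writes only succeed if the user owns the archive.

// Sketch of DatArchive usage on a dat:// page in Beaker.
async function bumpVisitCount() {
  const archive = new DatArchive(window.location.href);
  let counter = { visits: 0 };
  try {
    counter = JSON.parse(await archive.readFile('/counter.json'));
  } catch (e) {
    // file doesn't exist yet
  }
  counter.visits += 1;
  // Only possible if this user owns the archive.
  await archive.writeFile('/counter.json', JSON.stringify(counter));
}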

First Attempt: Dat-fox

Early last year, inspired by the whitelisting of P2P protocols for use with the WebExtensions protocol handlers API, I started dat-fox to build dat:// support in a WebExtensions-compatible extension. Unfortunately, the current APIs are severely limiting, meaning that all three of the above challenges could only be partially solved.

WebExtensions allow protocol handlers to be specified in their manifest; however, these function as simple redirects. To render content under these handlers an HTTP server is still required, either on the web or running locally. As we also cannot run an HTTP server inside the extension, the APIs necessitate an external process that serves the content for dat:// URLs.

Dat-fox implemented a dat:// protocol handler which redirected to a local process, launched via the native messaging API. The separate process, written in node, manages syncing with the dat network, and acts as a gateway HTTP server so that the browser can load dat:// pages when redirected.
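
Conceptually, the gateway side of that process is not much more than the following sketch. Here getArchive stands in for whatever Dat implementation loads and syncs an archive, and is assumed to return an object with a promise-based readFile; the real dat-fox helper handles far more cases.

// Sketch: an HTTP gateway that serves files out of dat archives.
const http = require('http');

function createGateway(getArchive, port) {
  return http.createServer(async (req, res) => {
    // Expect URLs of the form /<archive key or hostname>/<path in archive>
    const [, key, ...rest] = req.url.split('/');
    const path = '/' + rest.join('/');
    try {
      const archive = await getArchive('dat://' + key);
      const body = await archive.readFile(path === '/' ? '/index.html' : path);
      res.end(body);
    } catch (e) {
      res.statusCode = 404;
      res.end('Not found');
    }
  }).listen(port);
}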

[dat-fox protocol diagram]

A further challenge for dat-fox was ensuring that the origins of dat:// pages were correct, and that URLs looked correct when browsing. Each Dat archive written as a webpage expects to be on its own origin. For example, the page dat://example.com/test.html should have the origin example.com. This is important both for the browser's security model, so that localStorage is not shared between sites, and for calculating relative paths to files from a specific document. A naive implementation of the protocol handler might redirect the browser to http://localhost/example.com/test.html. However, this page would then have the incorrect origin localhost, and could break links in the rendered page.

We solved this issue by tricking the browser into loading pages like http://example.com/test.html via the gateway. Using a PAC file, which allows the browser to dynamically send traffic to different proxies, we take the requests generated by the dat:// redirects and tell the browser to route them through the gateway as a proxy.
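
For illustration, the PAC logic amounts to something like the following sketch. The gateway port and the way dat hostnames are tracked are simplified here.

// Sketch of a PAC script: send known dat hostnames through the local gateway.
var datHosts = ['example.com']; // hostnames the extension has redirected from dat://
function FindProxyForURL(url, host) {
  if (datHosts.indexOf(host) !== -1) {
    return 'PROXY 127.0.0.1:3000';
  }
  return 'DIRECT';
}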

Finally, to support the DatArchive API, this class needed to be added to the window object on Dat pages. While WebExtensions do allow code to run in the page context via a content-script, this code is sandboxed, so any modifications to window from the content-script are not seen by the page. Instead, we first use the content-script to inject a script into the page which creates the DatArchive object. This script then communicates API calls to the content-script via the postMessage API, which in turn relays them to the extension background. As Dat operations require the external node process, they must then be forwarded further via native messaging, and the response returned back up the stack. Luckily there are libraries like spanan which make all this async messaging a bit easier to handle.
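
In outline, the content-script end of this chain looks something like the following sketch. The message shapes are illustrative, and the injected script would need to be listed under web_accessible_resources.

// Sketch: content-script that injects the page script and relays its messages.
const script = document.createElement('script');
script.src = browser.runtime.getURL('dat-archive-page.js'); // defines window.DatArchive
document.documentElement.appendChild(script);

window.addEventListener('message', async (event) => {
  const msg = event.data;
  if (!msg || msg.type !== 'dat-request') {
    return; // not one of ours
  }
  // Forward to the background, which relays to the node process via native
  // messaging, then post the reply back to the page.
  const result = await browser.runtime.sendMessage(msg);
  window.postMessage({ type: 'dat-response', id: msg.id, result }, '*');
});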

Conclusion

While dat-fox does enable browsing the Dat web, multiple limitations of the WebExtension APIs mean that this support is second-class: users have to install a separate executable to bridge to Dat, and when visiting Dat sites the URL bar still displays http:// as the protocol.

To overcome these limitations we have to go beyond what a standard WebExtension can do, using experimental APIs to bring fully-fledged Dat support. In the next post I'll describe how dat-webext bundles the Dat networking stack inside a WebExtension using libdweb's networking and protocol handler APIs.

Dat is one of several exciting distributed web technologies. It’s simple to use, fast and well designed, and makes it easy to self-host content without any infrastructure. The Beaker Browser also offers a great demonstration of how Dat could replace HTTP directly in the browser.

As this blog is just a bunch of static HTML files, it is a prime candidate to be hosted as a Dat archive. I decided to find out how easy it is to turn my site into a peer-to-peer site. Turns out it’s pretty easy:

1. Build site

As my site is generated with Jekyll, we first need to build the HTML version of the site which we want to host.

bundle exec jekyll build

This generates the site into the _site folder.

2. Create a directory for the dat archive

We want to keep the dat archive for the site separate from the git repository. This is because dat will add metadata in the .dat folder to track the history of the archive. If we used the _site folder directly for the archive, this would get overwritten whenever we build the site. Therefore, we will have to resort to copying the build output to the dat folder whenever we want to update the site.

mkdir -p /path/to/dats/sammacbeth.eu
cp -r /path/to/sammacbeth.github.io/_site/* /path/to/dats/sammacbeth.eu/

3. Create the dat archive

Using dat’s create command we can create a dat.json file which gives the site a name and description. This also generates a dat:// URL for us and initialises the archive with metadata in .dat. In my case I have the following in my dat.json:

{
  "title": "sammacbeth.eu",
  "description": "Sam Macbeth's Blog",
  "url": "dat://d116652eca93bc6608f1c09e5fb72b3f654aa3be2a3bca09bccfbe4131ff9e23"
}

4. Now share!

Now that your Dat is ready, you can share it on the P2P web:

dat share

Now your site will be available under the dat URL you generated in step 3, in this case dat://d116…f9e23

5. Bridge to HTTP

If you’re already running a site for the normal web, you can now mirror your dat version to your HTTP site. One simple way to do this is to clone your dat in the public html folder on your web server:

dat clone \
    dat://d116652eca93bc6608f1c09e5fb72b3f654aa3be2a3bca09bccfbe4131ff9e23 \
    /path/to/public_html

After cloning, you can also run sync to keep it up to date with changes you make on your local copy:

cd /path/to/public_html
dat sync

This will also mean that your webserver acts as another seeder for your archive, meaning you don’t have to keep seeding locally.

6. Make your P2P address discoverable

The final step improves the discoverability of your dat site, by making visits from dat-enabled browsers (such as the Beaker Browser) aware of your dat version. An added bonus is that your dat site will then appear under your site’s hostname, rather than the full dat URL. In order to do this you have to:

  1. Serve your HTTP site over HTTPS.
  2. Create a /.well-known/dat file which points to your dat address (as described here)

In my case, https://sammacbeth.eu/.well-known/dat contains the following:

dat://d116652eca93bc6608f1c09e5fb72b3f654aa3be2a3bca09bccfbe4131ff9e23
TTL=3600

Note: in order for the .well-known folder to be included in your archive, add the --ignore-hidden=false option to the dat share command.

Now, when visiting your site over HTTPS, a p2p option will be available:

[Screenshot: 'P2P Version Available' option in the browser]

And we now have a nice clean dat url too:

[Screenshot: the site with a clean dat URL]

7. Updating the site

Now that everything is set up, you can update your site by simply copying new content into your local dat archive. The webserver will automatically pull in the changes and update the site on the web. If you don’t want to bother with the webserver part, you can also use a service like Hashbase to get your dat-hosted site onto the web.

I've written a couple of blog posts explaining how online tracking works, and how the system we've built at CLIQZ protects against it. Read about it on the CLIQZ blog.

My study on third-party trackers observed on German online-banking portals is now available as a whitepaper.

The study shows that several banks employ trackers on their login or logout pages, thus allowing the third-party trackers to infer that a tracked user is a customer of a particular bank. This ties in with my ongoing work on the anti-tracking system available in the Cliqz Browser, which prevents the tracking presented in the study.

'Tracking the trackers', a paper I co-authored with colleagues at Cliqz, was presented today at the World Wide Web Conference.

The paper surveys user tracking on the web, and presents the state-of-the-art anti-tracking system we have developed at Cliqz. Slides of the talk are also available here.

A rough word count of a LaTeX document can be achieved using a combination of the detex and wc command-line utilities.

This method has the advantage that it follows \input and \include commands in the target document, so large, multi-source documents can be word-counted very quickly.

General usage is to run detex to strip all TeX markup from a document, then count the words in the resulting text, for example with the wc command-line utility:

$ detex MacbethThesis.tex | wc -w
> 31412

Note that this method seems more accurate than the alternative of copying and pasting the contents of the output PDF file into a text editor and word-counting that: the PDF route splits hyphenated words in two, counts page numbers, and so on. As an example, the following converts the same document as above to text and word counts the result:

$ pdftotext MacbethThesis.pdf - | wc -w
> 66641

PDFJam is a collection of scripts which provide an interface to the pdfpages package. There are tools for creating booklets and appending together PDF documents.

Creating Booklets

The pdfbook command reorganises an input document such that when printed it will be in booklet form (after being folded in half).

A simple example of creating an A5 booklet from an A4 input PDF:

$ pdfbook --a4paper Input.pdf