Hello! Here to show off a neat project I’ve been working on!

Over on my git site I've been working on two (well, one, but kinda three?) projects.

This is one of the more involved projects I've done. Probably not the hardest, since the documentation for everything has been pretty good, but definitely one of the more complex ones. So let's break it down!

The crawler

This was the first part I built. It was spawned from a combination of things:

Originally I was trying to use spider-rs. However, it was causing too many issues; with its documentation lacking or incorrect, it was just too hard to use. Thus I made my own! Using an amalgamation of Rust code and libraries, most notably html5ever and reqwest, it was trivial to download and parse each page.
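
To give a flavor of that, here's a minimal sketch of fetching a page with reqwest and pulling links out of the html5ever DOM. This isn't the crawler's actual code; the crate choices (blocking reqwest, markup5ever_rcdom) and the collect_links helper are my own assumptions.

```rust
// Sketch only: assumes reqwest (with the "blocking" feature), html5ever,
// and markup5ever_rcdom; this is not the crawler's actual code.
use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::{Handle, NodeData, RcDom};

fn fetch_links(url: &str) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    // Download the page body as text.
    let body = reqwest::blocking::get(url)?.text()?;

    // Parse the HTML into an RcDom tree.
    let dom = parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .read_from(&mut body.as_bytes())?;

    // Walk the tree, collecting href attributes from <a> tags.
    let mut links = Vec::new();
    collect_links(&dom.document, &mut links);
    Ok(links)
}

fn collect_links(node: &Handle, links: &mut Vec<String>) {
    if let NodeData::Element { ref name, ref attrs, .. } = node.data {
        if &*name.local == "a" {
            for attr in attrs.borrow().iter() {
                if &*attr.name.local == "href" {
                    links.push(attr.value.to_string());
                }
            }
        }
    }
    for child in node.children.borrow().iter() {
        collect_links(child, links);
    }
}
```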

Of course, to store all this data we need a database. Actually, two databases. To keep myself from having to do the hard work of tracking what sites we've been to and where we need to go, I offloaded this burden to a database. SurrealDB to be exact! Schemaless prototyping was awesome for development, and then I locked in a schema later <3! This project also uses the relational side of Surreal to keep track of which sites link to which other sites. Writing the queries for this is just so much more fun than traditional SQL (even though they really are pretty close), and that makes a big difference in the developer experience.
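
As a rough taste of what that looks like, here's a hedged sketch using the SurrealDB Rust SDK. The namespace, table, and edge names (crawler, site, links_to) are placeholders I made up for illustration, not the project's real schema.

```rust
// Sketch only: assumes the surrealdb crate (WebSocket engine) plus tokio,
// with made-up namespace/table/edge names (crawler, site, links_to).
use surrealdb::engine::remote::ws::Ws;
use surrealdb::opt::auth::Root;
use surrealdb::Surreal;

#[tokio::main]
async fn main() -> Result<(), surrealdb::Error> {
    // Connect to a locally running SurrealDB instance.
    let db = Surreal::new::<Ws>("127.0.0.1:8000").await?;
    db.signin(Root { username: "root", password: "root" }).await?;
    db.use_ns("crawler").use_db("crawler").await?;

    // Schemaless prototyping: just create records with whatever fields you want.
    db.query("CREATE site:wikipedia SET url = 'https://wikipedia.org'")
        .await?
        .check()?;
    db.query("CREATE site:wikimedia SET url = 'https://wikimedia.org'")
        .await?
        .check()?;

    // The relational part: record that one site links to another.
    db.query("RELATE site:wikipedia->links_to->site:wikimedia")
        .await?
        .check()?;

    // Graph-style query: which sites does site:wikipedia link to?
    // (Results can be deserialized out of the response with .take().)
    db.query("SELECT ->links_to->site AS targets FROM site:wikipedia")
        .await?
        .check()?;
    Ok(())
}
```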

Using Surreal works great for deciding where we've been and where we need to go. But now we need to store all the sites we have visited. For this I chose MinIO, an S3-like storage system that you can host locally using Docker. The fact that it's S3 compatible doesn't really do anything for us; we're just using it like a remote filesystem.
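
I won't claim this is exactly how the crawler talks to MinIO, but as a sketch, storing a page with the aws-sdk-s3 crate pointed at a local MinIO endpoint could look something like this (bucket name, endpoint, and key layout details are placeholders):

```rust
// Sketch only: assumes aws-config + aws-sdk-s3 talking to a MinIO instance on
// localhost:9000; the bucket name and credentials (via env vars) are placeholders.
use aws_sdk_s3::primitives::ByteStream;

async fn store_page(domain: &str, path: &str, html: String) -> Result<(), aws_sdk_s3::Error> {
    // Point the AWS SDK at the local MinIO endpoint instead of real S3.
    let base = aws_config::from_env()
        .endpoint_url("http://localhost:9000")
        .load()
        .await;
    // MinIO is usually addressed path-style rather than virtual-hosted-style.
    let conf = aws_sdk_s3::config::Builder::from(&base)
        .force_path_style(true)
        .build();
    let client = aws_sdk_s3::Client::from_conf(conf);

    // Objects are laid out as domain_name/file_path inside the bucket.
    let key = format!("{domain}/{path}");
    client
        .put_object()
        .bucket("crawl")
        .key(key)
        .body(ByteStream::from(html.into_bytes()))
        .send()
        .await?;
    Ok(())
}
```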


To serve the content back to the user we need two parts:

  1. We need a way to interact with minio
  2. We need a way to intercept requests meant for a remote server and redirect them to the server defined in #1

Part 1 - The Server

This server was most easily built in Rust using Rocket. I use Rocket for all my servers because it has a great balance of control and ease of use. It was also beneficial to write this in Rust since we are again interacting with MinIO, and the crawler was written in Rust, so we can just copy and paste the MinIO code into this server. The code for this server is really pretty simple. If you'd like, you can view it here.
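
To give a feel for the shape of it, here's a hedged Rocket sketch along those lines: the MinIO client lives in managed state and stored HTML gets served back out. The route path and names are stand-ins, not the real server's API.

```rust
// Sketch only: a Rocket server in the spirit of the one described, keeping an
// S3/MinIO client in managed state and serving stored HTML back out by object
// key. The route shape and names here are stand-ins, not the real API.
use rocket::response::content::RawHtml;
use rocket::{get, launch, routes, State};

struct Storage {
    client: aws_sdk_s3::Client,
    bucket: String,
}

#[get("/object/<key..>")]
async fn object(key: std::path::PathBuf, storage: &State<Storage>) -> Option<RawHtml<Vec<u8>>> {
    let resp = storage
        .client
        .get_object()
        .bucket(storage.bucket.as_str())
        .key(key.to_string_lossy())
        .send()
        .await
        .ok()?;
    // Collect the streamed body into memory and hand it back as HTML.
    let bytes = resp.body.collect().await.ok()?.into_bytes();
    Some(RawHtml(bytes.to_vec()))
}

#[launch]
async fn rocket() -> _ {
    // Same MinIO-pointed config as in the crawler-side sketch earlier.
    let base = aws_config::from_env()
        .endpoint_url("http://localhost:9000")
        .load()
        .await;
    let conf = aws_sdk_s3::config::Builder::from(&base)
        .force_path_style(true)
        .build();
    let storage = Storage {
        client: aws_sdk_s3::Client::from_conf(conf),
        bucket: "crawl".to_string(),
    };
    rocket::build().manage(storage).mount("/", routes![object])
}
```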

The file structure in MinIO is as follows: bucket_name/domain_name/file_path. This keeps it simple but still sorted by domain.

Part 2 - Interception

This is where the project turns into the wild west. To make this easier to explain, I'll just be calling the server defined in Part 1 "the database". Obviously normal databases don't have HTTP capabilities, but I'll pretend this one does; in actuality the extension is calling the server, which is calling the database.

So let’s walk thru how this is going to work.

  1. We go to a page, such as wikipedia.org. This creates a request like GET https://wikipedia.org, and this is why we need an extension: so we can intercept it.
  2. Intercept the GET https://wikipedia.org request and forward it to the database.
  3. We get back the corresponding HTML from the database, then, using the powers of extensions, set the DOM to this incoming HTML.

Notes on the API

The API that I'm currently using for connecting the extension to the database is very simple: just localhost/s3/{url}. However, this causes some interesting issues. Since any given URL will have slashes ("/") in it, the URL https://wikipedia.org/ gets split into "https:" + "" + "wikipedia.org", which is not how the crawler was saving pages into the database. To prevent this (and this is also what the crawler does), we base64 encode the whole URL, so https://wikipedia.org turns into aHR0cHM6Ly93aWtpcGVkaWEub3Jn.
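
Since the crawler does this same encoding in Rust, here's roughly what it looks like with the base64 crate (which engine/alphabet the project actually uses is an assumption on my part):

```rust
// Sketch only: assumes the base64 crate's standard alphabet; the project may
// well use a different engine/alphabet.
use base64::engine::general_purpose::STANDARD;
use base64::Engine as _;

fn main() {
    let url = "https://wikipedia.org";

    let encoded = STANDARD.encode(url);
    println!("{encoded}"); // prints: aHR0cHM6Ly93aWtpcGVkaWEub3Jn

    // And back again, e.g. when looking the page up server-side.
    let decoded = STANDARD.decode(&encoded).unwrap();
    assert_eq!(String::from_utf8(decoded).unwrap(), url);
}
```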

However, this causes another issue. What about relative links?

You see, a website can link to other resources (JS, CSS, videos, pictures, hyperlinks) using both absolute and relative URLs. An absolute URL would look like this: https://wikipedia.org/picture.png. But if we were already on the main page (https://wikipedia.org/), a link to picture.png or /picture.png would go to the same place. Obviously, though, when you encode https://wikipedia.org/picture.png and /picture.png you get different results, so they point to different resources in the database.
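
For reference, this is the resolution a browser does automatically. A quick sketch with the url crate (not part of the project, just to illustrate the point):

```rust
// Sketch only: the url crate isn't part of the project as far as I know; this
// just demonstrates how relative links resolve against a base URL.
use url::Url;

fn main() -> Result<(), url::ParseError> {
    let base = Url::parse("https://wikipedia.org/")?;

    // Both of these resolve to the same absolute URL in a browser...
    assert_eq!(base.join("picture.png")?.as_str(), "https://wikipedia.org/picture.png");
    assert_eq!(base.join("/picture.png")?.as_str(), "https://wikipedia.org/picture.png");

    // ...but base64-encoding the raw strings "/picture.png" and
    // "https://wikipedia.org/picture.png" gives two different database keys.
    Ok(())
}
```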

So we need to save the domain that we are currently on. You might think this would be as easy as just reading the URL bar in code. However, once the request is intercepted by the extension and redirected to the database, the URL bar shows the database's URL.

So here's what needs to happen: when a request is first intercepted, the extension

  1. Saves the domain name to a cookie
  2. Redirects the request

Then, on subsequent requests, it can prefix this saved domain to the relative path. To make this easier on the API side it looks like this: localhost/s3/{domain}/{path}, with "domain" being the raw string and "path" being base64 encoded. Ex: GET http://localhost/s3/en.wikipedia.org/cGljdHVyZS5wbmc=.
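
On the server side, turning that shape back into an object key might look something like this sketch (the helper is hypothetical, not the actual code):

```rust
// Sketch only: how the server might turn the /s3/{domain}/{base64 path} shape
// back into a domain_name/file_path object key. The helper name and the exact
// key handling are my own stand-ins, not the actual code.
use base64::engine::general_purpose::STANDARD;
use base64::Engine as _;

fn object_key(domain: &str, encoded_path: &str) -> Option<String> {
    // Decode the base64 path segment back into the original relative path.
    let decoded = STANDARD.decode(encoded_path).ok()?;
    let rel_path = String::from_utf8(decoded).ok()?;
    // Rebuild the key using the bucket's domain_name/file_path layout.
    Some(format!("{}/{}", domain, rel_path.trim_start_matches('/')))
}

fn main() {
    // GET http://localhost/s3/en.wikipedia.org/cGljdHVyZS5wbmc=
    let key = object_key("en.wikipedia.org", "cGljdHVyZS5wbmc=").unwrap();
    assert_eq!(key, "en.wikipedia.org/picture.png");
}
```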


Reminder that you can follow the development of the crawler or the server/extension. These are hosted via Gitea, so you can subscribe to RSS updates for the individual repos if so desired.