Project Alexandria - next-gen learning through personal searching



After Christmas, easily bored?

Or maybe I have infosec ADD. I like being able to look things up fast, but I also care about the source of information. Google does not, when it comes to infosec topics. You find a lot of fake articles / marketing campaigns via the common search terms. Now, don’t get me wrong. That is a business model and I get that. But it’s not mine. I want an automated knowledge base. My personal one, to be exact.

What if you could throw all your files into a box, and get a search index?

For personal files on personal servers I use software called Pydio. It’s like a Dropbox, self-hosted and without 3rd party services. Pydio is integrated with Apache Lucene. All in all a nice package, and free. All the features I use are available in the community version.

Cloud mounts

I host Pydio on a personal KVM instance on a private root server. I mount cloud services, like Amazon Cloud Drive or Google Drive, via FUSE, and then decrypt the content via an encfs mount on top of the cloud mount. That works, but it’s not for everyone. What you get, if you do it like this, is:

  • confidentiality - files are encrypted. That means control.
  • availability - you have more storage than you need, and it’s backed up if you mirror important files between the cloud services.

For Amazon Cloud Drive

You want about 100 TB of cloud storage for 60 USD a year? Me too… and using it is quite simple.

  • For uploads I use rclone. It’s the rsync for the cloud.

  • For mounts I use acdcli. Here’s how that works with encfs.

    acdcli mount /acd
    ENCFS6_CONFIG=/tmp/encfs/.encfs6.xml encfs /acd/Noise/ /mnt/stuff/
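
The upload side could look roughly like the sketch below. It assumes an rclone remote named `acd` has already been set up with `rclone config`, and that the local directory holds files that are already encfs-encrypted. The remote name, the paths and the `upload_encrypted` helper are made up for illustration. With `--dry-run`, rclone only reports what it would transfer.

```shell
#!/bin/sh
# Sketch only: "acd" is an assumed rclone remote name; the helper name and
# the example paths are hypothetical. RCLONE can be overridden for testing.
RCLONE="${RCLONE:-rclone}"

# upload_encrypted SRC_DIR REMOTE_PATH
# Copies new or changed files from SRC_DIR to REMOTE_PATH without deleting
# anything on the remote. --dry-run makes rclone report only; remove the
# flag to really upload.
upload_encrypted() {
    "$RCLONE" copy --dry-run "$1" "$2"
}

# Example (not run automatically):
#   upload_encrypted /srv/encrypted/Noise acd:Noise
```

`rclone copy` rather than `rclone sync` is used here on purpose: copy never deletes files on the destination, which is the safer default for a backup target.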

I am a bit weird when it comes to handling these mounts: I use an LVM-based, LUKS-encrypted, grsecurity-hardened Debian, and keep the KVM VM isolated via iptables NAT. The encfs config file in /tmp gets purged once the mount is active.
Sure, that is not military-grade security. But it works for me, at low cost. I export the encfs mount to another KVM VM, which runs Pydio. For that I use NFS4, locally between the VMs.
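
The "purge once the mount is active" step can be sketched as a small helper. This is my guess at a reasonable way to do it, not necessarily the exact script in use here: the function name is hypothetical, and it relies on `mountpoint` from util-linux to decide whether the encfs mount is up before deleting anything.

```shell
#!/bin/sh
# purge_config_if_mounted MOUNTPOINT CONFIG
# Deletes CONFIG only once MOUNTPOINT is an active mount; otherwise it
# leaves the file alone and returns non-zero.
purge_config_if_mounted() {
    if mountpoint -q "$1"; then
        rm -f -- "$2"
    else
        echo "not mounted: $1 (keeping $2)" >&2
        return 1
    fi
}

# Example with the paths from the mount commands (not run automatically):
#   purge_config_if_mounted /mnt/stuff /tmp/encfs/.encfs6.xml
```

Keep in mind that the config is needed again for the next mount, so a copy has to live somewhere safe outside the VM.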

Here are the NFS exports for that:

    /mnt/stuff 192.168.100.XXX(ro,sync,no_subtree_check) 192.168.100.YYY(ro,sync,no_subtree_check)

This is an NFS4 read-only export. Now, does Pydio work with read-only mounts?

Yes, it does.

Having the cloud mounts shared read-only has advantages for the integrity of the metadata here. I do not need to refresh the acd FUSE mount unless I add more data. That’s why the NFS mount is ro. Otherwise this can end in a loss of integrity.
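
On the Pydio VM it is easy to double-check that the share really arrived read-only. A minimal sketch, assuming a Linux client where the mount table is readable at /proc/mounts; the helper name is made up, and the second argument exists only so the function can be pointed at a different table.

```shell
#!/bin/sh
# is_readonly_mount MOUNTPOINT [MOUNT_TABLE]
# Returns 0 if MOUNTPOINT appears in the mount table with the "ro" option,
# 1 otherwise. MOUNT_TABLE defaults to /proc/mounts (Linux).
is_readonly_mount() {
    awk -v mp="$1" '
        $2 == mp {
            found = 1
            ro = 0
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
        }
        END { exit ((found && ro) ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}

# Example (not run automatically):
#   is_readonly_mount /mnt/stuff && echo "read-only, as intended"
```

If a path is mounted more than once, the last entry in the table decides, which matches what is effectively visible at that path.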

In case you have large mounts, you should increase the memory limit in the respective php.ini. And make sure you have pdftotext on the system, in case you want the contents of PDFs indexed.
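
A quick way to check both prerequisites from a shell, as a rough sketch. The function names are hypothetical. Note that the PHP command line client may read a different php.ini than the web server that runs Pydio, so the printed `memory_limit` is only a hint.

```shell
#!/bin/sh
# have_tool NAME: returns 0 if NAME is found in PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Prints whether pdftotext is available and, if the php CLI is installed,
# the memory_limit that the CLI sees.
report_indexing_prereqs() {
    if have_tool pdftotext; then
        echo "pdftotext: found"
    else
        echo "pdftotext: missing (on Debian it ships in the poppler-utils package)"
    fi
    if have_tool php; then
        # The CLI value can differ from the one the web server uses.
        php -r 'echo "memory_limit (CLI): ", ini_get("memory_limit"), PHP_EOL;'
    fi
}

# Example (not run automatically):
#   report_indexing_prereqs
```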

Done. All the PDFs will aid our auto-generated domain-focused knowledge base. Our personal corner in the library of Alexandria.

2 Factor - personal - yes

We need a guard. Kind of.

You get the push notifications via Duo. The personal 2 Factor account I have with Duo for this has worked for me for years. If that changes, the worst case is: I log into my VM and edit the config. That’s one advantage of personal systems: power. I do not need to file a support ticket and answer stupid questions about the color of my dog’s poo in 1992.

Sync client - for sync workspaces

Apart from being able to index cloud shares / backups, I also get a personal Dropbox style client.

It speaks German, a little.

Aren’t we all a little bit German, these days? Nein? Papiere bitte!

The mobile client is swell as well. It speaks English and does its stuff. On iOS and Android. And so on.


OwnCloud server is the Wordpress of personal file share systems. It doesn’t look like something I want to set up. The software has a similar feature set. But it looks like they have sloppy security practices in the software development.

Pydio has a security model at least. There’s one CVE entry. For OwnCloud we need Excel and spreadsheets already. 17 code exec vulns, and 5 directory traversals. Who develops OwnCloud? There we go. Or better not.


Works for me[tm]. I called it Project Alexandria to remind us of the historic incident: the library of Alexandria did not have full backups, and it burned down.

We can have better security, by making sure that we do not solely rely on one cloud provider. Neither for backups, nor for integrity. If we encrypt our data, we can ensure that confidential documents do not get lost easily. There is always a weakest link, but in case of this setup it’s under our control. That’s a huge difference, speaking of trust.

The weakest link here is the root server, where the mount resides, which can decrypt the data on the fly. Obviously there are scenarios where an attacker could get full access and download all the data on the fly. But that needs to be targeted. If you re-use your Yahoo or LinkedIn account for Google Drive, it does not need to be targeted. Especially if you do not encrypt any of the data.

The other point is that the cloud storage costs money. The cost for a personal account at Google Drive and Amazon Cloud Drive is about 120 USD a year. Or 60 if you keep backups local, and use old disks. The total cost of ownership would include the root server, management time etc. In other words: you should have documents and indexable data that are worth the effort.

At the end of the day we can have neat stuff in the browser, from our private service. For example a Cymatics-like audio visualization for MP3s :)

Whether that’s worth some extra cost and extra effort is up to us. But keeping materials offline and hidden doesn’t contribute to our learning and development, and finding rich content becomes harder every day. Or more expensive. Knowledge isn’t free. And it never was. Pretending that we can just google for information doesn’t make a difference. Because in reality we cannot just google “What are the new EU/US privacy laws and how do I comply with the regulations easily?” Sure, there are search results. But are these useful results? Not really.