Related to https://github.com/spantaleev/matrix-docker-ansible-deploy/pull/963
This also simplifies Prerequisites, which is great.
It'd be nice if we were doing these checks in some optional manner
and reporting them as helpful messages (using
`matrix_playbook_runtime_results`), but that's more complicated.
I'd rather drop these checks completely.
This variable was previously undefined in the role and was only getting
defined via `group_vars/matrix_servers`.
We now properly initialize it (and its good default value) in the role
itself.
Fixes a regression caused by a5ee39266c.
If the user id and group id were different than 991:991
(which used to be a hardcoded default for us long ago),
there was a mismatch between what Synapse was trying to use (991:991)
and what it was actually started with (in `--user=..`). It was then
trying to change ownership, which was failing.
This was mostly affecting newer installations which were not using the
991:991 defaults we had long ago (since a1c5a197a9).
We're talking about a webserver running on the same machine, which
imports the configuration files generated by the `matrix-nginx-proxy`
in the `/matrix/nginx-proxy/conf.d` directory.
Users who run an nginx webserver on some other machine will need to do
something different.
This gives us the possibility to run multiple instances of
workers that don't expose a port.
Right now, we don't support that, but in the future we could
run multiple `federation_sender` or `pusher` workers, without
them fighting over naming (previously, they'd all be named
something like `matrix-synapse-worker-pusher-0`, because
they'd all define `port` as `0`).
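
The naming conflict can be sketched as follows. This is a minimal illustration, assuming workers are now told apart by a per-type instance index rather than by their port. The function names are hypothetical and not taken from the playbook.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only (not the playbook's actual code).

# Port-based naming: the name is built from the worker type and its port.
worker_name_by_port() {
  local type="$1" port="$2"
  printf 'matrix-synapse-worker-%s-%s\n' "$type" "$port"
}

# Index-based naming (assumed scheme): type plus a per-type instance index.
worker_name_by_index() {
  local type="$1" index="$2"
  printf 'matrix-synapse-worker-%s-%s\n' "$type" "$index"
}

# Two pusher workers that expose no port both define `port` as 0,
# so port-based naming yields the same name twice.
worker_name_by_port pusher 0
worker_name_by_port pusher 0

# With an instance index, each worker gets its own name.
worker_name_by_index pusher 0
worker_name_by_index pusher 1
```
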
This leads to much easier management and potential safety
features (validation). In the future, we could try to avoid port
conflicts as well, but it didn't seem worth the effort to do it now.
Our port ranges seem large enough.
This can also pave the way for a "presets" feature
(similar to `matrix_nginx_proxy_ssl_presets`) which makes it even easier
for people to configure worker counts.
The quotes around "host" for both `--pid` and `--net` were
causing trouble for me:
> docker: --pid: invalid PID mode.
and:
> docker: Error response from daemon: network "host" not found.
I've also changed the `-v` call to `--mount` for consistency with the
rest of the playbook.
Also includes the dashboards for Synapse and for Node Exporter.
Again, this has only been tested on Debian amd64 so far, but the Grafana Docker image is also available for arm64 and arm32. Nice.
Basic system stats, to show things the Synapse metrics
can't show, such as resource usage by bridges, etc.
Seems to work fine as well.
This too has only been tested on Debian amd64 so far.
I felt that adding another variable was probably going to be the easiest way to do this. I may end up adding another variable to enable this feature, for consistency with some of the other things.
This passes any arguments given to 'matrix-postgres-cli' to the 'psql' command.
Examples:
$ # start an interactive shell connected to a given db
$ sudo matrix-postgres-cli -d synapse
$ # run a query, non-interactively
$ sudo matrix-postgres-cli -d synapse -c 'SELECT group_id FROM groups;'
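
The forwarding behaviour can be sketched in shell. This is not the actual script: `fake_psql` is a hypothetical stand-in so the example runs without Docker or Postgres, and `postgres_cli` is an illustrative name. The point is only that a quoted `"$@"` passes every argument through unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real command, so the sketch runs
# without Docker or psql being available.
fake_psql() {
  printf 'psql received %d argument(s)\n' "$#"
  printf '  [%s]\n' "$@"
}

# The wrapper forwards its arguments unchanged. Quoting "$@" keeps each
# argument intact, so a query containing spaces stays a single argument.
postgres_cli() {
  fake_psql "$@"
}

postgres_cli -d synapse -c 'SELECT group_id FROM groups;'
```

An unquoted `$@` would split the query on whitespace, which is why the quoted form matters for `-c`.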
If they do, our next playbook runs would simply revert it
and report "changed" for that task.
There's no benefit to letting the bridge spew a new config file.
This does not apply to the mautrix whatsapp bridge, because that one
is written in Go (not Python) and takes different flags. There's no
equivalent flag there.
Fixes a regression introduced in f6097fbba1, which was causing Synapse
to die with this error message:
> ValueError: sender_localpart needs characters which are not URL encoded.
These are just defensive cleanup tasks that we run.
In the good case, there's nothing to kill or remove, so they trigger an
error like this:
> Error response from daemon: Cannot kill container: something: No such container: something
and:
> Error: No such container: something
People often ask us if this is a problem, so instead of always having to
answer with "no, this is to be expected", we'd rather eliminate it now
and make logs cleaner.
In the event that:
- a container is really stuck and needs cleanup using kill/rm
- and cleanup fails, and we fail to report it because of error
suppression (`2>/dev/null`)
.. we'd still get an error when launching ("container name already in use .."),
so it shouldn't be too hard to investigate.
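
A minimal sketch of such a quiet cleanup step, assuming it runs through a shell. The function name and the container name are hypothetical, and the `|| true` is an addition of this sketch rather than something the commit describes.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the function name is hypothetical.

# Kill and remove a container if it exists, staying silent if it doesn't.
# `2>/dev/null` hides the "No such container" messages, and `|| true`
# keeps the expected failure from being treated as an error.
cleanup_container() {
  local name="$1"
  docker kill "$name" 2>/dev/null || true
  docker rm "$name" 2>/dev/null || true
}

cleanup_container "no-such-container-example"
echo "cleanup exit code: $?"
```

A container that is genuinely stuck would still surface later, when starting a new container under the same name fails.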
Not specifying bind addresses for the worker resulted in this warning:
> synapse.app - 47 - WARNING - None - Failed to listen on 0.0.0.0, continuing because listening on [::]
Additionally, metrics listening only on 127.0.0.1 seems like a no-op.
Only having it accessible from within the container is likely not what
we intend. Changed that to all interfaces as well.
Whether it actually gets exposed or not depends on the systemd service
and `matrix_synapse_workers_container_host_bind_address`.
This switches the `docker exec` method of spawning
Synapse workers inside the `matrix-synapse` container with
dedicated containers for each worker.
We also have dedicated systemd services for each worker,
which brings these benefits:
- more consistency with everything else (we don't use systemd
instantiated services anywhere)
- we don't need the "parse systemd instance name into worker name +
port" part
- we don't need to keep track of PIDs manually
- we don't need jq (fewer dependencies)
- workers dying would be restarted by systemd correctly, like any other
service
- `docker ps` shows each worker separately and we can observe resource
usage
We do this by creating one more layer of indirection.
First we reach some generic vhost handling matrix.DOMAIN.
A bunch of override rules are added there (capturing traffic to send to
ma1sd, etc). nginx-status and similar generic things also live there.
We then proxy to the homeserver on some other vhost (only Synapse being
available right now, but repointing this to Dendrite or other will be
possible in the future).
Then that homeserver-specific vhost does its thing to proxy to the
homeserver. It may or may not use workers, etc.
Without matrix-corporal, the flow is now:
1. matrix.DOMAIN (matrix-nginx-proxy/matrix-domain.conf)
2. matrix-nginx-proxy/matrix-synapse.conf
3. matrix-synapse
With matrix-corporal enabled, it becomes:
1. matrix.DOMAIN (matrix-nginx-proxy/matrix-domain.conf)
2. matrix-corporal
3. matrix-nginx-proxy/matrix-synapse.conf
4. matrix-synapse
(matrix-corporal gets injected at step 2).
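
The two flows above differ only in whether matrix-corporal is present as a hop. A small sketch that prints the chain for either case (`request_flow` is a hypothetical helper for illustration, not part of the playbook):

```shell
#!/usr/bin/env bash
# Illustrative only: prints the hops a request passes through.
request_flow() {
  local corporal_enabled="$1"
  local step=1
  local hop

  for hop in \
    "matrix.DOMAIN (matrix-nginx-proxy/matrix-domain.conf)" \
    "matrix-corporal" \
    "matrix-nginx-proxy/matrix-synapse.conf" \
    "matrix-synapse"
  do
    # matrix-corporal is only part of the chain when it is enabled.
    if [ "$hop" = "matrix-corporal" ] && [ "$corporal_enabled" != "true" ]; then
      continue
    fi
    printf '%d. %s\n' "$step" "$hop"
    step=$((step + 1))
  done
}

echo "Without matrix-corporal:"
request_flow false
echo "With matrix-corporal:"
request_flow true
```
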
This removes some `multi-target.wants` symlinks as well, etc.
But despite systemd saying:
> Removed symlink /etc/systemd/system/matrix-synapse.service.wants/matrix-synapse-worker@appservice:0.service
.. I still see such symlinks there for me for some reason, so keeping the
code (below) to find & delete them still seems like a good idea.
There was a `matrix_nginx_proxy_enabled|default(False)` check, but:
- it didn't seem to work reliably for some reason (hmm)
- referring to a `matrix_nginx_proxy_*` variable from within the
`matrix-synapse` role is not ideal
- exposing always happened on `127.0.0.1`, which may not be good enough
for some rarer setups (where the own webserver is external to the host)
I guess it didn't hurt to do it until now, but it's not great serving
federation APIs on the client-server API port, etc.
matrix-corporal doesn't work yet (still something to be solved in the
future), but its firewalling operations will also be sabotaged
by Client-Server APIs being served on the federation port (it's a way to get around its firewalling).
If we load it at runtime, during matrix-synapse role execution,
it's good enough for matrix-synapse and all roles after that,
but.. it breaks when someone uses `--tags=setup-nginx-proxy` alone.
The downside of including this vars file like this in `setup.yml`
is that the variables contained in it cannot be overridden by the user
(in their inventory's `vars.yml`).
... but it's not like overriding these variables was possible anyway
when including them at runtime.
Some people run Coturn or Jitsi, etc., by themselves and disable it
in the playbook.
Because the playbook is trying to be nice and clean up after itself,
it was deleting these Docker images.
However, people wish to pull and use them separately and would rather
they don't get deleted.
We could make this configurable for the sake of this special case, but
it's simpler to just avoid deleting these images.
It's not like this "cleaning things up" thing works anyway.
As time goes on, the playbook gets updated with newer image tags
and we leave so many images behind. If one doesn't run
`docker system prune -a` manually once in a while, they'd get swamped
with images anyway. Whether we leave a few images behind due to the lack
of this cleanup now is pretty much irrelevant.
We log everything in systemd/journald for every service already,
so there's no need for double-logging, bridges rotating log files
manually and other such nonsense.
In short, this makes Synapse a 2nd class citizen,
preparing for a future where it's just one-of-many homeserver software
options.
We also no longer have a default Postgres superuser password,
which improves security.
The changelog explains more as to why this was done
and how to proceed from here.
I had intentionally held it back in 39ea3496a4
until:
- it received more testing (there were a few bugs during the
migration, but now it seems OK)
- this migration guide was written
When administering, we occasionally invoke this script interactively with the "non-interactive" switch still in place, and then sit at the desk waiting 300 seconds for this timer to run out.
The systemd-timer already uses a 3h randomized delay for automatic renewals, which serves this purpose well.
The `mobile` branch got merged to `master`, which ends up becoming
`:latest`. It's a "rewrite" of the bridge's backend and only
supports a Postgres database.
We'd like to go back (well, forward) to `:latest`, but that will take
a little longer, because:
- we need to handle and document things for people still on SQLite
(especially those with external Postgres, who are likely on SQLite for
bridges)
- I'd rather test the new builds (and migration) a bit before
releasing them to others and possibly breaking their bridge
Brave ones who are already using the bridge with Postgres
can jump on `:latest` and report their experience.
Fixes https://github.com/spantaleev/matrix-docker-ansible-deploy/issues/756
Related to https://github.com/spantaleev/matrix-docker-ansible-deploy/issues/737
I feel like timers are somewhat more complicated and dirty (compared to
cronjobs), but they come with these benefits:
- log output goes to journald
- on newer systemd distros, you can see when the timer fired, when it
will fire, etc.
- we don't need to rely on cron (reducing our dependencies to just
systemd + Docker)
Cronjobs work well, but it's one more dependency that needs to be
installed. We were even asking people to install it manually
(in `docs/prerequisites.md`), which could have gone unnoticed.
Once in a while someone says "my SSL certificates didn't renew"
and it's likely because they forgot to install a cron daemon.
Switching to systemd timers means that installation is simpler
and more unified.
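
For readers unfamiliar with timers, here is a rough sketch of what a timer unit looks like, printed from a shell function. The function name, description and schedule values are hypothetical examples, not the playbook's actual templates. `OnCalendar=`, `RandomizedDelaySec=` and `WantedBy=timers.target` are standard systemd directives.

```shell
#!/usr/bin/env bash
# Sketch: prints a systemd timer unit to stdout.
# All values passed in below are illustrative assumptions.
render_timer_unit() {
  local description="$1" on_calendar="$2" randomized_delay="$3"
  cat <<EOF
[Unit]
Description=${description}

[Timer]
OnCalendar=${on_calendar}
RandomizedDelaySec=${randomized_delay}

[Install]
WantedBy=timers.target
EOF
}

render_timer_unit "Example certificate renewal timer" "*-*-* 04:00:00" "3h"
```

Once such a timer is installed and enabled, `systemctl list-timers` shows when it last fired and when it will fire next, and `journalctl -u` with the matching service name shows its log output.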
This reverts commit 2a25b63bb6.
Looking at other roles, we trigger building regardless of this.
It's better to always trigger it, because it's less fragile.
If the build fails and we only trigger it on "git changes"
then we won't trigger it for a while. That's not good.
Triggering it each and every time may seem like a waste,
but it supposedly runs quickly due to Docker caching.