Rotate JWT signing key for the better

ℹ️
On Tuesday morning (June 7, 2022), around 7am UTC+2, every Slite user was logged out.
This was the final step of the authentication security improvements we planned after we (re)discovered some bad practices we had been sitting on for too long.
👷
Disclaimer: as you surely know if you work in the product development industry, systems evolve a lot. The implementation designed on day 1 will be bent after a few years of use (and, progressively, misuse) to follow new feature requirements and so on.
This blog post talks about a small part of the evolution of our authentication process over the last 3 years.

Media access

What we call media is everything you can upload to your documents (images, PDFs, really any kind of file).

When you upload one, the upload goes directly to a Google bucket via a signed upload URL. After using the corresponding Google read URL directly for a while, we quickly decided to add a controlled layer on our side.

[Image: current media example on a private Google bucket]

After uploading the media on Google, we assign an internal ID to the media, save its read URL in our database, and attach it to the related document.

This way we can use a media proxy that translates the internal ID into its Google read URL. Since we keep control of the HTTP request to the media, we can easily improve the backend part.

It also allows us to track all the media, which helps, for example, during the organization deletion process (for GDPR compliance).

Permissions

At that time, security relied on unguessable random IDs, and it was good enough.

One of the improvements we wanted to make was to add permission checks on media access: if you have permission to read the document, then you have permission to download its media, and so we generate a Google signed read URL for it.
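To make the rule concrete, here is a minimal sketch of such a proxy lookup. The in-memory dictionaries, the `resolve_media` function and the example URL are all hypothetical stand-ins, not Slite's actual code, and the step that would produce a signed read URL is only marked by a comment.

```python
from typing import Optional

# Hypothetical in-memory stand-ins for the database tables.
MEDIA = {
    "media-1": {
        "document_id": "doc-1",
        "read_url": "https://storage.example/media-1",
    }
}
DOCUMENT_READERS = {"doc-1": {"alice"}}


def resolve_media(media_id: str, user_id: str) -> Optional[str]:
    """Return the read URL for a media, or None if access is denied."""
    media = MEDIA.get(media_id)
    if media is None:
        return None
    # The media inherits the read permission of the document it belongs to.
    readers = DOCUMENT_READERS.get(media["document_id"], set())
    if user_id not in readers:
        return None
    # A real proxy would generate a short-lived signed read URL here.
    return media["read_url"]
```

With this shape, the client only ever sees the internal ID, and the decision to hand out a read URL stays on the server.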

Clear, simple, obvious, but...

Export

We also implemented PDF export, and this feature relied on the fact that the media were publicly accessible. Indeed, if you share the PDF with someone, you surely want them to be able to open the attached media when they click on a media URL.

At this point, other topics took priority, and we decided to move forward and keep our system protected by the random IDs for a bit longer.

Session token

Our session system relies on an HS256 JWT, saved in local storage so the front-end can pass it in the Authentication header of GraphQL queries. We also keep a cookie synchronized with it, to be used by other requests such as media access.

As an extra use, for some mobile and desktop app purposes we needed a way to pass the session token via the URL (when we need to perform some action in the browser, outside of the app, or in webviews).
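For readers unfamiliar with HS256, the following is a minimal sketch of how such a token is signed and verified, using only the Python standard library. The function names and the example claims are illustrative assumptions; a production system would normally rely on a maintained JWT library rather than hand-rolled code.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, key: bytes) -> str:
    """Build a compact HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("utf-8")
    signature = hmac.new(key, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(signature)}"


def verify_hs256(token: str, key: bytes) -> Optional[dict]:
    """Return the claims if the signature matches, else None."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header, payload, signature = parts
    signing_input = f"{header}.{payload}".encode("utf-8")
    expected = b64url(hmac.new(key, signing_input, hashlib.sha256).digest())
    # Constant-time comparison on bytes to avoid timing leaks.
    if not hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8")):
        return None
    # Only parse the content once the signature has been checked.
    if json.loads(b64url_decode(header)).get("alg") != "HS256":
        return None
    return json.loads(b64url_decode(payload))
```

Because HS256 is symmetric, anyone holding the signing key can mint valid tokens, and anyone holding a valid token can act as its owner until it expires. That is why where the token travels matters so much in the rest of this story.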

Session refresh and expiration

We started with a simple unique expiration and refresh system:

  • A token is refreshed if at least two hours elapsed since it was generated.
  • A token expires after 7 days.

Consequently, if you don't open Slite for 7 days in a row, you'll have to sign in again.

Yet when implementing JIT user provisioning via OAuth, we needed a way to follow the external authentication provider's instructions regarding the token we deliver. Indeed, if you disable a user in your provider, you don't want them to keep access to the corresponding Slite organization.

That was the starting point for having several kinds of session tokens. The one produced during the OAuth flow expires at exactly the same time as the access_token the provider gives us. A special claim tells us how to proceed for the refresh: in this case, if the token is expired, we use the saved refresh_token the provider gave us during the OAuth flow to request a new access_token, and so on.
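The two rules above can be sketched as a small decision function. The constant and function names are hypothetical; only the two-hour and 7-day thresholds come from the text.

```python
REFRESH_AFTER_SECONDS = 2 * 60 * 60        # refresh once 2 hours have elapsed
EXPIRES_AFTER_SECONDS = 7 * 24 * 60 * 60   # expire after 7 days


def session_state(issued_at: int, now: int) -> str:
    """Classify a token by its age: 'valid', 'refresh' or 'expired'."""
    age = now - issued_at
    if age >= EXPIRES_AFTER_SECONDS:
        return "expired"   # the user has to sign in again
    if age >= REFRESH_AFTER_SECONDS:
        return "refresh"   # still accepted, but a new token is issued
    return "valid"
```

Under this policy a single successful refresh at any point in the 7-day window is enough to keep the session alive.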

The thing is that the expiration time for those tokens is 1 hour (even if they can be refreshed months after expiration, as long as the refresh_token is still accepted by the external authentication provider).

Previously, our authentication system only needed one successful refresh at any time within the 7 days.

With this new token, we really need the refresh process to work perfectly...

After struggling a bit with desynchronization between the cookie and local storage, we decided to refresh the token in local storage the same way a cookie is refreshed in a browser:

When doing a GraphQL query, the response can contain a custom header with a refreshed token (just like the set-cookie header).
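On the client side, this can be sketched as follows. The header name `x-refreshed-token`, the `session_token` storage key and the function name are assumptions made for the example; the source does not name them.

```python
REFRESH_HEADER = "x-refreshed-token"  # hypothetical header name


def apply_refreshed_token(storage: dict, response_headers: dict) -> bool:
    """Store the refreshed token if the response carries one.

    Returns True when the stored token was replaced.
    """
    for name, value in response_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REFRESH_HEADER and value:
            storage["session_token"] = value
            return True
    return False
```

Running this after every GraphQL response keeps the stored token fresh without any dedicated refresh request.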

Mobile experience

The mobile usage of Slite is not exactly the same as the app usage. People may use it less frequently, and a 7-day expiration was too short. So we decided to set it to 30 days (for the OAuth token, it stays at 1 hour with potentially "infinite" refresh, as explained before).

Mobile also has a special requirement: the document editor is loaded in a webview, so it doesn't share the cookie, and it still needs a session to load media images (if we decide at some point to enforce permission checks on media).

To fix this, the session token is passed to the webview and added to the media URL.

And this method also ended up being used in the PDF export, as we load the same document editor code to render the document properly...

...and in the darkness bind them

💥
Here we have the full combination of small features, changes, shortcuts, and bad practices that could have led us to something critical.

A few months ago, when we decided to take a fresh look at the media access permissions problem, we rediscovered that the PDF export was blocking this improvement. We also rediscovered that session tokens were saved in the PDF URLs whenever the document contained non-image media!

At this point we connected the dots and acknowledged that all the small adjustments we had made over the years led to this quite critical problem:

If someone exports a document (containing non-image media) as a PDF and shares it with someone else, this second person (with a bit of technical knowledge) could extract the session token from the PDF and start impersonating the first user...

Fixes

Scoped token

To solve this now urgent problem, we decided to create new kinds of tokens with various scopes.

One of these scopes is the "document" token. It is a non-renewable token (it expires after 30 days), contains the document ID, and gives access only to the media of the corresponding document.

This way the PDF still allows people to download the media linked inside (for at most 30 days after the PDF generation). But that's it, no more potential impersonation.
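A minimal sketch of what checking such a token could look like, once its signature has been verified. The claim names (`scope`, `document_id`, `exp`) and both function names are hypothetical; only the 30-day lifetime and the document restriction come from the text.

```python
THIRTY_DAYS = 30 * 24 * 60 * 60


def make_document_claims(document_id: str, now: int) -> dict:
    """Claims for a non-renewable token limited to one document's media."""
    return {
        "scope": "document",
        "document_id": document_id,
        "exp": now + THIRTY_DAYS,
    }


def can_read_media(claims: dict, media_document_id: str, now: int) -> bool:
    """Allow access only to media attached to the token's own document."""
    if claims.get("scope") != "document":
        return False
    if now >= claims.get("exp", 0):
        return False  # expired, and there is no refresh path for this scope
    return claims.get("document_id") == media_document_id
```

Even if such a token leaks, it exposes the media of a single document for a bounded time, not a user session.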

We also use this token in the mobile app, so passing the token in GET requests is far less critical than before (in case it leaks into some logs or to third parties).

We checked our logs and didn't find any evidence that this had been exploited, but we decided to refuse any token generated before the date of the fix, just in case someone finds out.
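Refusing older tokens can be sketched as a cutoff on the issue time. This assumes the token carries the standard `iat` (issued at) claim, which the source does not confirm; `issued_after_cutoff` is a hypothetical name, and the cutoff is passed in because the actual date of the fix is not given here.

```python
def issued_after_cutoff(claims: dict, cutoff: int) -> bool:
    """Accept only tokens issued at or after the cutoff (Unix seconds)."""
    issued_at = claims.get("iat")
    if not isinstance(issued_at, (int, float)):
        return False  # no usable issue time: treat as untrusted
    return issued_at >= cutoff
```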

Self ban claim

To go further, as we also pass some tokens in GET requests during authentication and with email magic links, we wanted to replace the session token with a one-time unique token.

This solution came with a bunch of problems:

  • It would force us to implement a new mechanism in the app, so we would have to wait for mobile adoption before being able to get rid of the old mechanism.
  • It could trigger issues on unstable networks, as the unique token may be lost on the way (and we don't want to impact sign-in stability).

Instead of a unique token, we added a new claim in the token: self ban.

This claim contains a number of seconds (we started with 30) to wait, counted from the first time we see the token, before banning it.

We already have a working refresh mechanism, so each time we see this special token we deliver a new one and ban the old one 30 seconds later.

This way, a slow network or parallel queries will continue to work, and nothing has to change on the front-end side.
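The mechanism can be sketched with an in-memory registry. The class, its method and the token identifiers are hypothetical, and a real deployment would need a store shared between servers; only the "ban N seconds after first sight" behavior comes from the text.

```python
class SelfBanRegistry:
    """Tracks when each self-banning token stops being accepted."""

    def __init__(self) -> None:
        self._ban_at: dict = {}

    def accept(self, token_id: str, self_ban_seconds: int, now: float) -> bool:
        """Return True while the token is still within its grace period.

        The first sighting starts the countdown; later sightings reuse it.
        """
        ban_at = self._ban_at.setdefault(token_id, now + self_ban_seconds)
        return now < ban_at


registry = SelfBanRegistry()
first = registry.accept("token-a", 30, now=0)    # first sighting: accepted
retry = registry.accept("token-a", 30, now=10)   # parallel or slow request: accepted
late = registry.accept("token-a", 30, now=30)    # grace period over: refused
```

Compared with a strict one-time token, the grace period trades a short replay window for tolerance to retries and concurrent requests.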

JWT signing key rotation

Finally, as mentioned in the first paragraph of this article, to be extra safe, we decided to rotate the JWT signing key.

If some bad actor managed to get hold of a token because of our old bad practices and managed to keep their session alive, this final step will have flushed them out.
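Why rotation logs everyone out follows directly from how HMAC signatures work, as this small sketch shows. The keys and payload are made up for the example.

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> bytes:
    """HMAC-SHA256 signature, the primitive behind HS256."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def is_valid(payload: bytes, signature: bytes, key: bytes) -> bool:
    """Check a signature against the key the server currently trusts."""
    return hmac.compare_digest(sign(payload, key), signature)


old_key = b"old-signing-key"   # example values only
new_key = b"new-signing-key"
payload = b'{"sub":"user-1"}'

signature_before_rotation = sign(payload, old_key)

valid_before = is_valid(payload, signature_before_rotation, old_key)
valid_after = is_valid(payload, signature_before_rotation, new_key)
```

Once the server only trusts the new key, every token signed with the old one fails verification, whether it belongs to a legitimate user or to an attacker. That is the logout everyone experienced.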

Media access permissions

Of course, all those fixes also allowed us to finish enforcing media access permissions. Every uploaded media is now accessible only if you also have access to the related document.

Conclusion

This was a bumpy ride, but over the past months we greatly improved Slite's overall security without degrading the UX.

It taught us that the little things we keep in our backlog can chain together at some point to create something worse, so we should keep an eye on those small creatures!

Anyway, we are now ready for our new chapter: the Slite Discussions!

Feel free to reach out to us at security@slite.com; we are super happy to share our problems and experiences. We hope you now understand what happened and will continue to trust us. We are building Slite with our heart and are ready to share our pains and our successes!