DMTN-225: User metadata for the Science Platform

  • Russ Allbery

Latest Revision: 2022-05-26

1 Abstract

The Rubin Science Platform will store various metadata about each user, either created by the Science Platform (such as some identifiers) or collected from the relevant identity provider. This document describes the metadata associated with users and its sources and constraints, such as numeric ranges for UIDs and GIDs and valid patterns for usernames and group names.

This document is divided into three sections, one for the IDF and CDF, one for Telescope and Site deployments, and one for the USDF.

2 IDF and CDF

2.1 User metadata

We expect to add additional metadata, such as whether a user has accepted the Acceptable Use Policy, before the production release. This tech note will be updated when we add additional metadata.

2.1.1 Username

Source: User chooses their (unique) username during enrollment.

Storage: In COmanage, it is stored as an identifier associated with the user’s record. CILogon provides a unique opaque identifier during authentication. The user’s LDAP record is then retrieved via a search for that identifier, and the username is taken from the uid field of that LDAP record. It is then also stored in Redis as data associated with each authentication token.

Constraints: Must consist solely of lowercase ASCII letters, numbers, and dash (-). Must be at least two characters long. Must contain at least one lowercase ASCII letter. Must not start or end with a dash. Must not contain two consecutive dashes. [1] Usernames for bot users (users created for automated processes or services, not for human users) must begin with bot-.
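As an illustration only, the username constraints above could be checked with a short Python function like the following. The function names and the regular expression are hypothetical and are not taken from the Science Platform code; the sketch encodes only the rules listed in this section.

```python
import re

# Runs of lowercase ASCII letters or digits, separated by single dashes.
# This shape rules out a leading dash, a trailing dash, and consecutive dashes.
_USERNAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_username(username: str) -> bool:
    """Return True if username satisfies the constraints listed above."""
    if len(username) < 2:
        return False
    if _USERNAME_RE.fullmatch(username) is None:
        return False
    # At least one lowercase ASCII letter is required.
    return any("a" <= ch <= "z" for ch in username)


def is_valid_bot_username(username: str) -> bool:
    """Bot usernames must also begin with the bot- prefix."""
    return username.startswith("bot-") and is_valid_username(username)
```

For example, is_valid_username("rra") is true, while is_valid_username("12"), is_valid_username("a--b"), and is_valid_username("-ab") are all false.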

2.1.2 Numeric UID

Source: Assigned by the Science Platform on first use of an account. All production and integration IDF and CDF Science Platform deployments share the same UID assignment pool and map the same CILogon identity to the same UID. Development deployments may use a different UID assignment pool and therefore not use the same UIDs.

Storage: One document per user is stored in Google Firestore. This currently contains only the UID, but in the future may contain other metadata maintained by the Science Platform rather than COmanage and CILogon. As an optimization, since the UID never changes, it is also stored as data associated with each authentication token in Redis.

Constraints: See UID and GID assignment. UIDs are unique and are intended to never be reused. Once assigned, the UID for a given account never changes, even if the username is changed.

2.1.3 Full name

Source: Taken from the user’s federated identity provider during enrollment. The user may choose to enter a new name. Insofar as possible, the Rubin Science Platform will only record the user’s entire name of choice as a single text field, not divided into components such as given name and family name. COmanage currently does not properly support this, and may represent the name in components, but the Science Platform will attempt to use the combined form only.

Storage: The names from each associated federated identity are stored in COmanage, along with any name the user chooses to enter. Whatever name they choose as primary is stored as the displayName attribute in the user’s LDAP record as maintained by COmanage, and is retrieved from there by the Science Platform using the token API.

Constraints: Any valid UTF-8 string of reasonable length without control characters. No assumptions are made about the structure of the name.
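A minimal sketch of this check in Python follows. The text does not define "reasonable length", so the max_length default of 256 is an assumption made for illustration, as is rejecting the empty string; the function name is hypothetical.

```python
import unicodedata


def is_acceptable_full_name(name: str, max_length: int = 256) -> bool:
    """Check a full name against the constraints above.

    max_length and the rejection of empty names are assumptions of this
    sketch, not values taken from the Science Platform.
    """
    if not name or len(name) > max_length:
        return False
    try:
        # A Python str can hold lone surrogates, which are not valid UTF-8.
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    # Unicode general category Cc covers the C0 and C1 control characters.
    return all(unicodedata.category(ch) != "Cc" for ch in name)
```

The check deliberately does not split the name or look for given and family name parts, in keeping with the rule that no structure is assumed.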

2.1.4 Email address

Source: Taken from the user’s federated identity provider during enrollment. The user may choose to enter a new email address.

Storage: The email addresses from each associated federated identity are stored in COmanage, along with any email the user chooses to enter. Whatever email address they choose as primary is stored as the mail attribute in the user’s LDAP record as maintained by COmanage, and is retrieved from there by the Science Platform using the token API.

Constraints: Must be a syntactically-valid RFC 5322 addr-spec. COmanage will confirm the validity of the email address during enrollment by sending the user an email and having them follow a link in the email.

2.1.5 Group membership

Source: COmanage records the user’s group membership (except for their default group). Users are added to groups by group owners, and may also be added by automated rules triggered by their affiliation data. Each user is also automatically a member of a default group with the same name as their username.

Storage: COmanage stores the user’s group membership information and provides it in the LDAP server it maintains, as member attributes in a groups tree. Group membership information is retrieved from LDAP each time it is needed. However, be aware that the scopes of an authentication token are calculated from the group membership at the time of initial user authentication and are not affected by subsequent changes to the user’s group membership until that token expires.

Constraints: There is no inherent limit on the number of groups a user may be a member of, but be aware that NFS only allows a user to be a member of 16 groups, one of which is the user’s default group. Group memberships beyond 16 may be ignored by the NFS server.

2.2 Group metadata

2.2.1 Group name

(The rules below apply only to additional groups. The user’s default group has the same name as the username and the same GID as the user’s UID.)

Source: Groups are named in COmanage when they are created.

Storage: Group names are stored where user group membership is stored.

Constraints: All group names must begin with g_. Group names must consist of lowercase ASCII letters and numbers, period (.), dash (-), and underscore (_), and must be at most 32 characters long. [2]
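The group name rules above can be expressed as a single pattern plus a length check. The following Python sketch is illustrative only; the function name is hypothetical, and because the text states no minimum length, the bare prefix g_ is accepted here.

```python
import re

# The g_ prefix followed by any number of the permitted characters.
_GROUP_NAME_RE = re.compile(r"g_[a-z0-9._-]*")

MAX_GROUP_NAME_LENGTH = 32


def is_valid_group_name(name: str) -> bool:
    """Return True if name satisfies the group name constraints above."""
    if len(name) > MAX_GROUP_NAME_LENGTH:
        return False
    return _GROUP_NAME_RE.fullmatch(name) is not None
```

Under these rules is_valid_group_name("g_science.team-1") is true, while is_valid_group_name("science") and is_valid_group_name("g_Users") are false.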

2.2.2 Numeric GID

Source: Assigned by the Science Platform on first use of a group. All production and integration IDF and CDF Science Platform deployments share the same GID assignment pool and map the same COmanage group to the same GID. Development deployments may use a different GID assignment pool and therefore not use the same GIDs.

Storage: One document per group is stored in Google Firestore. This currently contains only the GID, but in the future may contain other metadata maintained by the Science Platform rather than COmanage and CILogon.

Constraints: See UID and GID assignment. GIDs are unique and are intended to never be reused. Once assigned, the GID for a given group never changes, even if the group name is changed.

2.3 UID and GID assignment

The Science Platform uses a POSIX file system for some storage. Access control in that file system is done via numeric UIDs and GIDs. Each user must therefore be assigned a unique UID, and each group must be assigned a unique GID.

Each user must also have a default group. Following the now-standard Linux convention, that default group will have the same name as the user and will contain only the user. That group must also have a unique GID.

For convenience, the GID of the user’s default group will always match the user’s UID.

The Science Platform requires support for at least 31-bit UIDs and GIDs and makes no attempt to support platforms with 16-bit UIDs or GIDs. We can therefore take advantage of the increased UID and GID space up to 2,147,483,647.

UID and GID space is divided into the following ranges:

0-99

Reserved for the container operating system.

100-999

Reserved for users created by packages installed in containers, and for the use of some containers that use default UIDs in the high 900s.

1000-99999

Reserved for users created inside the container image. Most containers use UID 1000 as a default user. Note that 65534 is reserved by the operating system.

100000-199999

UIDs for bot users and the corresponding GID for the bot user’s default group.

200000-299999

GIDs for groups other than the user’s default group.

300000-999999

User UIDs and the corresponding GID for the user’s default group.

1000000-2147483647

Reserved for future use.
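The ranges above can be written as a small lookup table. The Python sketch below is illustrative; the names are hypothetical, and it takes the container-image range to end at 99999 so that it does not overlap the ranges that start at 100000.

```python
# (first, last, purpose) for each range, in ascending order.
ID_RANGES = [
    (0, 99, "container operating system"),
    (100, 999, "package-created users in containers"),
    (1000, 99999, "users created inside the container image"),
    (100000, 199999, "bot user UIDs and default-group GIDs"),
    (200000, 299999, "GIDs for non-default groups"),
    (300000, 999999, "user UIDs and default-group GIDs"),
    (1000000, 2147483647, "reserved for future use"),
]


def classify_id(value: int) -> str:
    """Return the purpose of the range that contains a UID or GID."""
    for first, last, purpose in ID_RANGES:
        if first <= value <= last:
            return purpose
    raise ValueError(f"{value} is outside the supported UID and GID space")
```

For example, classify_id(100000) falls in the bot range, classify_id(250000) in the non-default group range, and classify_id(300000) in the user range.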

UIDs and GIDs are assigned on first use of a given user or group in any Science Platform deployment that shares the same UID and GID assignment database. We expect to sometimes want to mount the same POSIX file system on multiple deployments, so the same UID and GID assignment store will be shared by all production and integration deployments (but possibly not by development deployments).

Once a given UID or GID has been used, it will never be reused for a different user or group.
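The assign-on-first-use and never-reuse properties can be illustrated with a small in-memory allocator. This Python sketch is a stand-in under stated assumptions, not the Science Platform implementation: the real assignments are stored in Google Firestore, and the class name, method names, and key arguments here are all hypothetical. The keys are meant to be stable identifiers for an account or group, not names that might later change.

```python
class IdAllocator:
    """In-memory sketch of first-use UID and GID assignment."""

    # (first, last) of the range for each kind, from the range list above.
    _RANGES = {
        "bot": (100000, 199999),
        "group": (200000, 299999),
        "user": (300000, 999999),
    }

    def __init__(self) -> None:
        # Counters only ever increase, so an ID is never handed out twice.
        self._next = {kind: first for kind, (first, _) in self._RANGES.items()}
        self._assigned: dict[tuple[str, str], int] = {}

    def _allocate(self, kind: str, key: str) -> int:
        if (kind, key) in self._assigned:
            return self._assigned[(kind, key)]
        value = self._next[kind]
        if value > self._RANGES[kind][1]:
            raise RuntimeError(f"{kind} ID range is exhausted")
        self._next[kind] = value + 1
        self._assigned[(kind, key)] = value
        return value

    def uid_for(self, account_key: str, is_bot: bool = False) -> int:
        """Return the UID, which is also the GID of the default group."""
        return self._allocate("bot" if is_bot else "user", account_key)

    def gid_for(self, group_key: str) -> int:
        """Return the GID of a group other than a default group."""
        return self._allocate("group", group_key)
```

Calling uid_for twice with the same key returns the same UID, and a shared instance of such a store is what lets several deployments agree on the IDs used in a shared file system.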

COmanage does support assigning UIDs and GIDs, but the configuration complexity required is higher, and our assignment needs are a somewhat awkward fit for COmanage’s capabilities. We therefore will do UID and GID assignment independently of COmanage.

3 Telescope and Site

Currently, Telescope and Site deployments use GitHub for authentication. It’s possible that the summit deployment will switch to a local identity provider at some point in the future to allow for access while the summit is disconnected from the Internet. If this happens, it will likely switch to a model like the USDF as described below.

3.1 User metadata

3.1.1 Username

Source: The user’s GitHub username converted to all lowercase.

Storage: The username is used as a unique key for the user in all identity management systems.

Constraints: Must consist solely of lowercase ASCII letters, numbers, and dash (-), must not start or end with a dash, and must not contain two consecutive dashes. [3] Must not consist entirely of numbers.

3.1.2 Numeric UID

Source: UID assigned by GitHub. For bot users that do not exist in GitHub, we make up a UID when an authentication token for the bot user is created and hope it doesn’t conflict with a meaningful GitHub user.

Storage: Stored as data associated with each token in Redis.

Constraints: Whatever constraints are used by GitHub to assign UIDs.

3.1.3 Full name

Source: Taken from the GitHub account metadata.

Storage: Stored as data associated with each token in Redis.

Constraints: Any valid UTF-8 string of reasonable length without control characters. No assumptions are made about the structure of the name.

3.1.4 Email address

Source: Taken from the GitHub account metadata.

Storage: Stored as data associated with each token in Redis.

Constraints: Whatever constraints are used by GitHub when adding email addresses to an account.

3.1.5 Group membership

Source: Derived from GitHub organization and team memberships.

Storage: Determined during authentication with GitHub API calls and stored as data associated with each token in Redis.

Constraints: There is no inherent limit on the number of groups a user may be a member of, but be aware that NFS only allows a user to be a member of 16 groups, one of which is the user’s default group. Group memberships beyond 16 may be ignored by the NFS server.

3.2 Group metadata

3.2.1 Group name

(The rules below apply only to additional groups. The user’s default group has the same name as the username.)

Source: Each team that the user is a member of corresponds to one group. The name of the group is the lowercase form of the organization, a dash (-), and the “slug” of the team as retrieved from the GitHub API. If the resulting group name is longer than 32 characters, it is truncated at 25 characters and the first six characters of a hash of the full name are appended.
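The naming rule above can be sketched in Python as follows. The text does not say which hash is used or how it is encoded, so SHA-256 with a hexadecimal digest is purely an assumption of this sketch, and the function name is hypothetical.

```python
import hashlib

MAX_LENGTH = 32     # longest group name that is used unmodified
PREFIX_LENGTH = 25  # characters kept from an over-long name
HASH_LENGTH = 6     # characters of the hash that are appended


def github_group_name(organization: str, team_slug: str) -> str:
    """Build a group name from a GitHub organization and team slug."""
    full_name = f"{organization.lower()}-{team_slug}"
    if len(full_name) <= MAX_LENGTH:
        return full_name
    # Assumption: SHA-256 hex digest; the hash is not specified in the text.
    digest = hashlib.sha256(full_name.encode("utf-8")).hexdigest()
    return full_name[:PREFIX_LENGTH] + digest[:HASH_LENGTH]
```

For example, github_group_name("lsst-sqre", "square") returns "lsst-sqre-square" unchanged, while an over-long name is shortened to its first 25 characters plus a six-character suffix that depends on the full name, so two long names with the same prefix still map to different groups in most cases.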

Storage: Group names are stored where user group membership is stored.

Constraints: Group names must consist of lowercase ASCII letters and numbers, period (.), dash (-), and underscore (_), must begin with a letter, and must be at most 32 characters long.

3.2.2 Numeric GID

Source: The team ID from GitHub.

Storage: Stored as data associated with each token in Redis.

Constraints: Whatever constraints GitHub uses to assign team IDs.

4 USDF

This section is still preliminary, since the SLAC USDF is not yet complete. Some of the details may change before the facility is operational.

4.1 User metadata

4.1.1 Username

Source: The value of the sub claim in the ID token returned by the OpenID Connect authentication protocol.

Storage: Stored as data associated with each token in Redis.

Constraints: Must consist solely of lowercase ASCII letters, numbers, and dash (-), must not start or end with a dash, and must not contain two consecutive dashes. [4] Must not consist entirely of numbers.

4.1.2 Numeric UID

Source: The uidNumber attribute of the user’s record in LDAP.

Storage: Stored as data associated with each token in Redis.

Constraints: Whatever constraints are used by the local identity management system that populates LDAP.

4.1.3 Full name

Source: The displayName attribute of the user’s record in LDAP.

Storage: Retrieved from LDAP when needed and not stored locally in the Science Platform.

Constraints: Whatever constraints are used by the local identity management system that populates LDAP. No assumptions are made about the structure of the name.

4.1.4 Email address

Source: The mail attribute of the user’s record in LDAP.

Storage: Retrieved from LDAP when needed and not stored locally in the Science Platform.

Constraints: Whatever constraints are used by the local identity management system that populates LDAP.

4.1.5 Group membership

Source: All groups in LDAP for which the user is listed as a member. Unlike the other deployments, the USDF deployment does not put the user in a default group with the same name as their username.

Storage: Retrieved from LDAP when needed and not stored locally in the Science Platform. However, be aware that the scopes of an authentication token are calculated from the group membership at the time of initial user authentication and are not affected by subsequent changes to the user’s group membership until that token expires.

Constraints: There is no inherent limit on the number of groups a user may be a member of, but be aware that NFS only allows a user to be a member of 16 groups, one of which is the user’s default group. Group memberships beyond 16 may be ignored by the NFS server.

4.2 Group metadata

4.2.1 Group name

Source: The cn attribute of the LDAP record for the group.

Storage: Retrieved from LDAP when needed and not stored locally in the Science Platform.

Constraints: Group names must consist of ASCII letters (upper- or lowercase) and numbers, period (.), dash (-), and underscore (_), must begin with a letter, and must be at most 32 characters long.

4.2.2 Numeric GID

Source: The gidNumber attribute of the LDAP record for the group.

Storage: Retrieved from LDAP when needed and not stored locally in the Science Platform.

Constraints: Whatever constraints are used by the local identity management system that populates LDAP.