- Business Model
- Data Essentials
- Infrastructure Monitoring
- Virtualization and Externalization
- What virtualization requirements have been defined for the application?
- If applicable, what approach(es) to cloud computing will be taken (Managed Hosting versus “Pure” Cloud, a “full machine” approach such as AWS-EC2 versus a “hosted database” approach such as AWS-RDS and Azure, etc)?
- What frameworks and programming languages have been used to create the application?
- What process, code, or infrastructure dependencies have been defined for the application?
- What databases and application servers support the application?
- How will database connection strings, encryption keys, and other sensitive components be stored, accessed, and protected from unauthorized detection?
- Data Processing
- What data entry paths does the application support?
- What data output paths does the application support?
- How does data flow across the application’s internal components?
- What data input validation requirements have been defined?
- What data does the application store and how?
- What data is or may need to be encrypted and what key management requirements have been defined?
- What capabilities exist to detect the leakage of sensitive data?
- What encryption requirements have been defined for data in transit - including transmission over WAN, LAN, SecureFTP, or publicly accessible protocols such as http: and https:?
- What user privilege levels does the application support?
- What user identification and authentication requirements have been defined?
- What user authorization requirements have been defined?
- What session management requirements have been defined?
- What access requirements have been defined for URI and Service calls?
- Application Monitoring
The following security review of the Geo feature set focuses on security aspects of the feature as they apply to customers running their own GitLab instances. The review questions are based in part on the OWASP Application Security Verification Standard Project from owasp.org.
- This varies by customer. Geo allows customers to deploy to multiple areas, and they get to choose where they are.
- Region and node selection is entirely manual.
- Geo streams almost all data held by a GitLab instance between sites. This includes full database replication, most files (user-uploaded attachments, etc) and repository + wiki data. In a typical configuration, this will happen across the public Internet, and be TLS-encrypted.
- PostgreSQL replication is TLS-encrypted.
- See also: only TLSv1.2 should be supported
- GitLab’s model of sensitivity is centered around public vs. internal vs. private projects. Geo replicates them all indiscriminately. “Selective sync” exists for files and repositories (but not database content), which would permit only less-sensitive projects to be replicated to a secondary node if desired.
- See also: GitLab data classification policy.
- Geo is designed to provide replication of a certain subset of the application data. It is part of the solution, rather than part of the problem.
- Secondary nodes are created in regions that are distant (in terms of Internet latency) from the main GitLab installation (the primary node). They are intended to be used by anyone who would ordinarily use the primary node, who finds that the secondary node is closer to them (in terms of Internet latency).
- Secondary nodes provide all the interfaces a primary node does (notably a HTTP/HTTPS web application, and HTTP/HTTPS or SSH Git repository access), but is constrained to read-only activities. The principal use case is envisioned to be cloning Git repositories from the secondary node in favor of the primary node, but end-users may use the GitLab web interface to view projects, issues, merge requests, snippets, etc.
- The replication process must be secure. It would typically be unacceptable to transmit the entire database contents or all files and repositories across the public Internet in plaintext, for instance.
- Secondary nodes must have the same access controls over its content as the primary node - unauthenticated users must not be able to gain access to privileged information on the primary node by querying the secondary node.
- Attackers must not be able to impersonate the secondary node to the primary node, and thus gain access to privileged information.
- Nothing Geo-specific. Any user where
admin: trueis set in the database is considered an admin with super-user privileges.
- See also: more granular access control (not geo-specific)
- Much of Geo’s integration (database replication, for instance) must be configured with the application, typically by system administrators.
- Secondary nodes may be added, modified, or removed by users with administrative access.
- The replication process may be controlled (start/stop) via the Sidekiq administrative controls.
- Geo requires the primary node and secondary node to be able to communicate with each other across a TCP/IP network. In particular, the secondary nodes must be able to access HTTP/HTTPS and PostgreSQL services on the primary node.
- Varies from customer to customer.
- Maximum replication speeds between primary node and secondary node is limited by the available bandwidth between sites. No hard requirements exist - time to complete replication (and ability to keep up with changes on the primary node) is a function of the size of the data set, tolerance for latency, and available network capacity.
- Customers choose their own networks. As sites are intended to be geographically separated, it is envisioned that replication traffic will pass over the public Internet in a typical deployment, but this is not a requirement.
- Geo imposes no additional restrictions on operating system (see the GitLab installation page for more details), however we recommend using the operating systems listed in the Geo documentation.
- The supported installation method (Omnibus) packages most components itself.
- There are significant dependencies on the system-installed OpenSSH daemon (Geo requires users to set up custom authentication methods) and the omnibus or system-provided PostgreSQL daemon (it must be configured to listen on TCP, additional users and replication slots must be added, etc).
- The process for dealing with security updates (for example, if there is a significant vulnerability in OpenSSH or other services, and the customer wants to patch those services on the OS) is identical to the non-Geo situation: security updates to OpenSSH would be provided to the user via the usual distribution channels. Geo introduces no delay there.
- None specific to Geo.
- None specific to Geo.
- None specific to Geo.
- Nothing Geo-specific, but everything in GitLab needs to have full functionality in such an environment.
- GitLab is “cloud native” and this applies to Geo as much as to the rest of the product. Deployment in clouds is a common and supported scenario.
If applicable, what approach(es) to cloud computing will be taken (Managed Hosting versus “Pure” Cloud, a “full machine” approach such as AWS-EC2 versus a “hosted database” approach such as AWS-RDS and Azure, etc)?
- To be decided by our customers, according to their operational needs.
- Ruby on Rails, Ruby.
- Nothing specific to Geo.
- PostgreSQL >= 9.6, Redis, Sidekiq, Unicorn.
How will database connection strings, encryption keys, and other sensitive components be stored, accessed, and protected from unauthorized detection?
- There are some Geo-specific values. Some are shared secrets which must be
securely transmitted from the primary node to the secondary node at setup time. Our
documentation recommends transmitting them from the primary node to the system
administrator via SSH, and then back out to the secondary node in the same manner.
In particular, this includes the PostgreSQL replication credentials and a secret
db_key_base) which is used to decrypt certain columns in the database. The
db_key_basesecret is stored unencrypted on the filesystem, in
/etc/gitlab/gitlab-secrets.json, along with a number of other secrets. There is no at-rest protection for them.
- Data is entered via the web application exposed by GitLab itself. Some data is
also entered using system administration commands on the GitLab servers (e.g.,
- Secondary nodes also receive inputs via PostgreSQL streaming replication from the primary node.
- Primary nodes output via PostgreSQL streaming replication to the secondary node.
Otherwise, principally via the web application exposed by GitLab itself, and via
git cloneoperations initiated by the end-user.
- Secondary nodes and primary nodes interact via HTTP/HTTPS (secured with JSON web tokens) and via PostgreSQL streaming replication.
- Within a primary node or secondary node, the SSOT is the filesystem and the database (including Geo tracking database on secondary node). The various internal components are orchestrated to make alterations to these stores.
- Secondary nodes must have a faithful replication of the primary node’s data.
- Git repositories and files, tracking information related to the them, and the GitLab database contents.
- Neither primary nodes or secondary nodes encrypt Git repository or filesystem data at
rest. A subset of database columns are encrypted at rest using the
- A static secret shared across all hosts in a GitLab deployment.
- In transit, data should be encrypted, although the application does permit communication to proceed unencrypted. The two main transits are the secondary node’s replication process for PostgreSQL, and for Git repositories/files. Both should be protected using TLS, with the keys for that managed via Omnibus per existing configuration for end-user access to GitLab.
- Comprehensive system logs exist, tracking every connection to GitLab and PostgreSQL.
What encryption requirements have been defined for data in transit - including transmission over WAN, LAN, SecureFTP, or publicly accessible protocols such as http: and https:?
- Data must have the option to be encrypted in transit, and be secure against both passive and active attack (e.g., MITM attacks should not be possible).
- Geo adds one type of privilege: secondary nodes can access a special Geo API to download files over HTTP/HTTPS, and to clone repositories using HTTP/HTTPS.
- Secondary nodes identify to Geo primary nodes via OAuth or JWT authentication based on the shared database (HTTP access) or a PostgreSQL replication user (for database replication). The database replication also requires IP-based access controls to be defined.
- Secondary nodes must only be able to read data. They are not currently able to mutate data on the primary node.
- Geo JWTs are defined to last for only two minutes before needing to be regenerated.
- Geo JWTs are generated for one of the following specific scopes:
- Geo API access.
- Git access.
- LFS and File ID.
- Upload and File ID.
- Job Artifact and File ID.
- Secondary nodes make many calls to the primary node’s API. This is how file replication proceeds, for instance. This endpoint is only accessible with a JWT token.
- The primary node also makes calls to the secondary node to get status information.
What application auditing requirements have been defined? How are audit and debug logs accessed, stored, and secured?
- Structured JSON log is written to the filesystem, and can also be ingested into a Kibana installation for further analysis.