Pseudonymizer

Your GitLab database contains sensitive information. To protect it when you run analytics on the database, you can use the Pseudonymizer service, which:

  1. Uses HMAC(SHA256) to mutate fields containing sensitive information.
  2. Preserves references (referential integrity) between fields.
  3. Exports your GitLab data, scrubbed of sensitive material.

Caution: If the source data is available, users can compare and correlate the scrubbed data with the original.
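
As a rough illustration of how keyed hashing can scrub values while preserving references, here is a minimal Ruby sketch. It is not the Pseudonymizer's actual implementation: the `pseudonymize` helper, the secret, and the sample rows are all hypothetical. It only shows that HMAC-SHA256 maps the same input to the same digest, so a foreign key and the primary key it points to still match after scrubbing.

```ruby
require 'openssl'

# Hypothetical helper: returns the hex HMAC-SHA256 digest of a value.
# The secret is an assumption made for this sketch.
def pseudonymize(value, secret)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret, value.to_s)
end

secret = 'example-secret-key'

# Hypothetical sample rows: an issue that references its author by ID.
users  = [{ id: 1, email: 'alice@example.com' }]
issues = [{ id: 10, author_id: 1 }]

scrubbed_users = users.map do |u|
  { id: pseudonymize(u[:id], secret), email: pseudonymize(u[:email], secret) }
end
scrubbed_issues = issues.map do |i|
  { id: pseudonymize(i[:id], secret), author_id: pseudonymize(i[:author_id], secret) }
end

# The same input always yields the same digest, so the reference still resolves.
puts scrubbed_issues.first[:author_id] == scrubbed_users.first[:id]
```

Because the mapping is deterministic, anyone who holds both the original data and the scrubbed data can line the two up, which is the risk the caution above describes.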

To generate a pseudonymized data set:

  1. Configure Pseudonymizer fields and output location.
  2. Enable Pseudonymizer data collection.
  3. Optional. Generate a data set manually.

Configure Pseudonymizer

To use the Pseudonymizer, configure both the fields you want to pseudonymize and the location to store the scrubbed data:

  1. Create a manifest file: This file describes the fields to include or pseudonymize.
    • Default manifest - GitLab provides a default manifest in your GitLab installation (example manifest.yml file). To use the example manifest file, use the config/pseudonymizer.yml relative path when you configure connection parameters.
    • Custom manifest - To use a custom manifest file, use the absolute path to the file when you configure the connection parameters.
  2. Configure connection parameters: In the configuration method appropriate for your version of GitLab, specify the object storage connection parameters (pseudonymizer.upload.connection).
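
To make the idea of "fields to include or pseudonymize" concrete, the following Ruby sketch applies a set of per-table rules to one row. The `scrub_row` helper, the `:include` and `:pseudonymize` keys, and the sample row are hypothetical and chosen only for illustration. They are not the schema of the real manifest file, so check the example manifest shipped with your installation for the actual format.

```ruby
require 'openssl'

# Hypothetical helper: keeps only the listed fields and replaces the
# fields marked for pseudonymization with an HMAC-SHA256 digest.
def scrub_row(row, rules, secret)
  row.each_with_object({}) do |(field, value), out|
    next unless rules[:include].include?(field) # drop fields that are not listed

    out[field] =
      if rules[:pseudonymize].include?(field)
        OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret, value.to_s)
      else
        value # exported unchanged
      end
  end
end

# Illustrative rules and data, not taken from a real manifest.
rules = { include: [:id, :author_id, :created_at], pseudonymize: [:id, :author_id] }
row   = { id: 10, author_id: 1, title: 'Internal plan', created_at: '2021-01-01' }

scrubbed = scrub_row(row, rules, 'example-secret-key')
puts scrubbed.keys.inspect
```

In this sketch the `title` field is dropped entirely because it is not listed, the two ID fields are replaced with digests, and `created_at` is exported as is.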

For Omnibus installations:

  1. Edit /etc/gitlab/gitlab.rb and add the following lines, replacing the example values with your own:

    gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
    gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
    gitlab_rails['pseudonymizer_upload_connection'] = {
      'provider' => 'AWS',
      'region' => 'eu-central-1',
      'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
      'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
    }
    

    If you are using AWS IAM profiles, omit the AWS access key and secret access key/value pairs and set use_iam_profile to true:

    gitlab_rails['pseudonymizer_upload_connection'] = {
      'provider' => 'AWS',
      'region' => 'eu-central-1',
      'use_iam_profile' => true
    }
    
  2. Save the file and reconfigure GitLab for the changes to take effect.


For installations from source:

  1. Edit /home/git/gitlab/config/gitlab.yml and add or amend the following lines:

    pseudonymizer:
      manifest: config/pseudonymizer.yml
      upload:
        remote_directory: 'gitlab-elt' # bucket name
        connection:
          provider: AWS
          aws_access_key_id: AWS_ACCESS_KEY_ID
          aws_secret_access_key: AWS_SECRET_ACCESS_KEY
          region: eu-central-1
    
  2. Save the file and restart GitLab for the changes to take effect.

Enable Pseudonymizer data collection

To enable data collection:

  1. On the top bar, select Menu > Admin.
  2. On the left sidebar, select Settings > Metrics and Profiling, then expand Pseudonymizer data collection.
  3. Select Enable Pseudonymizer data collection.
  4. Select Save changes.

Generate data set manually

You can also run the Pseudonymizer manually:

  1. Set these environment variables:
    • PSEUDONYMIZER_OUTPUT_DIR - Where to store the output CSV files. Defaults to /tmp. These commands produce CSV files that can be quite large. Make sure the directory can store a file at least 10% of the size of your database.
    • PSEUDONYMIZER_BATCH - The batch size when querying the database. Defaults to 100000.
  2. Run the command appropriate for your application:
    • Omnibus GitLab: sudo gitlab-rake gitlab:db:pseudonymizer
    • Installations from source: sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production

After you run the command, upload the output CSV files to your configured object storage. After the upload completes, delete the output files from the local disk.
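
The defaults and the sizing guidance for the manual run can be summarized in a short Ruby sketch. Both helpers, `pseudonymizer_settings` and `minimum_output_bytes`, are hypothetical names introduced here. The sketch assumes only the defaults stated above (/tmp and 100000) and the 10% rule of thumb, and it does not reflect how the rake task reads its configuration internally.

```ruby
# Hypothetical helper: resolves the two environment variables,
# falling back to the documented defaults when they are unset.
def pseudonymizer_settings(env = ENV)
  {
    output_dir: env.fetch('PSEUDONYMIZER_OUTPUT_DIR', '/tmp'),
    batch_size: Integer(env.fetch('PSEUDONYMIZER_BATCH', '100000'))
  }
end

# Hypothetical helper: minimum space to reserve in the output directory,
# using the "at least 10% of the database size" guidance.
def minimum_output_bytes(database_bytes)
  (database_bytes / 10.0).ceil
end

settings = pseudonymizer_settings
puts "Output directory: #{settings[:output_dir]}"
puts "Batch size: #{settings[:batch_size]}"

# Example: a database of 50 GiB (an illustrative figure).
puts "Reserve at least #{minimum_output_bytes(50 * 1024**3)} bytes"
```

For the illustrative 50 GiB database, the guidance works out to about 5 GiB of free space in the output directory before you run the command.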