Principles of Importer Design
Security
- Uploaded files must be validated. Examples:
Logging
- Logs should contain the importer type such as `github`, `bitbucket`, `bitbucket_server`. You can find a full list of import sources in `Gitlab::ImportSources`.
- Logs should include any information likely to aid in debugging:
  - Object identifiers such as `id`, `iid`, and type of object
  - Error or status messages
- Logs should not include sensitive or private information, including but not limited to:
  - Usernames
  - Email addresses
- Where applicable, we should track the error in `Gitlab::Import::ImportFailureService` to aid in displaying errors in the UI.
- Logging should raise an error in development if key identifiers are missing, as demonstrated in this MR.
- A log line should be created before and after each record is imported, containing that record’s identifier.
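The logging rules above can be sketched in plain Ruby. This is a minimal illustration, not the GitLab implementation: `build_import_log`, `REQUIRED_KEYS`, `SENSITIVE_KEYS`, and `MissingIdentifierError` are hypothetical names, and the `development` flag stands in for a real environment check.

```ruby
require 'json'

# Hypothetical key lists; a real importer would define its own.
REQUIRED_KEYS = %i[importer object_type id].freeze
SENSITIVE_KEYS = %i[username email].freeze

class MissingIdentifierError < StandardError; end

# Builds one structured log line.
# `development` is an assumption standing in for an environment check.
def build_import_log(payload, development: false)
  missing = REQUIRED_KEYS.reject { |key| payload[key] }
  if development && missing.any?
    raise MissingIdentifierError, "missing log keys: #{missing.join(', ')}"
  end

  # Drop sensitive fields before the payload is serialized.
  safe = payload.reject { |key, _| SENSITIVE_KEYS.include?(key) }
  JSON.generate(safe)
end

line = build_import_log(
  { importer: 'github', object_type: 'issue', id: 42, iid: 7,
    message: 'starting import', email: 'user@example.com' }
)
puts line
```

Calling the helper once before and once after each record is imported, with the same `id`, gives the paired log lines described above.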
Performance
- A cache with a default TTL of 24 hours should be used to prevent duplicate database queries and API calls.
- Workers that loop over collections should be equipped with a progress pointer that allows them to pick up where they left off if interrupted.
- Write-heavy workers should implement `defer_on_database_health_signal` to avoid saturating the database. However, at the time of writing, a known issue prevents us from using this.
- We should enforce limits on worker concurrency to avoid saturating resources. You can find an example of this in the Bitbucket `ParallelScheduling` class.
- Importers should be tested at scale on a staging environment, especially when implementing new functionality or enabling a feature flag.
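As an illustration of the progress pointer mentioned above, the following plain-Ruby sketch resumes a loop after an interruption. `ProgressStore`, `import_collection`, and `Interrupted` are hypothetical names; the in-memory store is only a stand-in for whatever shared cache a real worker would use.

```ruby
# Hypothetical in-memory store standing in for a shared cache.
class ProgressStore
  def initialize
    @data = {}
  end

  def read(key)
    @data.fetch(key, 0)
  end

  def write(key, value)
    @data[key] = value
  end
end

class Interrupted < StandardError; end

# Processes items starting at the stored pointer and advances the
# pointer only after each item has been handled successfully.
def import_collection(items, store:, key:)
  start = store.read(key)
  items.drop(start).each_with_index do |item, offset|
    yield item
    store.write(key, start + offset + 1)
  end
end

store = ProgressStore.new
items = %w[a b c d e]
seen = []
interrupt = true

begin
  import_collection(items, store: store, key: 'issues') do |item|
    if item == 'c' && interrupt
      interrupt = false
      raise Interrupted # simulates the worker being stopped mid-loop
    end
    seen << item
  end
rescue Interrupted
  # A real worker would be re-enqueued at this point.
end

# The second run picks up at 'c' instead of starting over.
import_collection(items, store: store, key: 'issues') { |item| seen << item }
```

Because the pointer is written after the item is handled, an interruption can cause the last item to be processed again, which is one reason the workers also need to be idempotent.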
Resilience
- Workers should be idempotent so they can be retried safely in the case of failure.
- Workers should be re-enqueued with a delay that respects concurrent batch limits.
- Individual workers should not run for a long time. Workers that run for a long time can be interrupted by Sidekiq due to a deploy, or be misidentified by `StuckProjectImportJobsWorker` as being part of an import that is stuck and should be failed.
  - If a worker must run for a long time it must refresh its JID using `Gitlab::Import::RefreshImportJidWorker` to avoid being terminated by `StuckProjectImportJobsWorker`. It may also need to raise its Sidekiq `max_retries_after_interruption`. Refer to the GitHub importer implementation.
- Workers that rely on cached values must implement fall-back mechanisms to fetch data in the event of a cache miss.
  - Re-fetch data if possible and performant.
  - Gracefully handle missing values.
- Long-running workers should be annotated with `worker_resource_boundary :memory` to place them on a shard with a two-hour termination grace period. A long termination grace period is not a replacement for writing fast workers. Apdex SLO compliance can be monitored on the I&I team Grafana dashboard.
- Workers that create data should not fail an entire import if a single record fails to import. They must log the appropriate error and make a decision on whether or not to retry based on the nature of the error.
- Import Stage workers (which include `StageMethods`) and Advance Stage workers (which include `Gitlab::Import::AdvanceStage`) should have `retries: 6` to make them more resilient to system interruptions. With exponential back-off, six retries span approximately 20 minutes. A higher retry count holds up an import for too long.
- It should be possible to retry a portion of an import, for example re-importing missing issues without overwriting the entire destination project.
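The idempotency and per-record failure principles above might look like the following in plain Ruby. This is a sketch under stated assumptions: `import_batch`, `ImportResult`, `RateLimitError`, and `InvalidRecordError` are hypothetical names, and the set of imported IDs stands in for whatever persistent record of completed work a real importer keeps.

```ruby
require 'set'

# Hypothetical error classes: one worth retrying, one not.
class RateLimitError < StandardError; end
class InvalidRecordError < StandardError; end

ImportResult = Struct.new(:imported, :retry_later, :skipped)

# Imports each record independently. A failure is logged and classified,
# and never aborts the rest of the batch.
def import_batch(records, imported_ids:, log: [])
  result = ImportResult.new([], [], [])

  records.each do |record|
    id = record.fetch(:id)
    next if imported_ids.include?(id) # idempotent: skip completed work

    begin
      yield record
      imported_ids << id
      result.imported << id
    rescue RateLimitError => e
      log << { id: id, error: e.class.name, retry: true }
      result.retry_later << id
    rescue InvalidRecordError => e
      log << { id: id, error: e.class.name, retry: false }
      result.skipped << id
    end
  end

  result
end

imported_ids = Set.new
records = [{ id: 1 }, { id: 2 }, { id: 3 }]
attempts = Hash.new(0)

# Record 2 fails once with a transient error; record 3 always fails.
importer = lambda do |record|
  attempts[record[:id]] += 1
  raise RateLimitError if record[:id] == 2 && attempts[2] == 1
  raise InvalidRecordError if record[:id] == 3
end

first = import_batch(records, imported_ids: imported_ids, &importer)
second = import_batch(records, imported_ids: imported_ids, &importer)
```

Running the batch a second time retries only the records that did not complete, so record 1 is not imported twice.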
Consistency
- Importers should fire callbacks after saving records. Problematic callbacks can be disabled for imports on an individual basis:
  - Include the `Importable` module.
  - Configure the callback to skip if `importing?`.
  - Set the `importing` value on the object under import.
- If records must be inserted in bulk, consider manually running callbacks.
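The three steps for disabling a callback during import can be sketched without Rails. The module below is a hypothetical stand-in named `ImportableSketch`, not the real `Importable` module, and the `save` method imitates a conditional callback by hand. In an Active Record model the equivalent would typically be a callback declared with an `unless: :importing?` option.

```ruby
# Hypothetical stand-in for the Importable module.
module ImportableSketch
  attr_accessor :importing

  def importing?
    !!importing
  end
end

class Issue
  include ImportableSketch # step 1: include the module

  attr_reader :notifications

  def initialize
    @notifications = []
  end

  # Imitates saving a record and then firing an after-save callback.
  def save
    notify_subscribers unless importing? # step 2: skip if importing?
    true
  end

  private

  def notify_subscribers
    @notifications << :issue_created
  end
end

regular = Issue.new
regular.save

imported = Issue.new
imported.importing = true # step 3: set the value on the object under import
imported.save
```

Only the callback guarded by `importing?` is skipped, so other callbacks on the same record still fire during an import.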