The Banzai pipeline and parsing

Parsing and rendering GitLab Flavored Markdown into HTML involves different components:

  • Banzai pipeline and it’s various filters
  • Markdown parser

The backend does all the processing for GLFM to HTML. This provides several benefits:

  • Security: We run robust sanitization which removes unknown tags, classes and ids.
  • References: Our reference syntax requires access to the database to resolve issues, etc, as well as redacting references in which the user has no access.
  • Consistency: We want to provide users with a consistent experience, which includes full support of the GLFM syntax and styling. Having a single place where the processing is done allows us to provide that.
  • Caching: We cache the HTML in our database when possible, such as for issue or MR descriptions, or comments.
  • Quick actions: We use a specialized pipeline to process quick actions, so that we can better detect them in Markdown text.

The frontend handles certain aspects when displaying:

  • Math blocks
  • Mermaid blocks
  • Enforcing certain limits, such as excessive number of math or mermaid blocks.

The Banzai pipeline

Named after the surf reef break in Hawaii, the Banzai pipeline consists of various filters (lib/banzai/filters) where Markdown and HTML is transformed in each one, in a pipeline fashion. Various pipelines (lib/banzai/pipeline) are defined, each with a different sequence of filters, such as AsciiDocPipeline, EmailPipeline.

The html-pipeline gem implements the pipeline/filter mechanism.

The primary pipeline is the FullPipeline, which is a combiantion of the PlainMarkdownPipeline and the GfmPipeline.

PlainMarkdownPipeline

This pipeline contains the filters for transforming raw Markdown into HTML, handled primarily by the Filter::MarkdownFilter.

Filter::MarkdownFilter

This filter interfaces with the actual Markdown parser. The primary parser uses our gitlab-glfm-markdown Ruby gem that uses the comrak Rust crate.

A secondary deprecated parser engine uses the commonmarker Ruby gem to interact with the cmark-gfm library.

Text is passed into this filter, and by calling the specified parser engine, generates the corresponding basic HTML.

GfmPipeline

This pipeline contains all the filters that perform the additional transformations on raw HTML into what we consider rendered GLFM. A Nokogiri document gets passed into each of these filters, and they perform the various transformations. For example, EmojiFitler, CommitTrailersFilter, or SanitizationFilter. Anything that can’t be handled by the initial Markdown parsing gets handled by these filters.

Of specific note is the SanitizationFilter. This is critical for providing safe HTML from possibly malicious input.

Performance

It’s important to not only have the filters run as fast as possible, but to ensure that they don’t take too long in general. For this we use several techniques:

  • For certain filters that can take a long time, we use a Ruby timeout with Gitlab::RenderTimeout.timeout in TimeoutFilterHandler. This allows us to interrupt the actual processing if it takes too long. It’s important to note that in general using Ruby timeout is not considered safe. We therefore only use it when absolutely necessary, preferring to fix an actual performance problem rather then using a timeout.
  • PipelineTimingCheck allows us to keep track of the cumulative amount of time the pipeline is taking. When we reach a maximum, we can then skip any remaining filters. For nearly all filters, it’s generally ok to skip them in a case like this in order to show the user something, rather than nothing.

    However, there are a couple instances where this is not advisable. For example in the SanitizationFilter, if that filter does not complete, then we can’t show the HTML to the user since there could still be unsanitized HTML. In those cases, we have to show an error message.

There is also a rake task that can be used for benchmarking. See the Performance Guidelines

Markdown parser

We use our gitlab-glfm-markdown Ruby gem that uses the comrak Rust crate.

comrak provides 100% compatibility with GFM and CommonMark while allowing additional extensions to be added to it. For example, we were able to implement our multi-line blockquote and wikilink syntax directly in comrak. The goal is to move more of the Ruby filters into either comrak (if it makes sense) or into gitlab-glfm-markdown.

Please see glfm_markdown.rb for the various options that get passed into comrak.

Debugging

Usually the easiest way to debug the various pipelines and filters is to run them from the Rails console. This way you can set a binding.pry in a filter and step through the code.

Because of TimeoutFilterHandler and PipelineTimingCheck, it can be a challenge to debug the filters. There is a special environment variable, GITLAB_DISABLE_MARKDOWN_TIMEOUT, that when set disables any timeout checking in the filters. This is also available for customers in the rare instance that a self-managed instance wishes to bypass those checks.

text = 'Some test **Markdown**'
html = Banzai.render(text, project: nil)

This renders the Markdown in relation to no project. Or you can render it in the context of a project:

project = Project.first
text = 'Some test **Markdown**'
html = Banzai.render(text, project: project)

The render method takes the text and a context hash, which provides various options for rendering. For example you can use pipeline: :ascii_doc to run the AsciiDocPipeline. The FullPipeline is the default.

If you specify debug_timing: true, then you will receive a list of filters and how long each takes.

Banzai.render(text, project: nil, debug_timing: true)

D, [2024-12-20T13:35:24.246463 #34584] DEBUG -- : 0.000012_s (0.000012_s): NormalizeSourceFilter [PreProcessPipeline]
D, [2024-12-20T13:35:24.246543 #34584] DEBUG -- : 0.000007_s (0.000019_s): TruncateSourceFilter [PreProcessPipeline]
D, [2024-12-20T13:35:24.246589 #34584] DEBUG -- : 0.000028_s (0.000047_s): FrontMatterFilter [PreProcessPipeline]
D, [2024-12-20T13:35:24.246662 #34584] DEBUG -- : 0.000005_s (0.000005_s): IncludeFilter [FullPipeline]
D, [2024-12-20T13:35:24.246684 #34584] DEBUG -- : 0.000003_s (0.000008_s): MarkdownPreEscapeLegacyFilter [FullPipeline]
D, [2024-12-20T13:35:24.246699 #34584] DEBUG -- : 0.000002_s (0.000010_s): DollarMathPreLegacyFilter [FullPipeline]
D, [2024-12-20T13:35:24.246715 #34584] DEBUG -- : 0.000003_s (0.000013_s): BlockquoteFenceLegacyFilter [FullPipeline]
D, [2024-12-20T13:35:24.246816 #34584] DEBUG -- : 0.000088_s (0.000101_s): MarkdownFilter [FullPipeline]
...
D, [2024-12-20T13:35:24.252338 #34584] DEBUG -- : 0.000013_s (0.004394_s): CustomEmojiFilter [FullPipeline]
D, [2024-12-20T13:35:24.252504 #34584] DEBUG -- : 0.000095_s (0.004489_s): TaskListFilter [FullPipeline]
D, [2024-12-20T13:35:24.252558 #34584] DEBUG -- : 0.000028_s (0.004517_s): SetDirectionFilter [FullPipeline]
D, [2024-12-20T13:35:24.252623 #34584] DEBUG -- : 0.000045_s (0.004562_s): SyntaxHighlightFilter [FullPipeline]

Use debug: true for even more detail per filter.