Import an external repository¶

Reference for the external repository import flow in gitrust: the role of the database, the complete sequence diagram, and optimization avenues.

1. Role of the database¶

During an import, the DB serves four distinct functions:

Role	Table	Frequency
Persistent queue	`import_jobs`	1 INSERT at creation
State tracking for the UI	`import_jobs`	1 UPDATE every ~1.5 s
Final repository record	`repositories` + `resources`	2 INSERTs at the end
Audit log	`audit_log`	1 INSERT at the end

The clone itself never goes through the DB; only metadata and progress counters do. Git objects are written directly to disk under {GIT_REPOS_BASE_PATH}/{owner}/{slug}.git/.

Why persist the state?¶

Recovery after restart: the server may crash during a clone that lasts several minutes. The table lets those jobs be marked as failed on restart.
SSE: the UI reads progress through an SSE endpoint that polls the DB every 2 s. Without the DB, you would need an in-memory channel plus a mechanism to route events to the right client.
Multi-browser: if the user closes the tab and reopens it, they find the exact status of the job.
Audit / history: past imports are retained with their statistics.

2. Current flow diagram¶

sequenceDiagram
  autonumber
  participant UI as Browser
  participant H as HTTP handler
  participant Svc as ImportService
  participant Chan as mpsc Channel
  participant W as ImportWorker
  participant G as git2 (libgit2)
  participant DB as PostgreSQL
  participant FS as Disk

  UI->>H: POST /import<br/>(url, slug, pat)
  H->>Svc: create_job
  Svc->>DB: INSERT import_jobs (pending)
  H->>Chan: try_send(ImportTask+PAT)
  H-->>UI: 302 /imports/{id}

  UI->>H: GET /imports/{id}/stream (SSE)
  Note over UI,H: EventSource open

  Chan->>W: ImportTask
  W->>Svc: mark_running
  Svc->>DB: SELECT import_jobs<br/>UPDATE status=running

  W->>G: RepoBuilder.bare(true).clone()
  G->>FS: init_bare + fetch
  loop transfer_progress callbacks (flood)
    G-->>W: stats (objects/bytes)
    alt 1500 ms throttle elapsed
      W->>Svc: update_progress (tokio::spawn)
      Svc->>DB: SELECT + UPDATE import_jobs
    else within the throttle window
      W--xW: ignored
    end
  end

  loop every 2 s
    H->>DB: SELECT import_jobs WHERE id=?
    DB-->>H: current state
    H-->>UI: SSE event (JSON)
    UI->>UI: bar.value = percent
  end

  G-->>FS: objects written
  G-->>W: Ok

  W->>Svc: check cancel (SELECT import_jobs)
  W->>Svc: update_progress (Finalizing)
  Svc->>DB: SELECT + UPDATE

  W->>RepositoryService: create
  RepositoryService->>DB: INSERT resources<br/>INSERT repositories
  W->>DB: UPDATE repositories<br/>(import_source_url, is_empty=false)
  W->>Svc: mark_success
  Svc->>DB: SELECT + UPDATE import_jobs<br/>(status=success, duration)
  W->>AuditService: log REPO_IMPORTED
  AuditService->>DB: INSERT audit_log

  H->>DB: SELECT (next SSE tick)
  H-->>UI: terminal=true
  UI->>UI: window.location.reload()
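
The throttle branch inside the transfer_progress loop can be sketched as a small time gate. This is a minimal illustration, not the worker's actual code: `ProgressThrottle` and `should_emit` are hypothetical names, and the current time is passed in explicitly so the behavior is easy to reason about.

```rust
use std::time::{Duration, Instant};

/// Hypothetical time gate: lets one progress update through per interval.
struct ProgressThrottle {
    min_interval: Duration,
    last_emit: Option<Instant>,
}

impl ProgressThrottle {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_emit: None }
    }

    /// Returns true when the caller should persist progress now.
    /// The first call always passes; later calls pass only once
    /// `min_interval` has elapsed since the last accepted call.
    fn should_emit(&mut self, now: Instant) -> bool {
        match self.last_emit {
            Some(prev) if now.saturating_duration_since(prev) < self.min_interval => false,
            _ => {
                self.last_emit = Some(now);
                true
            }
        }
    }
}
```

With a 1500 ms interval, a callback flood arriving every few milliseconds results in at most one accepted update per 1.5 s; every other callback falls into the "ignored" branch.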

3. DB cost per import¶

For a 90-second clone with a 1500 ms throttle:

Step	DB operations	Running total
create_job	1 INSERT	1
mark_running	1 SELECT + 1 UPDATE	3
update_progress during clone	~60 x (1 SELECT + 1 UPDATE)	123
check cancel	1 SELECT	124
update_progress finalizing	1 SELECT + 1 UPDATE	126
RepositoryService::create	2 INSERT (repo + resource)	128
update repositories	1 UPDATE	129
mark_success	1 SELECT + 1 UPDATE	131
audit log	1 INSERT	132
SSE stream (45 ticks at 2 s)	45 SELECT	177

~177 queries per import, of which 120+ are progress updates.
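
The total can be reproduced with a few lines of arithmetic. The sketch below is only a calculator for the per-step costs listed in this section; `DbCost` and `estimate_db_cost` are hypothetical names, not part of gitrust.

```rust
/// Query counts for one import, using the per-step costs listed in this section.
#[derive(Debug, PartialEq, Eq)]
struct DbCost {
    progress: u64, // SELECT + UPDATE pairs issued during the clone
    sse: u64,      // one SELECT per SSE tick
    total: u64,
}

fn estimate_db_cost(clone_secs: u64, throttle_ms: u64, sse_interval_secs: u64) -> DbCost {
    // One throttled progress update = 1 SELECT + 1 UPDATE.
    let progress = (clone_secs * 1000 / throttle_ms) * 2;
    // The SSE endpoint polls once per interval while the clone runs.
    let sse = clone_secs / sse_interval_secs;
    // Fixed steps: create_job (1) + mark_running (2) + check cancel (1)
    // + finalizing update (2) + RepositoryService::create (2)
    // + UPDATE repositories (1) + mark_success (2) + audit log (1).
    let fixed = 1 + 2 + 1 + 2 + 2 + 1 + 2 + 1;
    DbCost { progress, sse, total: fixed + progress + sse }
}
```

`estimate_db_cost(90, 1500, 2)` yields 120 progress queries, 45 SSE queries and 177 in total. Doubling the clone duration to 180 s gives 342, which shows that the cost grows almost linearly with clone time.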

4. Optimization paths (without changing libgit2)¶

4.1 Eliminate the SELECT before UPDATE in `update_progress`¶

update_progress currently calls load_model (SELECT) and then active.update(db) (UPDATE + RETURNING). The SELECT is redundant: we already know the ID and only want to patch the four progress columns (plus updated_at).

Gain: halves the progress queries (~60 fewer queries for a 90 s clone).

Suggested redesign with UpdateMany (pattern already used in ci_service.rs::cancel_running_pipelines):

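// Assumed imports; the path of the import_job entity depends on the crate layout.
use chrono::Utc;
use sea_orm::{sea_query::Expr, ColumnTrait, EntityTrait, QueryFilter};

// Single UPDATE with no prior SELECT: only the listed columns are patched.
import_job::Entity::update_many()
    .col_expr(Column::ReceivedObjects, Expr::value(received as i32))
    .col_expr(Column::TotalObjects, Expr::value(total as i32))
    .col_expr(Column::ReceivedBytes, Expr::value(bytes as i64))
    .col_expr(Column::Phase, Expr::value(phase.as_str()))
    .col_expr(Column::UpdatedAt, Expr::value(Utc::now()))
    .filter(Column::Id.eq(job_id))
    .exec(db)
    .await?;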

The same principle applies to mark_running, mark_success, mark_failed and mark_cancelled: each would halve its DB round trips.

4.2 Downsample progress in the worker (dedicated task)¶

Alternative plan: the callback simply writes to a tokio::sync::watch channel carrying TransferStats (in memory, zero DB). A dedicated task consumes this channel and updates the DB at 1 Hz only.

flowchart LR
  G[git2 callback] -->|watch::send<br/>very high frequency| W[watch channel]
  W --> T[Update task<br/>1 Hz ticker]
  T -->|1 UPDATE / s| DB

Gains:
- A single DB writer instead of N competing tokio::spawn tasks → zero contention on the pool
- Deterministic frequency (1 Hz), independent of network speed
- Clearer code (the task is async, so we can await directly)
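
The core of this design is a latest-value slot: the producer overwrites, the consumer reads only the most recent value, and nothing queues up. The sketch below illustrates that behavior with std primitives only (an `Arc<Mutex<…>>` plus a version counter) so it stays dependency-free. It is a stand-in, not tokio's watch channel: in the worker, tokio::sync::watch together with a tokio::time::interval ticker would play this role. The `TransferStats` fields and the names `ProgressSlot` and `ProgressWriter` are assumptions made for illustration.

```rust
use std::sync::{Arc, Mutex};

/// Assumed shape of the stats; the field names are illustrative.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct TransferStats {
    received_objects: u64,
    total_objects: u64,
    received_bytes: u64,
}

#[derive(Default)]
struct Slot {
    latest: TransferStats,
    version: u64, // bumped on every publish
}

/// Producer handle, cloned into the git2 callback.
#[derive(Clone, Default)]
struct ProgressSlot {
    inner: Arc<Mutex<Slot>>,
}

impl ProgressSlot {
    /// Overwrites the previous value: memory only, no DB access.
    fn publish(&self, stats: TransferStats) {
        let mut slot = self.inner.lock().unwrap();
        slot.latest = stats;
        slot.version += 1;
    }
}

/// Consumer side, owned by the single task that writes to the DB.
struct ProgressWriter {
    slot: ProgressSlot,
    seen_version: u64,
}

impl ProgressWriter {
    fn new(slot: ProgressSlot) -> Self {
        Self { slot, seen_version: 0 }
    }

    /// Called once per tick (1 Hz in the proposal). Returns the stats to
    /// persist, or None when nothing changed since the previous tick.
    fn tick(&mut self) -> Option<TransferStats> {
        let slot = self.slot.inner.lock().unwrap();
        if slot.version == self.seen_version {
            return None;
        }
        self.seen_version = slot.version;
        Some(slot.latest)
    }
}
```

However many times `publish` is called between two ticks, `tick` yields at most one value, so the number of UPDATE statements depends only on the tick rate and not on how often the callback fires.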

4.3 Remove implicit `RETURNING` from SeaORM¶

ActiveModel::update(db) returns the complete model via ... RETURNING *. For progress updates, we never use the returned model. UpdateMany::exec does not emit RETURNING → fewer bytes on the wire, less deserialization.

4.4 Dedicated DB pool for the worker¶

Today the worker shares the main pool with the HTTP handlers. A pool of 4-8 connections dedicated to the worker would prevent an import from saturating the pool and causing user requests to time out.

4.5 Debounce the server-side SSE¶

The SSE endpoint runs a SELECT every 2 s even when nothing has changed. Alternative: PostgreSQL LISTEN/NOTIFY on a channel named import_job_{uuid}. Each UPDATE emits a NOTIFY and the SSE connection does a LISTEN → direct push, zero polling.

Cost: requires a dedicated DB connection per SSE stream (LISTEN is stateful). Trade-off to be evaluated.

5. Major optimization: shell-out to `git clone --bare`¶

Independent of the DB optimizations, but by far the biggest gain. libgit2 is structurally 2-5x slower than the git CLI on large HTTPS clones (no multiplexing, single-threaded delta resolution, etc.).

flowchart TB
  subgraph Actuel[Current flow: libgit2]
    A1[RepoBuilder::clone] --> A2[libgit2: fetch + resolve]
    A2 -->|2-5x slower| A3[bare repo]
  end
  subgraph Cible[Optimized flow: git CLI]
    B1[Command::new git] --> B2[git clone --bare --progress]
    B2 -->|stderr| B3[Regex parser<br/>Receiving objects: X%]
    B3 --> B4[update watch channel]
    B2 -->|native speed| B5[bare repo]
  end

Advantages:
- Native speed (parity with the git CLI)
- Native progress via stderr: Receiving objects: 42% (368/876), 1.10 MiB | 2.05 MiB/s
- HEAD correctly positioned (no need for the RepoBuilder workaround)
- Less dependency on libgit2 for the initial clone

Trade-off: the git binary is required on the server (already the case: it is used by SbomService::git_archive).
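
The parsing step for the stderr progress line could look like the sketch below. It uses plain string splitting instead of a regex to avoid an extra dependency. `ReceiveProgress` and `parse_receiving_objects` are hypothetical names, and the line format is assumed to match the sample shown in the advantages list. Note that git separates successive progress updates with carriage returns, so the reader feeding this function would need to split stderr on `\r` as well as `\n`.

```rust
/// Progress extracted from one `Receiving objects:` line.
#[derive(Debug, PartialEq, Eq)]
struct ReceiveProgress {
    percent: u8,
    received: u64,
    total: u64,
}

/// Parses a line such as
/// `Receiving objects:  42% (368/876), 1.10 MiB | 2.05 MiB/s`.
/// Returns None for any other line (other phases, remote messages, noise).
fn parse_receiving_objects(line: &str) -> Option<ReceiveProgress> {
    let rest = line.trim().strip_prefix("Receiving objects:")?.trim_start();
    // rest = "42% (368/876), 1.10 MiB | 2.05 MiB/s"
    let (percent, rest) = rest.split_once('%')?;
    let percent: u8 = percent.trim().parse().ok()?;
    // Keep only what sits between the parentheses: "368/876".
    let (_, rest) = rest.split_once('(')?;
    let (counts, _) = rest.split_once(')')?;
    let (received, total) = counts.split_once('/')?;
    Some(ReceiveProgress {
        percent,
        received: received.trim().parse().ok()?,
        total: total.trim().parse().ok()?,
    })
}
```

Returning `None` for unrecognized lines keeps the parser tolerant: phases such as `Resolving deltas:` can be ignored or handled by a sibling function without breaking the progress feed.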

6. Recommended priorities¶

If we were to optimize now, the proposed order is:

1. Shell-out git clone --bare: real gain for the user (2-5x faster clone).
2. update_many for update_progress: halves DB pressure, a mechanical change of ~20 lines.
3. watch channel + 1 Hz task: cleaner architecture, eliminates pool contention.
4. Dedicated worker DB pool: operational safety net.
5. LISTEN/NOTIFY for SSE: only if there are many simultaneous clients.

Items 2-4 are useful even with libgit2; item 1 is the most visible to the end user.
