Setting up Geo requires careful attention to detail, and sometimes it's easy to miss a step.

Here is a list of steps you should take to attempt to fix the problem:

Before attempting more advanced troubleshooting:
A site shows as “Unhealthy” if the site’s status is more than 10 minutes old. In that case, try running the following in the Rails console on the affected secondary site:
Geo::MetricsUpdateWorker.new.perform
If it raises an error, then the error is probably also preventing the jobs from completing. If it takes longer than 10 minutes, then there may be a performance issue, and the UI may always show “Unhealthy” even if the status eventually does get updated.
If it successfully updates the status, then something may be wrong with Sidekiq. Is it running? Do the logs show errors? This job is supposed to be enqueued every minute. It takes an exclusive lease in Redis to ensure that only one of these jobs can run at a time. The primary site updates its status directly in the PostgreSQL database. Secondary sites send an HTTP POST request to the primary site with their status data.
A site also shows as “Unhealthy” if certain health checks fail. You can reveal the failure by running the following in the Rails console on the affected secondary site:
Gitlab::Geo::HealthCheck.new.perform_checks
If it returns "" (an empty string) or "Healthy", then the checks succeeded. If it returns anything else, then the message should explain what failed, or show the exception message.
For information about how to resolve common error messages reported from the user interface, see Fixing Common Errors.
If the user interface is not working, or you are unable to sign in, you can run the Geo health check manually to get this information and a few more details.
The use of a custom NTP server was introduced in GitLab 15.7.
This Rake task can be run on a Rails node in the primary or secondary Geo sites:
sudo gitlab-rake gitlab:geo:check
Example output:
Checking Geo ...
GitLab Geo is available ... yes
GitLab Geo is enabled ... yes
This machine's Geo node name matches a database record ... yes, found a secondary node named "Shanghai"
GitLab Geo tracking database is correctly configured ... yes
Database replication enabled? ... yes
Database replication working? ... yes
GitLab Geo HTTP(S) connectivity ...
* Can connect to the primary node ... yes
HTTP/HTTPS repository cloning is enabled ... yes
Machine clock is synchronized ... yes
Git user has default SSH configuration? ... yes
OpenSSH configured to use AuthorizedKeysCommand ... yes
GitLab configured to disable writing to authorized_keys file ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Checking Geo ... Finished
You can also specify a custom NTP server using environment variables. For example:
export NTP_HOST="ntp.ubuntu.com"
export NTP_TIMEOUT="30"
sudo gitlab-rake gitlab:geo:check
The following environment variables are supported.
| Variable | Description | Default value |
|---|---|---|
| NTP_HOST | The NTP host. | pool.ntp.org |
| NTP_PORT | The NTP port the host listens on. | ntp |
| NTP_TIMEOUT | The NTP timeout in seconds. | The value defined in the net-ntp Ruby library (60 seconds). |
The three status items are defined as follows:

To find more details about failed items, check the gitlab-rails/geo.log file.

If you notice replication or verification failures, you can try to resolve them. If there are Repository check failures, you can try to resolve them.
To check if PostgreSQL replication is working, check that:

- Your primary Geo site points to the database node that has write permissions.
- Any secondary sites point only to read-only database nodes.
Geo finds the current Puma or Sidekiq node's Geo site name in /etc/gitlab/gitlab.rb with the following logic, using the first value that is defined:

1. The gitlab_rails['geo_node_name'] setting.
2. The global.geo.nodeName setting (see Charts with GitLab Geo).
3. The external_url setting.
This name is used to look up the Geo site with the same Name in the Geo Sites dashboard.
To check if the current machine has a site name that matches a site in the database, run the check task:
sudo gitlab-rake gitlab:geo:check
It displays the current machine’s site name and whether the matching database record is a primary or secondary site.
This machine's Geo node name matches a database record ... yes, found a secondary node named "Shanghai"
This machine's Geo node name matches a database record ... no
Try fixing it:
You could add or update a Geo node database record, setting the name to "https://example.com/".
Or you could set this machine's Geo node name to match the name of an existing database record: "London", "Shanghai"
For more information see:
doc/administration/geo/replication/troubleshooting.md#can-geo-detect-the-current-node-correctly
For more information about recommended site names in the description of the Name field, see Geo Admin Area Common Settings .
Upload.verification_state_table_class.each_batch do |relation|
  relation.update_all(verification_state: 0)
end

This causes the primary to start checksumming all Uploads. When a primary successfully checksums a record, then all secondaries recalculate the checksum as well, and they compare the values. A similar thing can be done for all Models handled by the Geo Self-Service Framework which have implemented verification:

- LfsObject
- MergeRequestDiff
- Packages::PackageFile
- Terraform::StateVersion
- SnippetRepository
- Ci::PipelineArtifact
- PagesDeployment
- Upload
- Ci::JobArtifact
- Ci::SecureFile

GroupWikiRepository is not in the previous list since verification is not implemented.
There is an issue to implement this functionality in the Admin Area UI.
Message: WARNING: oldest xmin is far in the past and pg_wal size growing
If a replication slot is inactive, the pg_wal logs corresponding to the slot are reserved forever (or until the slot is active again). This causes continuous disk usage growth and the following messages appear repeatedly in the PostgreSQL logs:
WARNING: oldest xmin is far in the past
HINT: Close open transactions soon to avoid wraparound problems.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
To fix this:

1. Connect to the primary database.
2. Run SELECT * FROM pg_replication_slots;.
3. Note the slot_name that reports active as f (false).
4. Follow the steps to remove that Geo site.
Fixing errors found when running the Geo check Rake task
When running this Rake task, you may see error messages if the nodes are not properly configured:
sudo gitlab-rake gitlab:geo:check
Rails did not provide a password when connecting to the database.
Checking Geo ...
GitLab Geo is available ... Exception: fe_sendauth: no password supplied
GitLab Geo is enabled ... Exception: fe_sendauth: no password supplied
Checking Geo ... Finished
Ensure you have the gitlab_rails['db_password'] set to the plain-text password used when creating the hash for postgresql['sql_user_password'].
Rails is unable to connect to the database.
Checking Geo ...
GitLab Geo is available ... Exception: FATAL: no pg_hba.conf entry for host "1.1.1.1", user "gitlab", database "gitlabhq_production", SSL on
FATAL: no pg_hba.conf entry for host "1.1.1.1", user "gitlab", database "gitlabhq_production", SSL off
GitLab Geo is enabled ... Exception: FATAL: no pg_hba.conf entry for host "1.1.1.1", user "gitlab", database "gitlabhq_production", SSL on
FATAL: no pg_hba.conf entry for host "1.1.1.1", user "gitlab", database "gitlabhq_production", SSL off
Checking Geo ... Finished
Ensure you have the IP address of the Rails node included in postgresql['md5_auth_cidr_addresses']. Also, ensure you have included the subnet mask on the IP address: postgresql['md5_auth_cidr_addresses'] = ['1.1.1.1/32'].
Rails has supplied the incorrect password.
Checking Geo ...
GitLab Geo is available ... Exception: FATAL: password authentication failed for user "gitlab"
FATAL: password authentication failed for user "gitlab"
GitLab Geo is enabled ... Exception: FATAL: password authentication failed for user "gitlab"
FATAL: password authentication failed for user "gitlab"
Checking Geo ... Finished
Verify the correct password is set for gitlab_rails['db_password'] that was used when creating the hash in postgresql['sql_user_password'] by running gitlab-ctl pg-password-md5 gitlab and entering the password.
Check returns not a secondary node.
Checking Geo ...
GitLab Geo is available ... yes
GitLab Geo is enabled ... yes
GitLab Geo tracking database is correctly configured ... not a secondary node
Database replication enabled? ... not a secondary node
Checking Geo ... Finished
Ensure you have added the secondary site in the Main menu > Admin > Geo > Sites on the web interface for the primary site. Also ensure you entered the gitlab_rails['geo_node_name'] when adding the secondary site in the Admin Area of the primary site.

In GitLab 12.3 and earlier, edit the secondary site in the Admin Area of the primary site and ensure that there is a trailing / in the Name field.
Check returns Exception: PG::UndefinedTable: ERROR: relation "geo_nodes" does not exist.
Checking Geo ...
GitLab Geo is available ... no
Try fixing it:
Add a new license that includes the GitLab Geo feature
For more information see:
https://about.gitlab.com/features/gitlab-geo/
GitLab Geo is enabled ... Exception: PG::UndefinedTable: ERROR: relation "geo_nodes" does not exist
LINE 8: WHERE a.attrelid = '"geo_nodes"'::regclass
: SELECT a.attname, format_type(a.atttypid, a.atttypmod),
pg_get_expr(d.adbin, d.adrelid), a.attnotnull, a.atttypid, a.atttypmod,
c.collname, col_description(a.attrelid, a.attnum) AS comment
FROM pg_attribute a
LEFT JOIN pg_attrdef d ON a.attrelid = d.adrelid AND a.attnum = d.adnum
LEFT JOIN pg_type t ON a.atttypid = t.oid
LEFT JOIN pg_collation c ON a.attcollation = c.oid AND a.attcollation <> t.typcollation
WHERE a.attrelid = '"geo_nodes"'::regclass
AND a.attnum > 0 AND NOT a.attisdropped
ORDER BY a.attnum
Checking Geo ... Finished
When performing a PostgreSQL major version upgrade (for example, 9 > 10), this is expected. Follow the steps to initiate the replication process again.
Rails does not appear to have the configuration necessary to connect to the Geo tracking database.
Checking Geo ...
GitLab Geo is available ... yes
GitLab Geo is enabled ... yes
GitLab Geo tracking database is correctly configured ... no
Try fixing it:
Rails does not appear to have the configuration necessary to connect to the Geo tracking database. If the tracking database is running on a node other than this one, then you may need to add configuration.
Checking Geo ... Finished
- If you are running the secondary site on a single node for all services, then follow Geo database replication - Configure the secondary server.
- If you are running the secondary site's tracking database on its own node, then follow Geo for multiple servers - Configure the Geo tracking database on the Geo secondary site.
- If you are running the secondary site's tracking database in a Patroni cluster, then follow Geo database replication - Configure the tracking database on the secondary sites.
- If you are running the secondary site's tracking database in an external database, then follow Geo with external PostgreSQL instances.
If the Geo check task was run on a node which is not running a service which runs the GitLab Rails app (Puma, Sidekiq, or Geo Log Cursor), then this error can be ignored. The node does not need Rails to be configured.
Message: Machine clock is synchronized … Exception
The Ruby gem which performs the check is hard-coded with pool.ntp.org as its reference time source.
In this case, in GitLab 15.7 and newer, specify a custom NTP server using environment variables.
In GitLab 15.6 and older, use one of the following workarounds:
- Add entries in /etc/hosts for pool.ntp.org to direct the request to valid local time servers. This fixes the long timeout and the timeout error.
- Direct the check to any valid IP address. This resolves the timeout issue, but the check fails with the No route to host error, as noted above.
Cloud native GitLab deployments generate an error because containers in Kubernetes do not have access to the host clock:
Machine clock is synchronized ... Exception: getaddrinfo: Servname not supported for ai_socktype
Message: ActiveRecord::StatementInvalid: PG::ReadOnlySqlTransaction: ERROR: cannot execute INSERT in a read-only transaction
The PostgreSQL read-replica database would be producing these errors. To resolve the error, follow Step 3: Add the secondary site.
Fixing PostgreSQL database replication errors

The following sections outline troubleshooting steps for fixing replication error messages (indicated by Database replication working? ... no in the geo:check output).
Message: ERROR: replication slots can only be used if max_replication_slots > 0
Be sure to restart PostgreSQL for this to take effect. See the
PostgreSQL replication setup guide for more details.
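This setting lives on the primary database node. As a sketch (assuming a Linux package install and a single secondary site; adjust the value to your topology):

```ruby
# /etc/gitlab/gitlab.rb on the primary database node (sketch; not authoritative).
# Geo needs at least one replication slot per secondary site.
postgresql['max_replication_slots'] = 1

# Apply with:
#   sudo gitlab-ctl reconfigure
#   sudo gitlab-ctl restart postgresql
```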
Message: FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist

This occurs when PostgreSQL does not have a replication slot for the secondary site by that name. You may want to rerun the replication process on the secondary site.
Message: “Command exceeded allowed execution time” when setting up replication

This may happen while initiating the replication process on the secondary site, and indicates your initial dataset is too large to be replicated in the default timeout (30 minutes).

Re-run gitlab-ctl replicate-geo-database, but include a larger value for --backup-timeout:
sudo gitlab-ctl \
replicate-geo-database \
--host=<primary_node_hostname> \
--slot-name=<secondary_slot_name> \
--backup-timeout=21600
This gives the initial replication up to six hours to complete, rather than
the default 30 minutes. Adjust as required for your installation.
Message: “PANIC: could not write to file pg_xlog/xlogtemp.123: No space left on device”
1. View your replication slots:

   SELECT * FROM pg_replication_slots;

   Slots where active is f are not active.

2. When this slot should be active, because you have a secondary site configured using that slot, sign in on the web interface for the secondary site and check the PostgreSQL logs to view why the replication is not running.

3. If you are no longer using the slot (for example, you no longer have Geo enabled), you can remove it in the PostgreSQL console session:

   SELECT pg_drop_replication_slot('<name_of_extra_slot>');
Message: “ERROR: canceling statement due to conflict with recovery”
These long-running queries are planned to be removed in the future, but as a workaround, we recommend enabling hot_standby_feedback. This increases the likelihood of bloat on the primary site as it prevents VACUUM from removing recently-dead rows. However, it has been used successfully in production on GitLab.com.
To enable hot_standby_feedback, add the following to /etc/gitlab/gitlab.rb on the secondary site:

postgresql['hot_standby_feedback'] = 'on'
Then reconfigure GitLab:
sudo gitlab-ctl reconfigure
To help us resolve this problem, consider commenting on the issue.
Message: FATAL: could not connect to the primary server: server certificate for "PostgreSQL" does not match host name
To fix this issue, you can either:

- Use the --sslmode=verify-ca argument with the replicate-geo-database command.
- For an already replicated database, change sslmode=verify-full to sslmode=verify-ca in /var/opt/gitlab/postgresql/data/gitlab-geo.conf and run gitlab-ctl restart postgresql.
- Configure SSL for PostgreSQL with a custom certificate (including the host name that's used to connect to the database in the CN or SAN) instead of using the automatically generated certificate.
Message: LOG: invalid CIDR mask in address

This happens on wrongly-formatted addresses in postgresql['md5_auth_cidr_addresses'].
Message: LOG: invalid IP mask "md5": Name or service not known
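Like the invalid CIDR mask message above, this typically points at a malformed entry in postgresql['md5_auth_cidr_addresses']. A sketch of a correctly formatted setting (the addresses below are placeholders):

```ruby
# /etc/gitlab/gitlab.rb -- every entry must use CIDR notation (IP plus subnet mask).
# '1.2.3.4' alone is invalid; '1.2.3.4/32' means the single host 1.2.3.4.
postgresql['md5_auth_cidr_addresses'] = ['1.2.3.4/32', '10.0.0.0/24']
```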
Message: Found data in the gitlabhq_production database! when running gitlab-ctl replicate-geo-database

In GitLab 13.4, a seed project is added when GitLab is first installed. This makes it necessary to pass --force even on a new Geo secondary site. There is an issue to account for seed projects when checking the database.
Message: Synchronization failed - Error syncing repository

If large repositories are affected by this problem, their resync may take a long time and cause significant load on your Geo sites, storage and network systems.

Removing the malformed objects causing consistency errors requires rewriting the repository history, which is not always an option. However, it's possible to override the consistency checks instead. To do that, follow the instructions in the Gitaly docs.
You can also get the error message Synchronization failed - Error syncing repository along with the following log messages. This indicates that the expected geo remote is not present in the .git/config file of a repository on the secondary Geo site's file system:

{
  "created": "@1603481145.084348757",
  "description": "Error received from peer unix:/var/opt/gitlab/gitaly/gitaly.socket",
  "grpc_message": "exit status 128",
  "grpc_status": 13,
  "grpc.request.fullMethod": "/gitaly.RemoteService/FindRemoteRootRef",
  "grpc.request.glProjectPath": "<namespace>/<project>",
  "level": "error",
  "msg": "fatal: 'geo' does not appear to be a git repository\nfatal: Could not read from remote repository. …"
}
To solve this:

1. Sign in on the web interface for the secondary Geo site.

2. Back up the .git folder.

3. Optional. Spot-check a few of those IDs whether they indeed correspond to a project with known Geo replication failures. Use fatal: 'geo' as the grep term and the following API call:

   curl --request GET --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.example.com/api/v4/projects/<first_failed_geo_sync_ID>"

4. Enter the Rails console and run:

   failed_geo_syncs = Geo::ProjectRegistry.failed.pluck(:id)
   failed_geo_syncs.each do |fgs|
     puts Geo::ProjectRegistry.failed.find(fgs).project_id
   end

5. Run the following commands to reset each project's Geo-related attributes and execute a new sync:

   failed_geo_syncs.each do |fgs|
     registry = Geo::ProjectRegistry.failed.find(fgs)
     registry.update(resync_repository: true, force_to_redownload_repository: false, repository_retry_count: 0)
     Geo::RepositorySyncService.new(registry.project).execute
   end
Very large repositories never successfully synchronize on the secondary site
New LFS objects are never replicated

If new LFS objects are never replicated to secondary Geo sites, check the version of GitLab you are running. GitLab versions 11.11.x or 12.0.x are affected by a bug that results in new LFS objects not being replicated to Geo secondary sites.

To resolve the issue, upgrade to GitLab 12.1 or later.
Failures during backfill

During a backfill, failures are scheduled to be retried at the end of the backfill queue, therefore these failures only clear up after the backfill completes.
Resetting Geo secondary site replication
Stop Sidekiq and the Geo LogCursor.

It's possible to make Sidekiq stop gracefully by making it stop picking up new jobs and waiting until the current jobs finish processing. Send a SIGTSTP kill signal for the first phase and then a SIGTERM when all jobs have finished. Otherwise, just use the gitlab-ctl stop commands.
gitlab-ctl status sidekiq
# run: sidekiq: (pid 10180) <- this is the PID you will use
kill -TSTP 10180 # change to the correct PID
gitlab-ctl stop sidekiq
gitlab-ctl stop geo-logcursor
You can watch the Sidekiq logs to know when Sidekiq jobs processing has finished:
gitlab-ctl tail sidekiq
Rename repository storage folders and create new ones. If you are not concerned about possible orphaned directories and files, you can skip this step.
mv /var/opt/gitlab/git-data/repositories /var/opt/gitlab/git-data/repositories.old
mkdir -p /var/opt/gitlab/git-data/repositories
chown git:git /var/opt/gitlab/git-data/repositories
You may want to remove the /var/opt/gitlab/git-data/repositories.old directory in the future, once you have confirmed that you don't need it anymore, to save disk space.
Optional. Rename other data folders and create new ones.

You may still have files on the secondary site that have been removed from the primary site, but this removal has not been reflected. If you skip this step, these files are not removed from the Geo secondary site.

Any uploaded content (like file attachments, avatars, or LFS objects) is stored in a subfolder in one of these paths:

- /var/opt/gitlab/gitlab-rails/shared
- /var/opt/gitlab/gitlab-rails/uploads
To rename all of them:
gitlab-ctl stop
mv /var/opt/gitlab/gitlab-rails/shared /var/opt/gitlab/gitlab-rails/shared.old
mkdir -p /var/opt/gitlab/gitlab-rails/shared
mv /var/opt/gitlab/gitlab-rails/uploads /var/opt/gitlab/gitlab-rails/uploads.old
mkdir -p /var/opt/gitlab/gitlab-rails/uploads
gitlab-ctl start postgresql
gitlab-ctl start geo-postgresql
Reconfigure to recreate the folders and make sure permissions and ownership
are correct:
gitlab-ctl reconfigure
Reset the Tracking Database.
If you skipped the optional step 3, be sure both geo-postgresql and postgresql services are running.
gitlab-rake db:drop:geo DISABLE_DATABASE_ENVIRONMENT_CHECK=1 # on a secondary app node
gitlab-ctl reconfigure # on the tracking database node
gitlab-rake db:migrate:geo # on a secondary app node
Restart previously stopped services.
gitlab-ctl start
Design repository failures on mirrored projects and project imports
On the top bar, under Main menu > Admin > Geo > Sites, if the Design repositories progress bar shows Synced and Failed greater than 100%, and negative Queued, the instance is likely affected by a bug in GitLab 13.2 and 13.3. It was fixed in GitLab 13.4 and later.
To determine the actual replication status of design repositories in a Rails console:

secondary = Gitlab::Geo.current_node
counts = {}
secondary.designs.select("projects.id").find_each do |p|
  registry = Geo::DesignRegistry.find_by(project_id: p.id)
  state = registry ? "#{registry.state}" : "registry does not exist yet"
  # puts "Design ID##{p.id}: #{state}" # uncomment this for granular information
  counts[state] ||= 0
  counts[state] += 1
end
puts "\nCounts:", counts
Example output:
Design ID#5: started
Design ID#6: synced
Design ID#7: failed
Design ID#8: pending
Design ID#9: synced
Counts:
{"started"=>1, "synced"=>2, "failed"=>1, "pending"=>1}
Example output if there are actually zero design repository replication failures:
Design ID#5: synced
Design ID#6: synced
Design ID#7: synced
Counts:
{"synced"=>3}
If you are promoting a Geo secondary site running on a single node
gitlab-ctl promotion-preflight-checks fails due to the existence of failed rows in the geo_design_registry table. Use the previous snippet to determine the actual replication status of Design repositories.

gitlab-ctl promote-to-primary-node also fails, since it runs preflight checks. If the previous snippet shows that all designs are synced, you can use the --skip-preflight-checks option or the --force option to move forward with promotion.
If you are promoting a Geo secondary site running on multiple servers
gitlab-ctl promotion-preflight-checks fails due to the existence of failed rows in the geo_design_registry table. Use the previous snippet to determine the actual replication status of Design repositories.
Sync failure message: “Verification failed with: Error during verification: File is not checksummable”
Missing files on the Geo primary site

Secondaries would regularly try to sync these files again by using the "verification" feature. Files can be missing on the primary site when, for example:

- A non-atomic backup was restored.
- Services, servers, or network infrastructure were interrupted or restarted during use.
This behavior affects only the following data types through GitLab 14.6:
| Data type | From version |
|---|---|
| Package Registry | 13.10 |
| CI Pipeline Artifacts | 13.11 |
| Terraform State Versions | 13.12 |
| Infrastructure Registry (renamed to Terraform Module Registry in GitLab 15.11) | 14.0 |
| External MR diffs | 14.6 |
| LFS Objects | 14.6 |
| Pages Deployments | 14.6 |
| Uploads | 14.6 |
| CI Job Artifacts | 14.6 |
Since GitLab 14.7, files that are missing on the primary site are now treated as sync failures to make Geo visibly surface data loss risks. The sync/verification loop is therefore short-circuited, and last_sync_failure is now set to The file is missing on the Geo primary site.
Failed syncs with GitLab-managed object storage replication
There is an issue in GitLab 14.2 through 14.7 that affects Geo when the GitLab-managed object storage replication is used, causing blob object types to fail synchronization.
Since GitLab 14.2, verification failures result in synchronization failures and cause a re-synchronization of these objects. As verification is not implemented for files stored in object storage (see issue 13845 for more details), this results in a loop that consistently fails for all objects stored in object storage.

You can work around this by marking the objects as synced and having succeeded verification. However, be aware that this can also mark objects that may be missing from the primary.

To do that, enter the Rails console and run:
Gitlab::Geo.verification_enabled_replicator_classes.each do |klass|
  updated = klass.registry_class.failed.where(last_sync_failure: "Verification failed with: Error during verification: File is not checksummable").update_all(verification_checksum: '0000000000000000000000000000000000000000', verification_state: 2, verification_failure: nil, verification_retry_at: nil, state: 2, last_sync_failure: nil, retry_at: nil, verification_retry_count: 0, retry_count: 0)
  pp "Updated #{updated} #{klass.replicable_name_plural}"
end
Message: curl 18 transfer closed with outstanding read data remaining & fetch-pack: unexpected disconnect while reading sideband packet
We recommend transferring each failing repository individually and checking for consistency after each transfer. Follow the single target rsync instructions to transfer each affected repository from the primary to the secondary site.
Fixing errors during a failover or when promoting a secondary to a primary site
Message: ActiveRecord::RecordInvalid: Validation failed: Name has already been taken

When promoting a secondary site, you might encounter the following error message:
Running gitlab-rake geo:set_secondary_as_primary...
rake aborted!
ActiveRecord::RecordInvalid: Validation failed: Name has already been taken
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/tasks/geo.rake:236:in `block (3 levels) in <top (required)>'
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/tasks/geo.rake:221:in `block (2 levels) in <top (required)>'
/opt/gitlab/embedded/bin/bundle:23:in `load'
/opt/gitlab/embedded/bin/bundle:23:in `<main>'
Tasks: TOP => geo:set_secondary_as_primary
(See full trace by running task with --trace)
You successfully promoted this node!
If you encounter this message when running gitlab-rake geo:set_secondary_as_primary or gitlab-ctl promote-to-primary-node, either:

- Enter a Rails console and run:

  Rails.application.load_tasks; nil
  Gitlab::Geo.expire_cache!
  Rake::Task['geo:set_secondary_as_primary'].invoke

- Upgrade to GitLab 12.6.3 or later if it is safe to do so. For example, if the failover was just a test. A caching-related bug was fixed.
Message: ActiveRecord::RecordInvalid: Validation failed: Enabled Geo primary node cannot be disabled
If you disabled a secondary site, either with the
replication pause task
(GitLab 13.2) or by using the user interface (GitLab 13.1 and earlier), you must first
re-enable the site before you can continue. This is fixed in GitLab 13.4.
This can be fixed in the database.

1. Start a database console:

   sudo gitlab-rails dbconsole --database main

   In GitLab 14.1 and earlier:

   sudo gitlab-rails dbconsole

2. Run the following command, replacing https://<secondary url>/ with the URL for your secondary node. You can use either http or https, but ensure that you end the URL with a slash (/):

   UPDATE geo_nodes SET enabled = true WHERE url = 'https://<secondary url>/' AND enabled = false;

   This should update one row.
Message: NoMethodError: undefined method `secondary?' for nil:NilClass

When promoting a secondary site, you might encounter the following error message:
sudo gitlab-rake geo:set_secondary_as_primary
rake aborted!
NoMethodError: undefined method `secondary?' for nil:NilClass
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/tasks/geo.rake:232:in `block (3 levels) in <top (required)>'
/opt/gitlab/embedded/service/gitlab-rails/ee/lib/tasks/geo.rake:221:in `block (2 levels) in <top (required)>'
/opt/gitlab/embedded/bin/bundle:23:in `load'
/opt/gitlab/embedded/bin/bundle:23:in `<main>'
Tasks: TOP => geo:set_secondary_as_primary
(See full trace by running task with --trace)
This command is intended to be executed on a secondary site only, and this error message
is displayed if you attempt to run this command on a primary site.
Message: sudo: gitlab-pg-ctl: command not found

When promoting a secondary site with multiple nodes, you need to run the gitlab-pg-ctl command to promote the PostgreSQL read-replica database.

In GitLab 12.8 and earlier, this command fails with the message:

sudo: gitlab-pg-ctl: command not found

In this case, the workaround is to use the full path to the binary, for example:

sudo /opt/gitlab/embedded/bin/gitlab-pg-ctl promote

GitLab 12.9 and later are unaffected by this error message.
Message: ERROR - Replication is not up-to-date during gitlab-ctl promotion-preflight-checks

Message: ERROR - Replication is not up-to-date during gitlab-ctl promote-to-primary-node

Errors when using --skip-preflight-checks or --force
This can happen with XFS or file systems that list files in lexical order, because the load order of the Omnibus GitLab command files can be different than expected, and a global function would get redefined. More details can be found in the related issue.

The workaround is to manually run the preflight checks and promote the database, by running the following commands on the Geo secondary site:
sudo gitlab-ctl promotion-preflight-checks
sudo /opt/gitlab/embedded/bin/gitlab-pg-ctl promote
sudo gitlab-ctl reconfigure
sudo gitlab-rake geo:set_secondary_as_primary
Expired artifacts

If you notice for some reason there are more artifacts on the Geo secondary site than on the Geo primary site, you can use the Rake task to clean up orphan artifact files.

On a Geo secondary site, this command also cleans up all Geo registry records related to the orphan files on disk.
Fixing sign in errors
Message: The redirect URI included is not valid
Authenticating with SAML on the secondary site always lands on the primary site

This problem is usually encountered when upgrading to GitLab 15.1. To fix this problem, see configuring instance-wide SAML in Geo with Single Sign-On.
Fixing common errors
Geo database configuration file is missing

GitLab cannot find or doesn't have permission to access the database_geo.yml configuration file.

An existing tracking database cannot be reused

Geo cannot reuse an existing tracking database. It is safest to use a fresh secondary, or reset the whole secondary by following Resetting Geo secondary site replication.
Geo site has a database that is writable which is an indication it is not configured for replication with the primary site
This error can occur when:

- An unsupported replication method was used (for example, logical replication).
- The instructions to set up a Geo database replication were not followed correctly.
- Your database connection details are incorrect, that is you have specified the wrong user in your /etc/gitlab/gitlab.rb file.

Geo secondary sites require two separate PostgreSQL instances:

- A read-only replica of the primary site.
- A regular, writable instance that holds replication metadata. That is, the Geo tracking database.

This error message indicates that the replica database in the secondary site is misconfigured and replication has stopped.

To restore the database and resume replication, you can do one of the following:

If you set up a new secondary from scratch, you must also remove the old site from the Geo cluster.
Geo site does not appear to be replicating the database from the primary site
The most common problems that prevent the database from replicating correctly are:

- Secondary sites cannot reach the primary site. Check credentials and firewall rules.
- SSL certificate problems. Make sure you copied /etc/gitlab/gitlab-secrets.json from the primary site.
- Database storage disk is full.
- Database replication slot is misconfigured.
- Database is not using a replication slot or another alternative and cannot catch up because WAL files were purged.

Make sure you follow the Geo database replication instructions for supported configuration.
Geo database version (…) does not match latest migration (…)
If you are using an Omnibus GitLab installation, something might have failed during upgrade. You can:
GitLab indicates that more than 100% of repositories were synced
This can be caused by orphaned records in the project registry. You can clear them using a Rake task.
Geo Admin Area returns 404 error for a secondary site
- Try restarting each Rails, Sidekiq, and Gitaly node on your secondary site using sudo gitlab-ctl restart.
- Check /var/log/gitlab/gitlab-rails/geo.log on Sidekiq nodes to see if the secondary site is using IPv6 to send its status to the primary site. If it is, add an entry to the primary site using IPv4 in the /etc/hosts file. Alternatively, you should enable IPv6 on the primary site.
Secondary site returns 502 errors with Geo proxying
When Geo proxying for secondary sites is enabled, and the secondary site user interface returns 502 errors, it is possible that the response header proxied from the primary site is too large. Check the NGINX logs for errors similar to this example:
2022/01/26 00:02:13 [error] 26641#0: *829148 upstream sent too big header while reading response header from upstream, client: 1.2.3.4, server: geo.staging.gitlab.com, request: "POST /users/sign_in HTTP/2.0", upstream: "http://unix:/var/opt/gitlab/gitlab-workhorse/sockets/socket:/users/sign_in", host: "geo.staging.gitlab.com", referrer: "https://geo.staging.gitlab.com/users/sign_in"
To resolve this issue:

1. Set nginx['proxy_custom_buffer_size'] = '8k' in /etc/gitlab/gitlab.rb on all web nodes on the secondary site.
2. Reconfigure the secondary using sudo gitlab-ctl reconfigure.

If you still get this error, you can further increase the buffer size by repeating the steps above and changing the 8k size, for example by doubling it to 16k.
Geo Admin Area shows ‘Unknown’ for health status and ‘Request failed with status code 401’
Geo Admin Area shows ‘Unhealthy’ after enabling Maintenance Mode
In GitLab 13.9 through GitLab 14.3, when GitLab Maintenance Mode is enabled, the status of Geo secondary sites stops getting updated. After 10 minutes, the status changes to Unhealthy.

Geo secondary sites continue to replicate and verify data, and the secondary sites should still be usable. You can use the Sync status Rake task to determine the actual status of a secondary site during Maintenance Mode.

This bug was fixed in GitLab 14.4.
Primary site returns 500 error when accessing /admin/geo/replication/projects

On a Geo primary site this error can be ignored. This happens because GitLab is attempting to display registries from the Geo tracking database, which doesn't exist on the primary site (only the original projects exist on the primary; no replicated projects are present, therefore no tracking database exists).
Secondary site returns 400 error “Request header or cookie too large”
This error can happen when the internal URL of the primary site is incorrect.
To fix this issue, set the primary site’s internal URL to a URL that is:
1. Enter the Rails console on the primary site.

2. Run the following, replacing https://unique.url.for.primary.site with your specific internal URL. For example, depending on your network configuration, you could use an IP address, like http://1.2.3.4.

   GeoNode.where(primary: true).first.update!(internal_url: "https://unique.url.for.primary.site")
Secondary site returns Received HTTP code 403 from proxy after CONNECT
If you have installed GitLab using the Linux package (Omnibus) and have configured the no_proxy custom environment variable for Gitaly, you may experience this issue. Affected versions:

- 15.4.6
- 15.5.0 - 15.5.6
- 15.6.0 - 15.6.3
- 15.7.0 - 15.7.1

This is due to a bug introduced in the included version of cURL shipped with Omnibus GitLab 15.4.6 and later. You are encouraged to upgrade to a later version where this has been fixed.
The bug causes all wildcard domains (.example.com) to be ignored except for the last one in the no_proxy environment variable list. Therefore, if for any reason you cannot upgrade to a newer version, you can work around the issue by moving your wildcard domain to the end of the list:

1. Edit /etc/gitlab/gitlab.rb:

   gitaly['env'] = {
     "no_proxy" => "server.yourdomain.org, .yourdomain.com"
   }

2. Reconfigure GitLab:

   sudo gitlab-ctl reconfigure

You can have only one wildcard domain in the no_proxy list.
Secondary site shows "Unhealthy" in UI after changing the value of external_url for the primary site

In this case, make sure to update the changed URL on all your sites:
Fixing non-PostgreSQL replication failures
If you notice replication failures in Admin > Geo > Sites or the Sync status Rake task, you can try to resolve the failures with the following general steps:

1. Geo automatically retries failures. If the failures are new and few in number, or if you suspect the root cause is already resolved, then you can wait to see if the failures go away.
2. If failures were present for a long time, then many retries have already occurred, and the interval between automatic retries has increased to up to 4 hours depending on the type of failure. If you suspect the root cause is already resolved, you can manually retry replication or verification.
3. If the failures persist, use the following sections to try to resolve them.
Manually retry replication or verification
Adding this ability to other data types is proposed in issue 364725.

The following sections describe how to use internal application commands in the Rails console to cause replication or verification immediately.
Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
Blob types

- Ci::JobArtifact
- Ci::PipelineArtifact
- Ci::SecureFile
- LfsObject
- MergeRequestDiff
- Packages::PackageFile
- PagesDeployment
- Terraform::StateVersion
- Upload
Packages::PackageFile is used in the following Rails console examples, but things generally work the same for the other types.
Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
The Replicator
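The examples below move between a replicable model record, its replicator, and its registry record in the tracking database. This sketch follows the Self-Service Framework conventions; treat the exact accessors as assumptions for your GitLab version:

```ruby
# Rails console, on either Geo site.
model_record = Packages::PackageFile.last
replicator   = model_record.replicator   # each replicable model has a replicator
registry     = replicator.registry       # its registry record in the tracking database
registry.replicator.model_record         # navigate back to the model
```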
Replicate a package file, synchronously, given an ID
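A sketch, run in a Rails console on the secondary site; 123 is a placeholder model ID, and the download method may be private (hence send) depending on version:

```ruby
# Secondary site Rails console. 123 is a placeholder Packages::PackageFile ID.
model_record = Packages::PackageFile.find(123)
model_record.replicator.send(:download)   # fetch the blob from the primary synchronously
```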
Replicate a package file, synchronously, given a registry ID
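The same operation starting from the registry record instead; 456 is a placeholder registry ID:

```ruby
# Secondary site Rails console. 456 is a placeholder Geo::PackageFileRegistry ID.
registry = Geo::PackageFileRegistry.find(456)
registry.replicator.send(:download)
```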
Find registry records of blobs that failed to sync
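A sketch using the failed scope that SSF registry classes expose (the scope name is an assumption for your version):

```ruby
# Secondary site Rails console.
Geo::PackageFileRegistry.failed.count                            # how many failed
Geo::PackageFileRegistry.failed.first(10)                        # inspect a sample
Geo::PackageFileRegistry.failed.pluck(:last_sync_failure).uniq   # distinct failure messages
```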
Find registry records of blobs that are missing on the primary site
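Building on the failure message described earlier (The file is missing on the Geo primary site), a sketch that filters on it:

```ruby
# Secondary site Rails console. Matches the sync-failure message used since GitLab 14.7.
Geo::PackageFileRegistry
  .where("last_sync_failure LIKE ?", "The file is missing on the Geo primary site%")
  .count
```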
Verify package files on the secondary manually
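A sketch that re-runs verification for a small sample, assuming the replicator exposes a verify method as in the Self-Service Framework:

```ruby
# Secondary site Rails console; verifies ten registries synchronously.
Geo::PackageFileRegistry.limit(10).each do |registry|
  registry.replicator.verify
  puts "#{registry.id}: #{registry.reload.verification_state}"
end
```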
Reverify all uploads (or any SSF data type which is verified)
1. SSH into a GitLab Rails node in the primary Geo site.
2. Open the Rails console.
3. Mark all uploads as "pending verification":
Upload.verification_state_table_class.each_batch do |relation|
  relation.update_all(verification_state: 0)
end
This causes the primary to start checksumming all Uploads.
When a primary successfully checksums a record, then all secondaries recalculate the checksum as well, and they compare the values.
For other SSF data types, replace Upload in the command above with the desired model class.
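For example, a sketch of the same reverification for LFS objects:

```ruby
# Primary site Rails console -- same pattern, different model class.
LfsObject.verification_state_table_class.each_batch do |relation|
  relation.update_all(verification_state: 0)
end
```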
Repository types, except for project or project wiki repositories

- SnippetRepository
- GroupWikiRepository

SnippetRepository is used in the examples below, but things generally work the same for the other Repository types.
Start a Rails console session to enact the following, basic troubleshooting steps.
Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
The Replicator
Replicate a snippet repository, synchronously, given an ID
Replicate a snippet repository, synchronously, given a registry ID
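A sketch covering both lookups, with placeholder IDs; the sync method name (sync_repository, possibly private) is an assumption for your version:

```ruby
# Secondary site Rails console. 123 and 456 are placeholder IDs.
model_record = SnippetRepository.find(123)
model_record.replicator.send(:sync_repository)

registry = Geo::SnippetRepositoryRegistry.find(456)
registry.replicator.send(:sync_repository)
```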
Project or project wiki repositories
Find repository verification failures
Start a Rails console session to gather the following, basic troubleshooting information.
Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
Get the number of verification failed repositories
Find the verification failed repositories
Find repositories that failed to sync
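Project and project wiki repositories are tracked in Geo::ProjectRegistry (the legacy, pre-SSF registry). A sketch of the three queries, treating the scope names as assumptions:

```ruby
# Secondary site Rails console.
Geo::ProjectRegistry.verification_failed('repository').count   # number of verification failures
Geo::ProjectRegistry.verification_failed('repository')         # the failing registry records
Geo::ProjectRegistry.sync_failed('repository')                 # repositories that failed to sync
```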
Resync project and project wiki repositories
Start a Rails console session to enact the following, basic troubleshooting steps.
Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
Queue up all repositories for resync
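A sketch that flags every project repository and wiki for resync; Sidekiq then works through them (see the note below):

```ruby
# Secondary site Rails console. Marks everything for resync; heavy on large instances.
Geo::ProjectRegistry.update_all(resync_repository: true, resync_wiki: true)
```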
When you run this, Sidekiq handles each sync.
Sync individual repository now
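A sketch for syncing one project immediately, with a placeholder project path:

```ruby
# Secondary site Rails console. '<group/project>' is a placeholder full path.
project = Project.find_by_full_path('<group/project>')
Geo::RepositorySyncService.new(project).execute
```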
Find repository check failures in a Geo secondary site
When enabled for all projects, Repository checks are also performed on Geo secondary sites. The metadata is stored in the Geo tracking database.

Repository check failures on a Geo secondary site do not necessarily imply a replication problem. Here is a general approach to resolve these failures.
1. Find affected repositories as mentioned below, as well as their logged errors.
2. Try to diagnose specific git fsck errors. The range of possible errors is wide; try putting them into search engines.
3. Test normal functions of the affected repositories. Pull from the secondary, view the files.
4. Check if the primary site's copy of the repository has an identical git fsck error. If you are planning a failover, then consider prioritizing that the secondary site has the same information that the primary site has. Ensure you have a backup of the primary, and follow planned failover guidelines.
5. Push to the primary and check if the change gets replicated to the secondary site.
6. If replication is not automatically working, try to manually sync the repository.
Start a Rails console session to enact the following, basic troubleshooting steps.
Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
Get the number of repositories that failed the repository check
Find the repositories that failed the repository check
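A sketch of both queries, using the last_repository_check_failed flag that the recheck snippet below also relies on:

```ruby
# Secondary site Rails console.
Geo::ProjectRegistry.where(last_repository_check_failed: true).count   # how many failed
Geo::ProjectRegistry.where(last_repository_check_failed: true)         # the failing registries
```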
Recheck repositories that failed the repository check
When you run this, fsck is executed against each failed repository.

The fsck Rake command can be used on the secondary site to understand why the repository check might be failing.

Geo::ProjectRegistry.where(last_repository_check_failed: true).each do |pr|
  RepositoryCheck::SingleRepositoryWorker.new.perform(pr.project_id)
end
Fixing client errors
Authorization errors from LFS HTTP(S) client requests
You may have problems if you're running a version of Git LFS before 2.4.2. As noted in this authentication issue, requests redirected from the secondary to the primary site do not properly send the Authorization header. This may result in either an infinite Authorization <-> Redirect loop, or Authorization error messages.
Error: Net::ReadTimeout when pushing through SSH on a Geo secondary
When you push large repositories through SSH on a Geo secondary site, you may encounter a timeout. This is because Rails proxies the push to the primary and has a 60 second default timeout, as described in this Geo issue.

Current workarounds are:

- Push through HTTP instead, where Workhorse proxies the request to the primary (or redirects to the primary if Geo proxying is not enabled).
- Push directly to the primary.

Example log (gitlab-shell.log):
Failed to contact primary https://primary.domain.com/namespace/push_test.git\\nError: Net::ReadTimeout\",\"result\":null}" code=500 method=POST pid=5483 url="http://127.0.0.1:3000/api/v4/geo/proxy_git_push_ssh/push"
Recovering from a partial failover
If the above steps are not successful, proceed through the next steps:
Check OS locale data compatibility
Geo uses PostgreSQL and Streaming Replication to replicate data across Geo sites. PostgreSQL uses locale data provided by the operating system's C library for sorting text. If the locale data in the C library is incompatible across Geo sites, queries can return erroneous results that lead to incorrect behavior on secondary sites.

For example, Ubuntu 18.04 (and earlier) and RHEL/CentOS 7 (and earlier) are incompatible with their later releases. See the PostgreSQL wiki for more details.
On all hosts running PostgreSQL, across all Geo sites, run the following shell command:
( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.UTF-8 sort
The output looks like either:

1-1
11

or the reverse order:

11
1-1

If the output is identical on all hosts, then they are running compatible versions of locale data.
If the output differs on some hosts, PostgreSQL replication does not work properly: indexes are corrupted on the database replicas. You should select operating system versions that are compatible.
A full index rebuild is required if the on-disk data is transferred ‘at rest’ to an operating system with an incompatible locale, or through replication.
This check is also required when using a mixture of GitLab deployments. The locale might be different between a Linux package install, a GitLab Docker container, a Helm chart deployment, or external database services.