[go: up one dir, main page]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PATH WALK II: Add --path-walk option to 'git pack-objects' #1819

Open
wants to merge 12 commits into
base: api-upstream
Choose a base branch
from

Conversation

derrickstolee
Copy link
@derrickstolee derrickstolee commented Oct 29, 2024

WIP

Thanks, -Stolee

Here is the range-diff from the equivalent patches in the previous version:
 1:  3fdb57edbc5 !  1:  6365d8148c4 pack-objects: extract should_attempt_deltas()
    @@ Metadata
      ## Commit message ##
         pack-objects: extract should_attempt_deltas()
     
    -    This will be helpful in a future change that introduces a new way to
    -    compute deltas.
    -
    -    Be careful to preserve the nr_deltas counting logic in the existing
    -    method, but take the rest of the logic wholesale.
    +    This will be helpful in a future change.
     
         Signed-off-by: Derrick Stolee <stolee@gmail.com>
     
 2:  a0475c7cba8 =  2:  0a4c2c2da0e pack-objects: add --path-walk option
 3:  73c8b61e87b =  3:  df74fe9a35b pack-objects: update usage to match docs
 4:  21dc3723c36 !  4:  4e08947ea0a p5313: add performance tests for --path-walk
    @@ Commit message
     
         Running on my copy of the Git repository results in this data:
     
    -    Test                                      this tree
    -    ---------------------------------------------------------
    -    5313.2: thin pack                         0.01(0.00+0.00)
    -    5313.3: thin pack size                               1.1K
    -    5313.4: thin pack with --path-walk        0.01(0.01+0.00)
    -    5313.5: thin pack size with --path-walk              1.1K
    -    5313.6: big pack                          2.52(6.59+0.38)
    -    5313.7: big pack size                               14.1M
    -    5313.8: big pack with --path-walk         4.90(5.76+0.26)
    +    Test                                      HEAD
    +    --------------------------------------------------------------
    +    5313.2: thin pack                         0.00(0.00+0.00)
    +    5313.3: thin pack size                                589
    +    5313.4: thin pack with --path-walk        0.00(0.00+0.00)
    +    5313.5: thin pack size with --path-walk               589
    +    5313.6: big pack                          2.76(7.19+0.27)
    +    5313.7: big pack size                               14.0M
    +    5313.8: big pack with --path-walk         5.76(6.72+0.16)
         5313.9: big pack size with --path-walk              13.2M
     
         Note that the timing is slower because there is no threading in the
    @@ Commit message
     
         Running the tests on this repo results in the following output:
     
    -    Test                                      this tree
    -    ----------------------------------------------------------
    -    5313.2: thin pack                         0.28(0.38+0.02)
    +    Test                                      HEAD
    +    --------------------------------------------------------------
    +    5313.2: thin pack                         0.28(0.40+0.03)
         5313.3: thin pack size                               1.2M
    -    5313.4: thin pack with --path-walk        0.08(0.06+0.01)
    +    5313.4: thin pack with --path-walk        0.07(0.06+0.00)
         5313.5: thin pack size with --path-walk             18.4K
    -    5313.6: big pack                          4.05(29.62+0.43)
    -    5313.7: big pack size                               20.0M
    -    5313.8: big pack with --path-walk         5.99(9.06+0.24)
    -    5313.9: big pack size with --path-walk              16.4M
    +    5313.6: big pack                          4.05(29.48+0.41)
    +    5313.7: big pack size                               19.7M
    +    5313.8: big pack with --path-walk         6.01(9.17+0.20)
    +    5313.9: big pack size with --path-walk              16.5M
     
         Notice in particular that in the small thin pack, the time performance
         has improved from 0.28s to 0.08s and this is likely due to the improved
    @@ Commit message
         Finally, running this on a copy of the Linux kernel repository results
         in these data points:
     
    -    Test                                      this tree
    -    -----------------------------------------------------------
    -    5313.2: thin pack                         0.00(0.00+0.00)
    -    5313.3: thin pack size                               5.8K
    -    5313.4: thin pack with --path-walk        0.00(0.01+0.00)
    -    5313.5: thin pack size with --path-walk              5.8K
    -    5313.6: big pack                          24.39(65.81+1.31)
    -    5313.7: big pack size                              155.7M
    -    5313.8: big pack with --path-walk         41.07(60.69+0.68)
    -    5313.9: big pack size with --path-walk             150.8M
    +    Test                                      HEAD
    +    --------------------------------------------------------------
    +    5313.2: thin pack                         0.03(0.02+0.00)
    +    5313.3: thin pack size                               4.6K
    +    5313.4: thin pack with --path-walk        0.03(0.01+0.01)
    +    5313.5: thin pack size with --path-walk              4.6K
    +    5313.6: big pack                          21.06(60.57+1.45)
    +    5313.7: big pack size                              158.3M
    +    5313.8: big pack with --path-walk         37.65(57.83+0.67)
    +    5313.9: big pack size with --path-walk             152.3M
     
         Signed-off-by: Derrick Stolee <stolee@gmail.com>
     
 5:  6f96b1c227a =  5:  9235f9bb9c8 pack-objects: introduce GIT_TEST_PACK_PATH_WALK
 -:  ----------- >  6:  201b2210712 t5538: add tests to confirm deltas in shallow pushes
 6:  834c9ea2709 !  7:  5caf28ec6c7 repack: add --path-walk option
    @@ Commit message
         through in 'git repack'. This presents interesting testing opportunities for
         comparing the different repacking strategies against each other.
     
    -    In my copy of the Git repository, the new tests in p5313 show these
    -    results:
    +    Add the --path-walk option to the performance tests in p5313.
    +
    +    For the microsoft/fluentui repo [1] checked out at a specific commit [2],
    +    the results are very interesting:
    +
    +    Test                                      HEAD
    +    ----------------------------------------------------------
    +    5313.2: thin pack                         0.41(0.48+0.03)
    +    5313.3: thin pack size                               1.2M
    +    5313.4: thin pack with --path-walk        0.08(0.05+0.02)
    +    5313.5: thin pack size with --path-walk             18.4K
    +    5313.6: big pack                          4.47(30.62+0.40)
    +    5313.7: big pack size                               19.6M
    +    5313.8: big pack with --path-walk         6.76(9.87+0.23)
    +    5313.9: big pack size with --path-walk              16.5M
    +    5313.10: repack                           96.87(664.29+2.75)
    +    5313.11: repack size                               439.5M
    +    5313.12: repack with --path-walk          95.68(109.90+0.92)
    +    5313.13: repack size with --path-walk              122.6M
     
    -    Test                                      this tree
    -    -------------------------------------------------------------
    -    5313.10: repack                           27.88(150.23+2.70)
    -    5313.11: repack size                               228.2M
    -    5313.12: repack with --path-walk          134.59(148.77+0.81)
    -    5313.13: repack size with --path-walk              209.7M
    +    [1] https://github.com/microsoft/fluentui
    +    [2] e70848ebac1cd720875bccaa3026f4a9ed700e08
     
    -    Note that the 'git pack-objects --path-walk' feature is not integrated
    -    with threads. Look forward to a future change that will introduce
    -    threading to improve the time performance of this feature with
    -    equivalent space performance.
    +    This repo suffers from having a lot of paths that collide in the name
    +    hash, so examining them in groups by path leads to better deltas. Also,
    +    in this case, the single-threaded implementation is competitive with the
    +    full repack. This is saving time diffing files that have significant
    +    differences from each other.
     
    -    For the microsoft/fluentui repo [1] had some interesting aspects for the
    -    previous tests in p5313, so here are the repack results:
    +    A similar, but private, repo has even more extremes during repacking:
     
    -    Test                                      this tree
    -    -------------------------------------------------------------
    -    5313.10: repack                           91.76(680.94+2.48)
    -    5313.11: repack size                               439.1M
    -    5313.12: repack with --path-walk          110.35(130.46+0.74)
    -    5313.13: repack size with --path-walk              155.3M
    +    Test                                      HEAD
    +    -----------------------------------------------------------------
    +    5313.10: repack                           2138.22(11961.00+17.67)
    +    5313.11: repack size                                 6.4G
    +    5313.12: repack with --path-walk          1351.46(1418.28+3.96)
    +    5313.13: repack size with --path-walk              804.1M
     
    -    [1] https://github.com/microsoft/fluentui
    +    There are small benefits in size for my copy of the Git repository:
     
    -    Here, we see the significant improvement of a full repack using this
    -    strategy. The name-hash collisions in this repo cause the space
    -    problems. Those collisions also cause the repack command to spend a lot
    -    of cycles trying to find delta bases among files that are not actually
    -    very similar, so the lack of threading with the --path-walk feature is
    -    less pronounced in the process time.
    +    Test                                      HEAD
    +    -----------------------------------------------------------
    +    5313.10: repack                           22.11(98.37+1.64)
    +    5313.11: repack size                               126.4M
    +    5313.12: repack with --path-walk          66.89(75.61+0.58)
    +    5313.13: repack size with --path-walk              109.6M
     
    -    For the Linux kernel repository, we have these stats:
    +    As well as in the nodejs/node repository [3]:
     
    -    Test                                      this tree
    +    Test                                      HEAD
    +    -------------------------------------------------------------
    +    5313.2: thin pack                         0.01(0.01+0.00)
    +    5313.3: thin pack size                               1.6K
    +    5313.4: thin pack with --path-walk        0.02(0.01+0.00)
    +    5313.5: thin pack size with --path-walk              1.6K
    +    5313.6: big pack                          5.35(12.43+0.32)
    +    5313.7: big pack size                               52.2M
    +    5313.8: big pack with --path-walk         7.12(11.97+0.27)
    +    5313.9: big pack size with --path-walk              52.1M
    +    5313.10: repack                           87.74(342.90+4.24)
    +    5313.11: repack size                               739.7M
    +    5313.12: repack with --path-walk          212.79(245.05+1.78)
    +    5313.13: repack size with --path-walk              697.6M
    +
    +    [3] https://github.com/nodejs/node
    +
    +    This benefit also repeats in this instance in the Linux kernel repository:
    +
    +    Test                                      HEAD
         ---------------------------------------------------------------
    -    5313.10: repack                           553.61(1929.41+30.31)
    +    5313.2: thin pack                         0.04(0.00+0.03)
    +    5313.3: thin pack size                               4.6K
    +    5313.4: thin pack with --path-walk        0.03(0.01+0.01)
    +    5313.5: thin pack size with --path-walk              4.6K
    +    5313.6: big pack                          21.16(62.81+1.35)
    +    5313.7: big pack size                              158.3M
    +    5313.8: big pack with --path-walk         36.09(55.25+0.67)
    +    5313.9: big pack size with --path-walk             152.2M
    +    5313.10: repack                           734.26(2149.62+31.24)
         5313.11: repack size                                 2.5G
    -    5313.12: repack with --path-walk          1777.63(2044.16+7.47)
    -    5313.13: repack size with --path-walk                2.5G
    -
    -    This demonstrates that the --path-walk feature does not always present
    -    measurable improvements, especially in cases where the name-hash has
    -    very few collisions.
    +    5313.12: repack with --path-walk          1457.23(1618.15+7.00)
    +    5313.13: repack size with --path-walk                2.2G
    +
    +    It is important to see that even when the repository shape does not have
    +    many name-hash collisions, there is a slight space boost to be found
    +    using this method. Also, there is no known case where the space is
    +    worse with --path-walk. This is of course due to the second pass where
    +    all objects to be packed are sorted in the usual way and checked for
    +    deltas. This second pass is usually very fast as the path-walk has
    +    primed many objects with quality deltas that short-circuit other delta
    +    computation attempts.
     
         Signed-off-by: Derrick Stolee <stolee@gmail.com>
     
    @@ builtin/repack.c: static int run_update_server_info = 1;
     -	N_("git repack [<options>]"),
     +	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
     +	   "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
    -+	   "[--write-midx] [--full-path-walk]"),
    ++	   "[--write-midx] [--path-walk]"),
      	NULL
      };
      
 7:  6ef8d67af4b !  8:  8bfe5116178 repack: update usage to match docs
    @@ Commit message
     
         Signed-off-by: Derrick Stolee <stolee@gmail.com>
     
    - ## builtin/repack.c ##
    -@@ builtin/repack.c: static char *packdir, *packtmp_name, *packtmp;
    - static const char *const git_repack_usage[] = {
    - 	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
    - 	   "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
    --	   "[--write-midx] [--full-path-walk]"),
    -+	   "[--write-midx] [--path-walk]"),
    - 	NULL
    - };
    - 
    -
      ## t/t0450/txt-help-mismatches ##
     @@ t/t0450/txt-help-mismatches: rebase
      remote
 8:  1db90e361ba =  9:  6b9ca51bc7c pack-objects: enable --path-walk via config
 9:  0f3040b4b90 = 10:  3f05c89b9cc scalar: enable path-walk during push via config
10:  030d8ec238e = 11:  37572d70deb pack-objects: refactor path-walk delta phase
11:  fddc320eb0b ! 12:  7ae9a40f346 pack-objects: thread the path-based compression
    @@ Commit message
         objects being packed. (This was tested on a 16-core machine.)
     
         Test                                      HEAD~1    HEAD
    -    ---------------------------------------------------------------
    -    5313.2: thin pack                           0.01    0.01  +0.0%
    -    5313.4: thin pack with --path-walk          0.01    0.01  +0.0%
    -    5313.6: big pack                            2.54    2.60  +2.4%
    -    5313.8: big pack with --path-walk           4.70    3.09 -34.3%
    -    5313.10: repack                            28.75   28.55  -0.7%
    -    5313.12: repack with --path-walk          108.55   46.14 -57.5%
    +    -----------------------------------------------------------------
    +    5313.2: thin pack                         0.00        0.00  =
    +    5313.3: thin pack size                     589         589  +0.0%
    +    5313.4: thin pack with --path-walk        0.00        0.00  =
    +    5313.5: thin pack size with --path-walk    589         589  +0.0%
    +    5313.6: big pack                          2.84        2.80  -1.4%
    +    5313.7: big pack size                    14.0M       14.1M  +0.3%
    +    5313.8: big pack with --path-walk         5.46        3.77 -31.0%
    +    5313.9: big pack size with --path-walk   13.2M       13.2M  -0.0%
    +    5313.10: repack                           22.11      21.50  -2.8%
    +    5313.11: repack size                     126.4M     126.2M  -0.2%
    +    5313.12: repack with --path-walk          66.89      26.41 -60.5%
    +    5313.13: repack size with --path-walk    109.6M     109.6M  +0.0%
     
    -    On the microsoft/fluentui repo, where the --path-walk feature has been
    -    shown to be more effective in space savings, we get these results:
    +    This 60% reduction in 'git repack --path-walk' time is typical across
    +    all repos I used for testing. What is interesting is to compare when the
    +    overall time improves enough to outperform the standard case. These time
    +    improvements correlate with repositories with data shapes that
    +    significantly improve their data size as well.
     
    -    Test                                     HEAD~1    HEAD
    +    For example, the microsoft/fluentui repo has a 439M to 122M size
    +    reduction, and the repack time is now 36.6 seconds with --path-walk
    +    compared to 95+ seconds without it:
    +
    +    Test                                      HEAD~!    HEAD
    +    -----------------------------------------------------------------
    +    5313.2: thin pack                         0.41        0.42  +2.4%
    +    5313.3: thin pack size                    1.2M        1.2M  +0.0%
    +    5313.4: thin pack with --path-walk        0.08        0.05 -37.5%
    +    5313.5: thin pack size with --path-walk  18.4K       18.4K  +0.0%
    +    5313.6: big pack                          4.47        4.53  +1.3%
    +    5313.7: big pack size                    19.6M       19.7M  +0.3%
    +    5313.8: big pack with --path-walk         6.76        3.51 -48.1%
    +    5313.9: big pack size with --path-walk   16.5M       16.4M  -0.2%
    +    5313.10: repack                          96.87       99.05  +2.3%
    +    5313.11: repack size                    439.5M      439.0M  -0.1%
    +    5313.12: repack with --path-walk         95.68       36.55 -61.8%
    +    5313.13: repack size with --path-walk   122.6M      122.6M  +0.0%
    +
    +    In a more extreme example, an internal repository that has a similar
    +    name-hash collision issue to microsoft/fluentui reduces its size from
    +    6.4G to 805M with the --path-walk option. This also reduces the
    +    repacking time from 2,138 seconds to 478 seconds.
    +
    +    Test                                      HEAD~1    HEAD
    +    ------------------------------------------------------------------
    +    5313.10: repack                           2138.22   2138.19  -0.0%
    +    5313.11: repack size                         6.4G      6.4G  -0.0%
    +    5313.12: repack with --path-walk          1351.46    477.91 -64.6%
    +    5313.13: repack size with --path-walk      804.1M    804.1M  -0.0%
    +
    +    Finally, the Linux kernel repository is a good test for this repacking
    +    time change, even though the space savings is more reasonable:
    +
    +    Test                                      HEAD~1      HEAD
         ----------------------------------------------------------------
    -    5313.2: thin pack                         0.39     0.40  +2.6%
    -    5313.4: thin pack with --path-walk        0.08     0.07 -12.5%
    -    5313.6: big pack                          4.15     4.15  +0.0%
    -    5313.8: big pack with --path-walk         6.41     3.21 -49.9%
    -    5313.10: repack                          90.69    90.83  +0.2%
    -    5313.12: repack with --path-walk        108.23    49.09 -54.6%
    +    5313.10: repack                           734.26   735.11  +0.1%
    +    5313.11: repack size                        2.5G     2.5G  -0.0%
    +    5313.12: repack with --path-walk         1457.23   598.17 -59.0%
    +    5313.13: repack size with --path-walk       2.2G     2.2G  +0.0%
     
         Signed-off-by: Derrick Stolee <stolee@gmail.com>

This will be helpful in a future change.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.

This new walk is incompatible with some features and is ignored by
others:

 * Object filters are not currently integrated with the path-walk API,
   such as sparse-checkout or tree depth. A blobless packfile could be
   integrated easily, but that is deferred for later.

 * Server-focused features such as delta islands, shallow packs, and
   using a bitmap index are incompatible with the path-walk API.

 * The path walk API is only compatible with the --revs option, not
   taking object lists or pack lists over stdin. These alternative ways
   to specify the objects currently ignores the --path-walk option
   without even a warning.

Future changes will create performance tests that demonstrate the power
of this approach.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.

In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.

Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.

Running on my copy of the Git repository results in this data:

Test                                      HEAD
--------------------------------------------------------------
5313.2: thin pack                         0.00(0.00+0.00)
5313.3: thin pack size                                589
5313.4: thin pack with --path-walk        0.00(0.00+0.00)
5313.5: thin pack size with --path-walk               589
5313.6: big pack                          2.76(7.19+0.27)
5313.7: big pack size                               14.0M
5313.8: big pack with --path-walk         5.76(6.72+0.16)
5313.9: big pack size with --path-walk              13.2M

Note that the timing is slower because there is no threading in the
--path-walk case (yet).

The cases where the --path-walk option really shines is when the default
name-hash is overwhelmed with collisions. An open source example can be
found in the microsoft/fluentui repo [1] at a certain commit [2].

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Running the tests on this repo results in the following output:

Test                                      HEAD
--------------------------------------------------------------
5313.2: thin pack                         0.28(0.40+0.03)
5313.3: thin pack size                               1.2M
5313.4: thin pack with --path-walk        0.07(0.06+0.00)
5313.5: thin pack size with --path-walk             18.4K
5313.6: big pack                          4.05(29.48+0.41)
5313.7: big pack size                               19.7M
5313.8: big pack with --path-walk         6.01(9.17+0.20)
5313.9: big pack size with --path-walk              16.5M

Notice in particular that in the small thin pack, the time performance
has improved from 0.28s to 0.08s and this is likely due to the improved
size of the resulting pack: 18.4K instead of 1.2M.

Finally, running this on a copy of the Linux kernel repository results
in these data points:

Test                                      HEAD
--------------------------------------------------------------
5313.2: thin pack                         0.03(0.02+0.00)
5313.3: thin pack size                               4.6K
5313.4: thin pack with --path-walk        0.03(0.01+0.01)
5313.5: thin pack size with --path-walk              4.6K
5313.6: big pack                          21.06(60.57+1.45)
5313.7: big pack size                              158.3M
5313.8: big pack with --path-walk         37.65(57.83+0.67)
5313.9: big pack size with --path-walk             152.3M

Signed-off-by: Derrick Stolee <stolee@gmail.com>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, especially in conjunction with test such as:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object it too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
It can be notoriously difficult to detect if delta bases are being
computed properly during 'git push'. Construct an example where it will
make a kilobyte worth of difference when a delta base is not found. We
can then use the progress indicators to distinguish between bytes and
KiB depending on whether the delta base is found and used.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the results are very interesting:

Test                                      HEAD
----------------------------------------------------------
5313.2: thin pack                         0.41(0.48+0.03)
5313.3: thin pack size                               1.2M
5313.4: thin pack with --path-walk        0.08(0.05+0.02)
5313.5: thin pack size with --path-walk             18.4K
5313.6: big pack                          4.47(30.62+0.40)
5313.7: big pack size                               19.6M
5313.8: big pack with --path-walk         6.76(9.87+0.23)
5313.9: big pack size with --path-walk              16.5M
5313.10: repack                           96.87(664.29+2.75)
5313.11: repack size                               439.5M
5313.12: repack with --path-walk          95.68(109.90+0.92)
5313.13: repack size with --path-walk              122.6M

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. This is saving time diffing files that have significant
differences from each other.

A similar, but private, repo has even more extremes during repacking:

Test                                      HEAD
-----------------------------------------------------------------
5313.10: repack                           2138.22(11961.00+17.67)
5313.11: repack size                                 6.4G
5313.12: repack with --path-walk          1351.46(1418.28+3.96)
5313.13: repack size with --path-walk              804.1M

There are small benefits in size for my copy of the Git repository:

Test                                      HEAD
-----------------------------------------------------------
5313.10: repack                           22.11(98.37+1.64)
5313.11: repack size                               126.4M
5313.12: repack with --path-walk          66.89(75.61+0.58)
5313.13: repack size with --path-walk              109.6M

As well as in the nodejs/node repository [3]:

Test                                      HEAD
-------------------------------------------------------------
5313.2: thin pack                         0.01(0.01+0.00)
5313.3: thin pack size                               1.6K
5313.4: thin pack with --path-walk        0.02(0.01+0.00)
5313.5: thin pack size with --path-walk              1.6K
5313.6: big pack                          5.35(12.43+0.32)
5313.7: big pack size                               52.2M
5313.8: big pack with --path-walk         7.12(11.97+0.27)
5313.9: big pack size with --path-walk              52.1M
5313.10: repack                           87.74(342.90+4.24)
5313.11: repack size                               739.7M
5313.12: repack with --path-walk          212.79(245.05+1.78)
5313.13: repack size with --path-walk              697.6M

[3] https://github.com/nodejs/node

This benefit also repeats in this instance in the Linux kernel repository:

Test                                      HEAD
---------------------------------------------------------------
5313.2: thin pack                         0.04(0.00+0.03)
5313.3: thin pack size                               4.6K
5313.4: thin pack with --path-walk        0.03(0.01+0.01)
5313.5: thin pack size with --path-walk              4.6K
5313.6: big pack                          21.16(62.81+1.35)
5313.7: big pack size                              158.3M
5313.8: big pack with --path-walk         36.09(55.25+0.67)
5313.9: big pack size with --path-walk             152.2M
5313.10: repack                           734.26(2149.62+31.24)
5313.11: repack size                                 2.5G
5313.12: repack with --path-walk          1457.23(1618.15+7.00)
5313.13: repack size with --path-walk                2.2G

It is important to see that even when the repository shape does not have
many name-hash collisions, there is a slight space boost to be found
using this method. Also, there is no known case where the space is
worse with --path-walk. This is of course due to the second pass where
all objects to be packed are sorted in the usual way and checked for
deltas. This second pass is usually very fast as the path-walk has
primed many objects with quality deltas that short-circuit other delta
computation attempts.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The t0450 test script verifies that the builtin usage matches the
synopsis in the documentation. Update 'git repack' to match and remove
it from the exception list.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                                      HEAD~1    HEAD
-----------------------------------------------------------------
5313.2: thin pack                         0.00        0.00  =
5313.3: thin pack size                     589         589  +0.0%
5313.4: thin pack with --path-walk        0.00        0.00  =
5313.5: thin pack size with --path-walk    589         589  +0.0%
5313.6: big pack                          2.84        2.80  -1.4%
5313.7: big pack size                    14.0M       14.1M  +0.3%
5313.8: big pack with --path-walk         5.46        3.77 -31.0%
5313.9: big pack size with --path-walk   13.2M       13.2M  -0.0%
5313.10: repack                           22.11      21.50  -2.8%
5313.11: repack size                     126.4M     126.2M  -0.2%
5313.12: repack with --path-walk          66.89      26.41 -60.5%
5313.13: repack size with --path-walk    109.6M     109.6M  +0.0%

This 60% reduction in 'git repack --path-walk' time is typical across
all repos I used for testing. What is interesting is to compare when the
overall time improves enough to outperform the standard case. These time
improvements correlate with repositories with data shapes that
significantly improve their data size as well.

For example, the microsoft/fluentui repo has a 439M to 122M size
reduction, and the repack time is now 36.6 seconds with --path-walk
compared to 95+ seconds without it:

Test                                      HEAD~!    HEAD
-----------------------------------------------------------------
5313.2: thin pack                         0.41        0.42  +2.4%
5313.3: thin pack size                    1.2M        1.2M  +0.0%
5313.4: thin pack with --path-walk        0.08        0.05 -37.5%
5313.5: thin pack size with --path-walk  18.4K       18.4K  +0.0%
5313.6: big pack                          4.47        4.53  +1.3%
5313.7: big pack size                    19.6M       19.7M  +0.3%
5313.8: big pack with --path-walk         6.76        3.51 -48.1%
5313.9: big pack size with --path-walk   16.5M       16.4M  -0.2%
5313.10: repack                          96.87       99.05  +2.3%
5313.11: repack size                    439.5M      439.0M  -0.1%
5313.12: repack with --path-walk         95.68       36.55 -61.8%
5313.13: repack size with --path-walk   122.6M      122.6M  +0.0%

In a more extreme example, an internal repository that has a similar
name-hash collision issue to microsoft/fluentui reduces its size from
6.4G to 805M with the --path-walk option. This also reduces the
repacking time from 2,138 seconds to 478 seconds.

Test                                      HEAD~1    HEAD
------------------------------------------------------------------
5313.10: repack                           2138.22   2138.19  -0.0%
5313.11: repack size                         6.4G      6.4G  -0.0%
5313.12: repack with --path-walk          1351.46    477.91 -64.6%
5313.13: repack size with --path-walk      804.1M    804.1M  -0.0%

Finally, the Linux kernel repository is a good test for this repacking
time change, even though the space savings is more reasonable:

Test                                      HEAD~1      HEAD
----------------------------------------------------------------
5313.10: repack                           734.26   735.11  +0.1%
5313.11: repack size                        2.5G     2.5G  -0.0%
5313.12: repack with --path-walk         1457.23   598.17 -59.0%
5313.13: repack size with --path-walk       2.2G     2.2G  +0.0%

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant