forked from git/git
-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PATH WALK II: Add --path-walk option to 'git pack-objects' #1819
Open
derrickstolee
wants to merge
1,348
commits into
gitgitgadget:api-upstream
Choose a base branch
from
derrickstolee:path-walk-upstream
base: api-upstream
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
PATH WALK II: Add --path-walk option to 'git pack-objects' #1819
derrickstolee
wants to merge
1,348
commits into
gitgitgadget:api-upstream
from
derrickstolee:path-walk-upstream
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
2b762f3
to
7ae9a40
Compare
97d669a
to
5252076
Compare
7ae9a40
to
389c18f
Compare
5252076
to
0bb607e
Compare
389c18f
to
bc37596
Compare
bc37596
to
68bc637
Compare
0bb607e
to
e716672
Compare
68bc637
to
bc37596
Compare
781b2ea
to
ef54342
Compare
Adapts reftable tree test script to clar framework by using clar assertions where necessary. Mentored-by: Patrick Steinhardt <[email protected]> Signed-off-by: Seyi Kuforiji <[email protected]> Acked-by: Patrick Steinhardt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
…ollowup * kn/reflog-migration-fix: reftable: write correct max_update_index to header
In put_be32(), we right-shift a uint32_t value various amounts and then assign the low 8-bits to individual "unsigned char" bytes, throwing away the high bits. For shifts smaller than 24 bits, those thrown away bits will be arbitrary bits from the original uint32_t. This works exactly as we want, but if you feed a constant, then sparse complains. For example if we write this (which we plan to do in a future patch): put_be32(hdr, PACK_SIGNATURE); then "make sparse" produces: compat/bswap.h:175:22: error: cast truncates bits from constant value (5041 becomes 41) compat/bswap.h:176:22: error: cast truncates bits from constant value (504143 becomes 43) compat/bswap.h:177:22: error: cast truncates bits from constant value (5041434b becomes 4b) And the same issue exists in the other put_be*() functions, when used with a constant. We can silence this warning by explicitly masking off the truncated bits. The compiler is smart enough to know the result is the same, and the asm generated by gcc (with both -O0 and -O2) is identical. Curiously this line already exists: put_be32(&hdr_version, INDEX_EXTENSION_VERSION2); in the fsmonitor.c file, but it does not get flagged because the CPP macro expands to a small integer (2). Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
Both index-pack and unpack-objects accept a --pack_header argument. This is an undocumented internal argument used by receive-pack and fetch to pass along information about the header of the pack, which they've already read from the incoming stream. In preparation for a bugfix, let's factor the duplicated code into a common helper. The callers are still responsible for identifying the option. While this could likewise be factored out, it is more flexible this way (e.g., if they ever started using parse-options and wanted to handle both the stuck and unstuck forms). Likewise, the callers are responsible for reporting errors, though they both just call die(). I've tweaked unpack-objects to match index-pack in marking the error for translation. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
In order to recreate a pack header in our in-memory buffer, we cast the buffer to a "struct pack_header" and assign the individual fields. This is reported to cause SIGBUS on sparc64 due to alignment issues. We can work around this by using put_be32() which will write individual bytes into the buffer. Reported-by: Koakuma <[email protected]> Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
Both of these commands read the incoming pack into a static unsigned char buffer in BSS, and then parse it by casting the start of the buffer to a struct pack_header. This can result in SIGBUS on some platforms if the compiler doesn't place the buffer in a position that is properly aligned for 4-byte integers. This reportedly happens with unpack-objects (but not index-pack) on sparc64 when compiled with clang (but not gcc). But we are definitely in the wrong in both spots; since the buffer's type is unsigned char, we can't depend on larger alignment. When it works it is only because we are lucky. We'll fix this by switching to get_be32() to read the headers (just like the last few commits similarly switched us to put_be32() for writing into the same buffer). It would be nice to factor this out into a common helper function, but the interface ends up quite awkward. Either the caller needs to hardcode how many bytes we'll need, or it needs to pass us its fill()/use() functions as pointers. So I've just fixed both spots in the same way; this is not code that is likely to be repeated a third time (most of the pack reading code uses an mmap'd buffer, which should be properly aligned). I did make one tweak to the shared code: our pack_version_ok() macro expects us to pass the big-endian value we'd get by casting. We can introduce a "native" variant which uses the host integer ordering. Reported-by: Koakuma <[email protected]> Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
When parsing --pack_header=, we manually skip 14 bytes to the data. Let's use skip_prefix() to do this automatically. Note that we overwrite our pointer to the front of the string, so we have to add more context to the error message. We could avoid this by declaring an extra pointer to hold the value, but I think the modified message is actually preferable; it should give translators a bit more context. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
The code to check LSan results has been simplified and made more robust. * jk/lsan-race-ignore-false-positive: test-lib: add a few comments to LSan log checking test-lib: simplify lsan results check test-lib: invert return value of check_test_results_san_file_empty
Test clean-up. * jk/t7407-use-test-grep: t7407: use test_grep
The code to compute "unique" name used git_rand() which can fail or get stuck; the callsite does not require cryptographic security. Introduce the "insecure" mode and use it appropriately. * ps/reftable-get-random-fix: reftable/stack: accept insecure random bytes wrapper: allow generating insecure random bytes
A misconfigured "fsck.skiplist" configuration variable was not diagnosed as an error, which has been corrected. * jt/fsck-skiplist-parse-fix: fsck: reject misconfigured fsck.skipList
More code paths have a repository passed through the callchain, instead of assuming the primary the_repository object. * ps/the-repository: match-trees: stop using `the_repository` graph: stop using `the_repository` add-interactive: stop using `the_repository` tmp-objdir: stop using `the_repository` resolve-undo: stop using `the_repository` credential: stop using `the_repository` mailinfo: stop using `the_repository` diagnose: stop using `the_repository` server-info: stop using `the_repository` send-pack: stop using `the_repository` serve: stop using `the_repository` trace: stop using `the_repository` pager: stop using `the_repository` progress: stop using `the_repository`
Docfix. * aj/difftool-config-doc-fix: difftool docs: restore correct position of tool list
Completion script updates for zsh * dk/zsh-config-completion-fix: completion: repair config completion for Zsh
Doc markup fix. * mh/gitattr-doc-markup-fix: docs: fix typesetting of merge driver placeholders
Test update. * sk/unit-test-hash: t/unit-tests: convert hash to use clar test framework
Going into a secondary worktree and asking "is the main worktree bare?" did not work correctly when per-worktree configuration option was in use, which has been corrected. * op/worktree-is-main-bare-fix: worktree: detect from secondary worktree if main worktree is bare
Fetching into a bare repository incorrectly assumed it always used a mirror layout when deciding to update remote-tracking HEAD, which has been corrected. * bf/fetch-set-head-fix: fetch set_head: fix non-mirror remotes in bare repositories fetch set_head: refactor to use remote directly
Code clean-up. * kn/reflog-migration-fix-followup: reftable: prevent 'update_index' changes after adding records refs: use 'uint64_t' for 'ref_update.index' refs: mark `ref_transaction_update_reflog()` as static
Doc update. * ps/doc-http-upload-archive-service: doc: documentation for http.uploadarchive config option
Signed-off-by: Junio C Hamano <[email protected]>
Lazy-loading missing files in a blobless clone on demand is costly as it tends to be one-blob-at-a-time. "git backfill" is introduced to help bulk-download necessary files beforehand. * ds/backfill: backfill: assume --sparse when sparse-checkout is enabled backfill: add --sparse option backfill: add --min-batch-size=<n> option backfill: basic functionality and tests backfill: add builtin boilerplate
"git push --atomic --porcelain" used to ignore failures from the other side, losing the error status from the child process, which has been corrected. * ps/send-pack-unhide-error-in-atomic-push: send-pack: gracefully close the connection for atomic push t5543: atomic push reports exit code failure send-pack: new return code "ERROR_SEND_PACK_BAD_REF_STATUS" t5548: add porcelain push test cases for dry-run mode t5548: add new porcelain test cases t5548: refactor test cases by resetting upstream t5548: refactor to reuse setup_upstream() function t5504: modernize test by moving heredocs into test bodies
"git rev-list --missing=" learned to accept "print-info" that gives known details expected of the missing objects, like path and type. * jt/rev-list-missing-print-info: rev-list: extend print-info to print missing object type rev-list: add print-info action to print missing object path
"git difftool" code clean-up. * da/difftool-sans-the-repository: difftool: eliminate use of USE_THE_REPOSITORY_VARIABLE difftool: eliminate use of the_repository difftool: eliminate use of global variables
Test clean-up. * lo/t7603-path-is-file-update: t7603: replace test -f by test_path_is_file
A thunderbird helper script lost its bashism. * bc/contrib-thunderbird-patch-inline-fix: thunderbird-patch-inline: avoid bashism
Signed-off-by: Junio C Hamano <[email protected]>
Extract the scrip that "builds" Gitk from our Makefile so that we can reuse it in Meson. Signed-off-by: Patrick Steinhardt <[email protected]>
Upstream Git has introduced support for the Meson build system. Introduce support for Meson into gitk, as well, so that Git can easily build its vendored copy of Gitk via a `subproject()` directive. The instructions can be set up as follows: $ meson setup build $ meson compile -C build $ meson install -C build Specific options, like for example where Gitk shall be installed to, can be specified at setup time via `-D`. Available options can be discovered by running `meson configure` either in the source or build directory. Signed-off-by: Patrick Steinhardt <[email protected]>
* 'g4w-gitk' of https://github.com/dscho/gitk: gitk: make the "list references" default window width wider gitk: fix arrow keys in input fields with Tcl/Tk >= 8.6 gitk: Use an external icon file on Windows gitk: Unicode file name support gitk(Windows): avoid inadvertently calling executables in the worktree Signed-off-by: Johannes Sixt <[email protected]>
* 'pks-meson-support' of https://github.com/pks-t/gitk: gitk: introduce support for the Meson build system gitk: extract script to build executable Signed-off-by: Johannes Sixt <[email protected]>
* 'master' of https://github.com/j6t/gitk: gitk: introduce support for the Meson build system gitk: extract script to build executable gitk: make the "list references" default window width wider gitk: fix arrow keys in input fields with Tcl/Tk >= 8.6 gitk: Use an external icon file on Windows gitk: Unicode file name support gitk(Windows): avoid inadvertently calling executables in the worktree
This will be helpful in a future change. Signed-off-by: Derrick Stolee <[email protected]>
In order to more easily compute delta bases among objects that appear at the exact same path, add a --path-walk option to 'git pack-objects'. This option will use the path-walk API instead of the object walk given by the revision machinery. Since objects will be provided in batches representing a common path, those objects can be tested for delta bases immediately instead of waiting for a sort of the full object list by name-hash. This has multiple benefits, including avoiding collisions by name-hash. The objects marked as UNINTERESTING are included in these batches, so we are guaranteeing some locality to find good delta bases. After the individual passes are done on a per-path basis, the default name-hash is used to find other opportunistic delta bases that did not match exactly by the full path name. The current implementation performs delta calculations while walking objects, which is not ideal for a few reasons. First, this will cause the "Enumerating objects" phase to be much longer than usual. Second, it does not take advantage of threading during the path-scoped delta calculations. Even with this lack of threading, the path-walk option is sometimes faster than the usual approach. Future changes will refactor this code to allow for threading, but that complexity is deferred until later to keep this patch as simple as possible. This new walk is incompatible with some features and is ignored by others: * Object filters are not currently integrated with the path-walk API, such as sparse-checkout or tree depth. A blobless packfile could be integrated easily, but that is deferred for later. * Server-focused features such as delta islands, shallow packs, and using a bitmap index are incompatible with the path-walk API. * The path walk API is only compatible with the --revs option, not taking object lists or pack lists over stdin. These alternative ways to specify the objects currently ignores the --path-walk option without even a warning. Future changes will create performance tests that demonstrate the power of this approach. Signed-off-by: Derrick Stolee <[email protected]>
The t0450 test script verifies that builtin usage matches the synopsis in the documentation. Adjust the builtin to match and then remove 'git pack-objects' from the exception list. Signed-off-by: Derrick Stolee <[email protected]>
The previous change added a --path-walk option to 'git pack-objects'. Create a performance test that demonstrates the time and space benefits of the feature. In order to get an appropriate comparison, we need to avoid reusing deltas and recompute them from scratch. Compare the creation of a thin pack representing a small push and the creation of a relatively large non-thin pack. Running on my copy of the Git repository results in this data: Test HEAD -------------------------------------------------------------- 5313.2: thin pack 0.00(0.00+0.00) 5313.3: thin pack size 589 5313.4: thin pack with --path-walk 0.00(0.00+0.00) 5313.5: thin pack size with --path-walk 589 5313.6: big pack 2.76(7.19+0.27) 5313.7: big pack size 14.0M 5313.8: big pack with --path-walk 5.76(6.72+0.16) 5313.9: big pack size with --path-walk 13.2M Note that the timing is slower because there is no threading in the --path-walk case (yet). The cases where the --path-walk option really shines is when the default name-hash is overwhelmed with collisions. An open source example can be found in the microsoft/fluentui repo [1] at a certain commit [2]. [1] https://github.com/microsoft/fluentui [2] e70848ebac1cd720875bccaa3026f4a9ed700e08 Running the tests on this repo results in the following output: Test HEAD -------------------------------------------------------------- 5313.2: thin pack 0.28(0.40+0.03) 5313.3: thin pack size 1.2M 5313.4: thin pack with --path-walk 0.07(0.06+0.00) 5313.5: thin pack size with --path-walk 18.4K 5313.6: big pack 4.05(29.48+0.41) 5313.7: big pack size 19.7M 5313.8: big pack with --path-walk 6.01(9.17+0.20) 5313.9: big pack size with --path-walk 16.5M Notice in particular that in the small thin pack, the time performance has improved from 0.28s to 0.08s and this is likely due to the improved size of the resulting pack: 18.4K instead of 1.2M. Finally, running this on a copy of the Linux kernel repository results in these data points: Test HEAD -------------------------------------------------------------- 5313.2: thin pack 0.03(0.02+0.00) 5313.3: thin pack size 4.6K 5313.4: thin pack with --path-walk 0.03(0.01+0.01) 5313.5: thin pack size with --path-walk 4.6K 5313.6: big pack 21.06(60.57+1.45) 5313.7: big pack size 158.3M 5313.8: big pack with --path-walk 37.65(57.83+0.67) 5313.9: big pack size with --path-walk 152.3M Signed-off-by: Derrick Stolee <[email protected]>
There are many tests that validate whether 'git pack-objects' works as expected. Instead of duplicating these tests, add a new test environment variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default when specified. This was useful in testing the implementation of the --path-walk implementation, especially in conjunction with test such as: - t0411-clone-from-partial.sh : One test fetches from a repo that does not have the boundary objects. This causes the path-based walk to fail. Disable the variable for this test. - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo without a boundary object. - t5310-pack-bitmaps.sh : One test compares the case when packing with bitmaps to the case when packing without them. Since we disable the test variable when writing bitmaps, this causes a difference in the object list (the --path-walk option adds an extra object). Specify --no-path-walk in both processes for the comparison. Another test checks for a specific delta base, but when computing dynamically without using bitmaps, the base object it too small to be considered in the delta calculations so no base is used. - t5316-pack-delta-depth.sh : This script cares about certain delta choices and their chain lengths. The --path-walk option changes how these chains are selected, and thus changes the results of this test. - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of the --sparse option and how it combines with --path-walk. - t5332-multi-pack-reuse.sh : This test verifies that the preferred pack is used for delta reuse when possible. The --path-walk option is not currently aware of the preferred pack at all, so finds a different delta base. - t7406-submodule-update.sh : When using the variable, the --depth option collides with the --path-walk feature, resulting in a warning message. Disable the variable so this warning does not appear. I want to call out one specific test change that is only temporary: - t5530-upload-pack-error.sh : One test cares specifically about an "unable to read" error message. Since the current implementation performs delta calculations within the path-walk API callback, a different "unable to get size" error message appears. When this is changed in a future refactoring, this test change can be reverted. Signed-off-by: Derrick Stolee <[email protected]>
It can be notoriously difficult to detect if delta bases are being computed properly during 'git push'. Construct an example where it will make a kilobyte worth of difference when a delta base is not found. We can then use the progress indicators to distinguish between bytes and KiB depending on whether the delta base is found and used. Signed-off-by: Derrick Stolee <[email protected]>
Since 'git pack-objects' supports a --path-walk option, allow passing it through in 'git repack'. This presents interesting testing opportunities for comparing the different repacking strategies against each other. Add the --path-walk option to the performance tests in p5313. For the microsoft/fluentui repo [1] checked out at a specific commit [2], the results are very interesting: Test HEAD ---------------------------------------------------------- 5313.2: thin pack 0.41(0.48+0.03) 5313.3: thin pack size 1.2M 5313.4: thin pack with --path-walk 0.08(0.05+0.02) 5313.5: thin pack size with --path-walk 18.4K 5313.6: big pack 4.47(30.62+0.40) 5313.7: big pack size 19.6M 5313.8: big pack with --path-walk 6.76(9.87+0.23) 5313.9: big pack size with --path-walk 16.5M 5313.10: repack 96.87(664.29+2.75) 5313.11: repack size 439.5M 5313.12: repack with --path-walk 95.68(109.90+0.92) 5313.13: repack size with --path-walk 122.6M [1] https://github.com/microsoft/fluentui [2] e70848ebac1cd720875bccaa3026f4a9ed700e08 This repo suffers from having a lot of paths that collide in the name hash, so examining them in groups by path leads to better deltas. Also, in this case, the single-threaded implementation is competitive with the full repack. This is saving time diffing files that have significant differences from each other. A similar, but private, repo has even more extremes during repacking: Test HEAD ----------------------------------------------------------------- 5313.10: repack 2138.22(11961.00+17.67) 5313.11: repack size 6.4G 5313.12: repack with --path-walk 1351.46(1418.28+3.96) 5313.13: repack size with --path-walk 804.1M There are small benefits in size for my copy of the Git repository: Test HEAD ----------------------------------------------------------- 5313.10: repack 22.11(98.37+1.64) 5313.11: repack size 126.4M 5313.12: repack with --path-walk 66.89(75.61+0.58) 5313.13: repack size with --path-walk 109.6M As well as in the nodejs/node repository [3]: Test HEAD ------------------------------------------------------------- 5313.2: thin pack 0.01(0.01+0.00) 5313.3: thin pack size 1.6K 5313.4: thin pack with --path-walk 0.02(0.01+0.00) 5313.5: thin pack size with --path-walk 1.6K 5313.6: big pack 5.35(12.43+0.32) 5313.7: big pack size 52.2M 5313.8: big pack with --path-walk 7.12(11.97+0.27) 5313.9: big pack size with --path-walk 52.1M 5313.10: repack 87.74(342.90+4.24) 5313.11: repack size 739.7M 5313.12: repack with --path-walk 212.79(245.05+1.78) 5313.13: repack size with --path-walk 697.6M [3] https://github.com/nodejs/node This benefit also repeats in this instance in the Linux kernel repository: Test HEAD --------------------------------------------------------------- 5313.2: thin pack 0.04(0.00+0.03) 5313.3: thin pack size 4.6K 5313.4: thin pack with --path-walk 0.03(0.01+0.01) 5313.5: thin pack size with --path-walk 4.6K 5313.6: big pack 21.16(62.81+1.35) 5313.7: big pack size 158.3M 5313.8: big pack with --path-walk 36.09(55.25+0.67) 5313.9: big pack size with --path-walk 152.2M 5313.10: repack 734.26(2149.62+31.24) 5313.11: repack size 2.5G 5313.12: repack with --path-walk 1457.23(1618.15+7.00) 5313.13: repack size with --path-walk 2.2G It is important to see that even when the repository shape does not have many name-hash collisions, there is a slight space boost to be found using this method. Also, there is no known case where the space is worse with --path-walk. This is of course due to the second pass where all objects to be packed are sorted in the usual way and checked for deltas. This second pass is usually very fast as the path-walk has primed many objects with quality deltas that short-circuit other delta computation attempts. Signed-off-by: Derrick Stolee <[email protected]>
Users may want to enable the --path-walk option for 'git pack-objects' by default, especially underneath commands like 'git push' or 'git repack'. This should be limited to client repositories, since the --path-walk option disables bitmap walks, so would be bad to include in Git servers when serving fetches and clones. There is potential that it may be helpful to consider when repacking the repository, to take advantage of improved deltas across historical versions of the same files. Much like how "pack.useSparse" was introduced and included in "feature.experimental" before being enabled by default, use the repository settings infrastructure to make the new "pack.usePathWalk" config enabled by "feature.experimental" and "feature.manyFiles". Signed-off-by: Derrick Stolee <[email protected]>
Repositories registered with Scalar are expected to be client-only repositories that are rather large. This means that they are more likely to be good candidates for using the --path-walk option when running 'git pack-objects', especially under the hood of 'git push'. Enable this config in Scalar repositories. Signed-off-by: Derrick Stolee <[email protected]>
Previously, the --path-walk option to 'git pack-objects' would compute deltas inline with the path-walk logic. This would make the progress indicator look like it is taking a long time to enumerate objects, and then very quickly computed deltas. Instead of computing deltas on each region of objects organized by tree, store a list of regions corresponding to these groups. These can later be pulled from the list for delta compression before doing the "global" delta search. This presents a new progress indicator that can be used in tests to verify that this stage is happening. The current implementation is not integrated with threads, but could be done in a future update. Since we do not attempt to sort objects by size until after exploring all trees, we can remove the previous change to t5530 due to a different error message appearing first. Signed-off-by: Derrick Stolee <[email protected]>
Adapting the implementation of ll_find_deltas(), create a threaded version of the --path-walk compression step in 'git pack-objects'. This involves adding a 'regions' member to the thread_params struct, allowing each thread to own a section of paths. We can simplify the way jobs are split because there is no value in extending the batch based on name-hash the way sections of the object entry array are attempted to be grouped. We re-use the 'list_size' and 'remaining' items for the purpose of borrowing work in progress from other "victim" threads when a thread has finished its batch of work more quickly. Using the Git repository as a test repo, the p5313 performance test shows that the resulting size of the repo is the same, but the threaded implementation gives gains of varying degrees depending on the number of objects being packed. (This was tested on a 16-core machine.) Test HEAD~1 HEAD ----------------------------------------------------------------- 5313.2: thin pack 0.00 0.00 = 5313.3: thin pack size 589 589 +0.0% 5313.4: thin pack with --path-walk 0.00 0.00 = 5313.5: thin pack size with --path-walk 589 589 +0.0% 5313.6: big pack 2.84 2.80 -1.4% 5313.7: big pack size 14.0M 14.1M +0.3% 5313.8: big pack with --path-walk 5.46 3.77 -31.0% 5313.9: big pack size with --path-walk 13.2M 13.2M -0.0% 5313.10: repack 22.11 21.50 -2.8% 5313.11: repack size 126.4M 126.2M -0.2% 5313.12: repack with --path-walk 66.89 26.41 -60.5% 5313.13: repack size with --path-walk 109.6M 109.6M +0.0% This 60% reduction in 'git repack --path-walk' time is typical across all repos I used for testing. What is interesting is to compare when the overall time improves enough to outperform the standard case. These time improvements correlate with repositories with data shapes that significantly improve their data size as well. For example, the microsoft/fluentui repo has a 439M to 122M size reduction, and the repack time is now 36.6 seconds with --path-walk compared to 95+ seconds without it: Test HEAD~! HEAD ----------------------------------------------------------------- 5313.2: thin pack 0.41 0.42 +2.4% 5313.3: thin pack size 1.2M 1.2M +0.0% 5313.4: thin pack with --path-walk 0.08 0.05 -37.5% 5313.5: thin pack size with --path-walk 18.4K 18.4K +0.0% 5313.6: big pack 4.47 4.53 +1.3% 5313.7: big pack size 19.6M 19.7M +0.3% 5313.8: big pack with --path-walk 6.76 3.51 -48.1% 5313.9: big pack size with --path-walk 16.5M 16.4M -0.2% 5313.10: repack 96.87 99.05 +2.3% 5313.11: repack size 439.5M 439.0M -0.1% 5313.12: repack with --path-walk 95.68 36.55 -61.8% 5313.13: repack size with --path-walk 122.6M 122.6M +0.0% In a more extreme example, an internal repository that has a similar name-hash collision issue to microsoft/fluentui reduces its size from 6.4G to 805M with the --path-walk option. This also reduces the repacking time from 2,138 seconds to 478 seconds. Test HEAD~1 HEAD ------------------------------------------------------------------ 5313.10: repack 2138.22 2138.19 -0.0% 5313.11: repack size 6.4G 6.4G -0.0% 5313.12: repack with --path-walk 1351.46 477.91 -64.6% 5313.13: repack size with --path-walk 804.1M 804.1M -0.0% Finally, the Linux kernel repository is a good test for this repacking time change, even though the space savings is more reasonable: Test HEAD~1 HEAD ---------------------------------------------------------------- 5313.10: repack 734.26 735.11 +0.1% 5313.11: repack size 2.5G 2.5G -0.0% 5313.12: repack with --path-walk 1457.23 598.17 -59.0% 5313.13: repack size with --path-walk 2.2G 2.2G +0.0% Signed-off-by: Derrick Stolee <[email protected]>
bc37596
to
785dfb3
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
WIP
Thanks, -Stolee
Here is the range-diff from the equivalent patches in the previous version: