cat-file: add `%(objectmode)` avoid verifying submodules' OIDs #1689

dscho · 2024-03-11T10:37:38Z

The cat-file --batch command is very valuable in server settings, but so far it is missing a bit of functionality that would come in handy there.

For example, it is sometimes necessary to determine the object mode of a batch of tree objects' children.

This came up in $dayjob recently, and applies cleanly to v2.44.0.

cc: Jeff King <peff@peff.net>

Update the 'run_tests' test wrapper so that the first argument may refer to any specifier that uniquely identifies an object (e.g. a ref name, '<OID>:<path>', '<OID>^{<type>}', etc.), rather than only a full object ID. Also, add a test that uses a non-OID identifier, ensuring appropriate parsing in 'cat-file'.

Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

Add a formatting atom, used with the --batch-check/--batch-command options, that prints the octal representation of the object mode if a given revision includes that information, e.g. one that follows the format <tree-ish>:<path>. If the mode information does not exist, an empty string is printed instead.

Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

Submodules are strange creatures. They have OIDs, but the corresponding objects are not expected to be present in the current repository. Let's teach `cat-file` about this: This command should not even attempt to look up those objects, let alone declare them "missing".

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

dscho · 2024-03-11T18:55:05Z

/submit

gitgitgadget · 2024-03-11T18:56:12Z

Submitted as pull.1689.git.1710183362.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1689/dscho/cat-file-vs-submodules-v1

To fetch this version to local tag pr-1689/dscho/cat-file-vs-submodules-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1689/dscho/cat-file-vs-submodules-v1

gitgitgadget · 2024-03-11T21:46:23Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> The cat-file --batch command is very valuable in server settings, but so far
> it is missing a bit of functionality that would come in handy there.
>
> For example, it is sometimes necessary to determine the object mode of a
> batch of tree objects' children.

OK.

It is somewhat unsatisfying that --batch/--batch-check lacks so
much.  Even with %(objectmode) its nature of one-object-at-a-time
makes querying children of a large tree a chore, when you compare it
with something like "cat-file -p HEAD:" that allows you to grab the
needed information for all children with a single invocation.

This is orthogonal to what the patch wants to do, which is to enrich
the output side with more formatting, but I wonder if we want to
consider enriching the input side?  e.g. instead of feeding just a
single object name from the standard input of "cat-file
--batch/--batch-check", perhaps a syntax can say "Here I have the
object name for a tree-ish object, but please pretend that I gave
you all the objects contained within it", or something?

Thanks, will queue.

gitgitgadget · 2024-03-11T21:58:55Z

t/t1006-cat-file.sh

@@ -112,65 +112,66 @@ strlen () {



On the Git mailing list, Junio C Hamano wrote (reply to this):

"Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Victoria Dye <vdye@github.com>
>
> Update the 'run_tests' test wrapper so that the first argument may refer to
> any specifier that uniquely identifies an object (e.g. a ref name,
> '<OID>:<path>', '<OID>^{<type>}', etc.), rather than only a full object ID.
> Also, add a test that uses a non-OID identifier, ensuring appropriate
> parsing in 'cat-file'.
>
> Signed-off-by: Victoria Dye <vdye@github.com>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  t/t1006-cat-file.sh | 46 +++++++++++++++++++++++----------------------
>  1 file changed, 24 insertions(+), 22 deletions(-)
>
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index e0c6482797e..ac1f754ee32 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -112,65 +112,66 @@ strlen () {
>
>  run_tests () {
>      type=$1
> -    sha1=$2
> +    object_name=$2
> +    oid=$(git rev-parse --verify $object_name)
>      size=$3
>      content=$4
>      pretty_content=$5
>
> -    batch_output="$sha1 $type $size
> +    batch_output="$oid $type $size
>  $content"

As "object_name" is now allowed to be any name in the 'extended
SHA-1' syntax (cf. Documentation/revisions.txt), you should be a bit
more careful in quoting.

    oid=$(git rev-parse --verify "$object_name")

>      test_expect_success "$type exists" '
> -    git cat-file -e $sha1
> +    git cat-file -e $object_name
>      '

Likewise.  You may not currently use a path with SP in it to name a
tree object, e.g., "HEAD:Read Me.txt", but protecting against such a
pathname is a cheap investment for futureproofing.

Looking good otherwise.  Thanks.

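The quoting concern raised in the review above can be reproduced without Git at all. The following is a minimal sketch, not part of the patch: `count_args` is a hypothetical helper that stands in for a command such as `git rev-parse --verify` and only reports how many arguments it received.

```shell
#!/bin/sh
# count_args is a hypothetical stand-in for `git rev-parse --verify`;
# it only reports how many arguments the shell handed to it.
count_args () {
	echo $#
}

# An object name in '<rev>:<path>' form whose path contains a space.
object_name="HEAD:Read Me.txt"

# Unquoted: the shell splits the value at the space into two words.
unquoted=$(count_args $object_name)

# Quoted: the value is passed through as a single argument.
quoted=$(count_args "$object_name")

echo "unquoted expansion: $unquoted argument(s)"
echo "quoted expansion: $quoted argument(s)"
```

With the unquoted expansion the command would see `HEAD:Read` and `Me.txt` as two separate arguments, which is why quoting `"$object_name"` is the cheap safeguard suggested in the review.
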
gitgitgadget · 2024-03-11T22:17:14Z

Documentation/git-cat-file.txt

@@ -292,6 +292,11 @@ newline. The available atoms are:
 `objecttype`::


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index bbf851138ec..73bd78c0b63 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -272,6 +272,7 @@ struct expand_data {
>      struct object_id oid;
>      enum object_type type;
>      unsigned long size;
> +    unsigned short mode;
>      off_t disk_size;

We are not saving the storage used in this structure by using
"unsigned short" due to alignment, so I got curious where the choice
came from, but I cannot think of any sensible explanation.

Let's be consistent with the remainder of the system, like how
the mode is stored in the in-core index (ce_mode) and in the in-core
tree entry (name_entry.mode) and use "unsigned int" instead here.

> +#define EXPAND_DATA_INIT  { .mode = S_IFINVALID }

Thanks for knowing about and choosing to use the INVALID thing (I
would have naively chosen 0 without looking around enough and made
things inconsistent).

> +    } else if (is_atom("objectmode", atom, len)) {
> +        if (!data->mark_query && !(S_IFINVALID == data->mode))
> +            strbuf_addf(sb, "%06o", data->mode);

Nit.  I think

    if (!data->mark_query && data->mode != S_IFINVALID)

would be a more common way to write the same thing.

> @@ -766,7 +772,7 @@ static int batch_objects(struct batch_options *opt)
>  {
>      struct strbuf input = STRBUF_INIT;
>      struct strbuf output = STRBUF_INIT;
> -    struct expand_data data;
> +    struct expand_data data = EXPAND_DATA_INIT;
>      int save_warning;
>      int retval = 0;
>
> @@ -775,7 +781,6 @@ static int batch_objects(struct batch_options *opt)
>       * object_info to be handed to oid_object_info_extended for each
>       * object.
>       */
> -    memset(&data, 0, sizeof(data));

Nice to see this go with the _INIT thing.

> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index ac1f754ee32..6f25cc20ec6 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -114,9 +114,10 @@ run_tests () {
>      type=$1
>      object_name=$2
>      oid=$(git rev-parse --verify $object_name)
> -    size=$3
> -    content=$4
> -    pretty_content=$5
> +    mode=$3
> +    size=$4
> +    content=$5
> +    pretty_content=$6
>
>      batch_output="$oid $type $size
>  $content"

I wonder if appending $mode as an optional thing at the end would
have made the patch less noisy?  After all, the expectation above
that does not have $mode, and the tests that are expected to produce
output that match the expectation, do not have to change.  And the
existing invocation of run_tests that do not care about $mode do not
have to change.

But I guess if the damage is only with the above 7-lines (which
would become just 1 if we made mode the $6 last thing), it is not a
huge deal either way?

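To make the `%06o` formatting and the "invalid mode" sentinel from the review above more concrete, here is a rough shell emulation. It is a sketch only: `format_mode` is a hypothetical helper, and the value `0030000` is assumed to mirror Git's `S_IFINVALID`; the real logic lives in C in `builtin/cat-file.c`.

```shell
#!/bin/sh
# Assumption: 0030000 mirrors Git's S_IFINVALID sentinel, i.e. a value
# that no real tree entry carries as its mode.
S_IFINVALID=0030000

# format_mode is a hypothetical helper: print the mode as six octal
# digits, or print nothing when no mode information is available.
format_mode () {
	if [ "$1" != "$S_IFINVALID" ]
	then
		printf '%06o' "$1"
	fi
}

blob_mode=$(format_mode 0100644)        # regular file entry
tree_mode=$(format_mode 040000)         # subdirectory entry
no_mode=$(format_mode "$S_IFINVALID")   # name given without tree context

echo "blob: '$blob_mode'"
echo "tree: '$tree_mode'"
echo "none: '$no_mode'"
```

The zero padding is what keeps a tree entry at `040000` rather than `40000`, and the sentinel check is what yields the empty string described in the commit message when no mode is known.
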
On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

>> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
>> index ac1f754ee32..6f25cc20ec6 100755
>> --- a/t/t1006-cat-file.sh
>> +++ b/t/t1006-cat-file.sh
>> @@ -114,9 +114,10 @@ run_tests () {
>>      type=$1
>>      object_name=$2
>>      oid=$(git rev-parse --verify $object_name)
>> -    size=$3
>> -    content=$4
>> -    pretty_content=$5
>> +    mode=$3
>> +    size=$4
>> +    content=$5
>> +    pretty_content=$6
>>
>>      batch_output="$oid $type $size
>>  $content"
>
> I wonder if appending $mode as an optional thing at the end would
> have made the patch less noisy?  After all, the expectation above
> that does not have $mode, and the tests that are expected to produce
> output that match the expectation, do not have to change.  And the
> existing invocation of run_tests that do not care about $mode do not
> have to change.
>
> But I guess if the damage is only with the above 7-lines (which
> would become just 1 if we made mode the $6 last thing), it is not a
> huge deal either way?

Unfortunately, not really.  If we made the optional mode as the last
thing, and allow run_tests() to be called without an explicit "", it
may have avoided unnecessary conflicts with eb/hash-transition topic.

Interested folks can see how well these three patches play with
the other topic by trying to merge it to 'seen'.
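
The "optional mode as the last thing" idea could look roughly like the sketch below. `describe_object` is a hypothetical, cut-down stand-in for `run_tests`, not the helper from `t/t1006-cat-file.sh`; it only illustrates that existing five-argument callers keep working when the new parameter is appended and defaulted.

```shell
#!/bin/sh
# describe_object is a hypothetical, cut-down stand-in for run_tests:
# the five original positional parameters keep their positions, and
# the mode becomes an optional sixth one.
describe_object () {
	type=$1
	object_name=$2
	size=$3
	content=$4
	pretty_content=$5
	mode=${6:-}	# empty when the caller does not pass a mode

	if [ -n "$mode" ]
	then
		echo "$object_name $type $size $mode"
	else
		echo "$object_name $type $size"
	fi
}

# An existing-style call with five arguments needs no change ...
without_mode=$(describe_object blob hello 6 "Hello" "Hello")

# ... while a new call can append the mode as a sixth argument.
with_mode=$(describe_object blob HEAD:hello 6 "Hello" "Hello" 100644)

echo "$without_mode"
echo "$with_mode"
```

Because `${6:-}` expands to an empty string when the argument is absent, callers do not have to pass an explicit `""`, which is the property the reply above points to as a way of avoiding conflicts with other topics touching the same call sites.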

gitgitgadget · 2024-03-12T09:03:18Z

Documentation/git-cat-file.txt

@@ -407,6 +412,11 @@ Note also that multiple copies of an object may be present in the object
 database; in this case, it is undefined which copy's size or delta base


On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Mar 11, 2024 at 06:56:02PM +0000, Johannes Schindelin via GitGitGadget wrote:

> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
> index de29e6d79d9..69b50d2042f 100644
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -412,6 +412,11 @@ Note also that multiple copies of an object may be present in the object
>  database; in this case, it is undefined which copy's size or delta base
>  will be reported.
>
> +Submodules are handled specially in `git cat-file`, as the objects
> +corresponding to the recorded OIDs are not expected to be present in the
> +current repository. For that reason, submodules are reported as having
> +type `submodule` and mode 1600000 and all other fields are zeroed out.

I think there's an extra 0 in the mode here?

It may also be worth being more explicit about when Git knows something
is a submodule. Naively, reading the above I might think that:

  git ls-tree --format='%(objectname)' HEAD |
  git cat-file --batch-check

would do something special with submodules. But it can't, as there's no
context carried in just the objectname. This is obvious if you are
familiar with how Git works, but I'm not sure it would be for all end
users. So we could say something along the lines of:

  When `cat-file` is given a name within a tree that points to a
  submodule (e.g., `HEAD:my-submodule`), ...

-Peff
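
The point that a bare object name carries no tree context can be tried out in a throwaway repository. This is a sketch under stated assumptions: the repository, the path `sub`, and the object name in `fake_commit` are all made up, the repository is assumed to use SHA-1 object names, and the gitlink is recorded by hand with `git update-index --cacheinfo` rather than with a real submodule.

```shell
#!/bin/sh
set -e

# Throwaway repository; every name below is made up for the example.
repo=$(mktemp -d)
git init -q "$repo"
printf 'hello\n' >"$repo/Makefile"
git -C "$repo" add Makefile

# Record a gitlink (mode 160000) whose target commit is not present
# in this repository, which is the normal situation for a submodule.
fake_commit=1234567890123456789012345678901234567890
git -C "$repo" update-index --add --cacheinfo "160000,$fake_commit,sub"
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
	commit -q -m "blob plus gitlink"

# Resolving the path yields only the object name; the mode is gone.
oid=$(git -C "$repo" rev-parse --verify HEAD:sub)

# Given just that name, cat-file can only report that the object is
# not in the object store.
by_oid=$(echo "$oid" | git -C "$repo" cat-file --batch-check)

# The tree entry, in contrast, still records that "sub" is a gitlink.
entry=$(git -C "$repo" ls-tree HEAD -- sub)

echo "$by_oid"
echo "$entry"
rm -rf "$repo"
```

Here `by_oid` ends in `missing`, while `entry` starts with `160000 commit`: the information that the name refers to a submodule exists only in the tree entry, not in the object name fed to `cat-file`.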

gitgitgadget · 2024-03-12T09:03:20Z

User Jeff King <peff@peff.net> has been added to the cc: list.

gitgitgadget · 2024-03-12T09:03:21Z

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Mar 11, 2024 at 02:43:00PM -0700, Junio C Hamano wrote:

> It is somewhat unsatisfying that --batch/--batch-check lacks so
> much.  Even with %(objectmode) its nature of one-object-at-a-time
> makes querying children of a large tree a chore, when you compare it
> with something like "cat-file -p HEAD:" that allows you to grab the
> needed information for all children with a single invocation.
> 
> This is orthogonal to what the patch wants to do, which is to enrich
> the output side with more formatting, but I wonder if we want to
> consider enriching the input side?  e.g. instead of feeding just a
> single object name from the standard input of "cat-file
> --batch/--batch-check", perhaps a syntax can say "Here I have the
> object name for a tree-ish object, but please pretend that I gave
> you all the objects contained within it", or something?

That is an interesting direction. In practice I guess you might want to
expand trees (to show their contents) or perhaps commits (to traverse
history and/or look at their trees). And we already have tools to do
that.

So for example you can already do:

  git ls-tree --format='%(objectname) %(objectmode)' HEAD

Or if you wanted to mix-and-match with other cat-file placeholders, you
can do:

  git ls-tree --format='%(objectname) %(objectmode)' HEAD |
  git cat-file --batch-check='%(objectname) %(deltabase) %(rest)'

That is a little less efficient (we look up the object twice), but once
you are working with hex object ids it is not too bad (cat-file is
heavily optimized here). Of course in the long run I think we should
move to a future where the formatting code is shared, and you can just
ask ls-tree for deltabase if you want to.

I think leaving this to specialized tools like ls-tree gives them a lot
of flexibility that a special input mode to cat-file might find awkward.
For example, recurse vs non-recursive tree listing. Or filtering with
pathspecs. And of course when you get into commits and traversal, there
are many rev-list options. :)

The strategy so far has been making sure cat-file can efficiently take
in the output of these other tools to further describe objects. But
moving towards a unified output formatting model would be even better, I
think. In the meantime, I think cat-file learning %(objectmode) makes
sense for single names (rather than listing trees), and fortunately it
uses the same (obvious) name that ls-tree does, so we won't have a
problem unifying them later.

The patch itself looked reasonable to me, modulo the comments you
already made.

-Peff
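
The `%(rest)` hand-off described above can be exercised end to end in a scratch repository. This is a sketch only: the repository and file names are made up, and it reorders the default `git ls-tree` output with `awk` instead of using `--format`, purely so that it does not depend on a Git version that supports `ls-tree --format`.

```shell
#!/bin/sh
set -e

# Throwaway repository; every name below is made up for the example.
repo=$(mktemp -d)
git init -q "$repo"
printf 'hello\n' >"$repo/hello.txt"
git -C "$repo" add hello.txt
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
	commit -q -m "add hello.txt"

# ls-tree prints "<mode> <type> <oid><TAB><path>"; reorder that to
# "<oid> <mode>" so cat-file sees the object name first and passes the
# mode through untouched as %(rest).
out=$(git -C "$repo" ls-tree HEAD |
	awk '{ print $3, $1 }' |
	git -C "$repo" cat-file \
		--batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)')

echo "$out"
rm -rf "$repo"
```

Everything after the first whitespace on an input line is echoed back via `%(rest)`, so the mode computed by `ls-tree` ends up next to the type and size that `cat-file` looks up itself.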

gitgitgadget · 2024-03-12T18:39:18Z

Documentation/git-cat-file.txt

@@ -407,6 +412,11 @@ Note also that multiple copies of an object may be present in the object
 database; in this case, it is undefined which copy's size or delta base


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +Submodules are handled specially in `git cat-file`, as the objects
> +corresponding to the recorded OIDs are not expected to be present in the
> +current repository. For that reason, submodules are reported as having
> +type `submodule` and mode 1600000 and all other fields are zeroed out.

While the above may not be technically wrong per-se, I am not sure
if that is the more important part of what we want to tell our
users.

For example, "git ls-tree HEAD -- sha1collisiondetection" reports
"160000 commit ...object.name.... sha1collisiondetection".  Is it
correct to say ...

    Submodules are handled specially in `git ls-tree`, as the objects
    corresponding to the recorded OIDs are not expected to be present
    in the current repository.

...?  I do not think so.

For the same reason, as an explanation for the reason why "git
cat-file -t :sha1collisiondetection" just reports "submodule", the
new text does not sit well.

I actually have to wonder if the new behaviour proposed by this
patch is a solution that is in search of a problem, or trying to
solve an unstated problem in a wrong way.

    O=$(git rev-parse --verify :sha1collisiondetection)
    git cat-file -t "$O"

should fail because the object whose name is $O is not available.
Why should then this succeed and give a different result?

    git cat-file -t :sha1collisiondetection

The "cat-file" command is about objects.  While object's type may
sometimes be inferrable (by being contained in a tree), if the user
asks us to determine the type of the object, we should actually hit
the object store, whether the commit object in question happens to
be on our history or somebody else's history that our gitlink points
at.

So, I am not yet convinced that I should take this patch.  Previous
two steps looked good, though.

Thanks.

> index 73bd78c0b63..c59ad682d1f 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -128,7 +128,9 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
>      switch (opt) {
>      case 't':
>          oi.type_name = &sb;
> -        if (oid_object_info_extended(the_repository, &oid, &oi, flags) < 0)
> +        if (obj_context.mode == S_IFGITLINK)
> +            strbuf_addstr(&sb, "submodule");
> +        else if (oid_object_info_extended(the_repository, &oid, &oi, flags) < 0)
>              die("git cat-file: could not get object info");

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Mar 12, 2024 at 11:35:16AM -0700, Junio C Hamano wrote:

> I actually have to wonder if the new behaviour proposed by this
> patch is a solution that is in search of a problem, or trying to
> solve an unstated problem in a wrong way.
>
>     O=$(git rev-parse --verify :sha1collisiondetection)
>     git cat-file -t "$O"
>
> should fail because the object whose name is $O is not available.
> Why should then this succeed and give a different result?
>
>     git cat-file -t :sha1collisiondetection
>
> The "cat-file" command is about objects.  While object's type may
> sometimes be inferrable (by being contained in a tree), if the user
> asks us to determine the type of the object, we should actually hit
> the object store, whether the commit object in question happens to
> be on our history or somebody else's history that our gitlink points
> at.
>
> So, I am not yet convinced that I should take this patch.  Previous
> two steps looked good, though.

I'm not sure about "-t" in particular, but for batch output, I think if
we stop at patch 2 it would be impossible to tell the difference between
a submodule entry and a corrupt repo (or bad request). E.g., if I do
this:

  (echo HEAD:Makefile; echo HEAD:sha1collisiondetection) |
  git cat-file --batch-check='%(objectname) %(objectmode)'

after only patch 2, I'd get:

  4e255c81f22386389c7460d8f5e59426673b5a5a 100644
  HEAD:sha1collisiondetection missing

We can't tell if HEAD didn't resolve, or it doesn't have that path, or
if it's a regular blob entry and the repository is corrupt. Whereas
after patch 3, we get:

  4e255c81f22386389c7460d8f5e59426673b5a5a 100644
  855827c583bc30645ba427885caa40c5b81764d2 160000

and the mode tells us that we resolved it to a submodule.

The current behavior is not too surprising for cat-file, since its
whole purpose is to give you information about the objects themselves,
and we don't have one here. But with this %(objectmode) format, we're
really moving into a realm of "resolve this name for me and show me the
context". We don't care about the details of the object at all!

I think you could make an argument that the problem is shoe-horning new,
slightly-mismatched functionality into cat-file. But there are lots of
practical reasons to want to do so, as we discussed elsewhere. Since
gitlinks are the only place where we'd expect an object to be missing,
"simulating" them here isn't too bad. But I suspect there's a more
general solution where cat-file learns to print dummy values for any
missing object, letting the caller see what we _could_ find out. And
then the submodule case just falls out naturally. I doubt we could make
it the default for historical compatibility; we'd need a new option.

This is all speculative on my part, of course. Probably Dscho or
Victoria can explain their use case better. :)

-Peff

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes:

> I think you could make an argument that the problem is
> shoe-horning new, slightly-mismatched functionality into
> cat-file.  But there are lots of practical reasons to want to do
> so, as we discussed elsewhere.  Since gitlinks are the only place
> where we'd expect an object to be missing, "simulating" them here
> isn't too bad.

100% agreed.  This is something we should be asking about "HEAD:"
tree object, not about "HEAD:sha1collisiondetection" object, if we
are to ask cat-file.  After all "cat-file -p HEAD:" tells us that
the thing is a submodule already.

But unfortunately the "--batch" thing is limited to "give me an
object and what you want to know about the object, and I'll tell you
what I know about it" exchange, so it is a very bad match when you
cannot really give it an object (which you do not have, like the
target of the gitlink).  So...

> But I suspect there's a more
> general solution where cat-file learns to print dummy values for any
> missing object, letting the caller see what we _could_ find out. And
> then the submodule case just falls out naturally. I doubt we could make
> it the default for historical compatibility; we'd need a new option.

... "--batch" obviously needs to be extended, and %(objectmode) may
be one direction to do so, but it would also work to allow us to ask
about "HEAD:" and what it has at paths, which match a pathspec
"sha1collisiondetection", an equivalent to give "cat-file --batch" a
command to drive "ls-tree".

> This is all speculative on my part, of course. Probably Dscho or
> Victoria can explain their use case better. :)

Likewise.

gitgitgadget · 2024-03-12T19:33:19Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes:

> That is an interesting direction. In practice I guess you might want to
> expand trees (to show their contents) or perhaps commits (to traverse
> history and/or look at their trees). And we already have tools to do
> that.
>
> So for example you can already do:
>
>   git ls-tree --format='%(objectname) %(objectmode)' HEAD
>
> Or if you wanted to mix-and-match with other cat-file placeholders, you
> can do:
>
>   git ls-tree --format='%(objectname) %(objectmode)' HEAD |
>   git cat-file --batch-check='%(objectname) %(deltabase) %(rest)'
>
> That is a little less efficient (we look up the object twice), but once
> you are working with hex object ids it is not too bad (cat-file is
> heavily optimized here). Of course in the long run I think we should
> move to a future where the formatting code is shared, and you can just
> ask ls-tree for deltabase if you want to.

I was imagining more about a use case "cat-file --batch" was
originally designed for---having a long-running single process
and ask any and all questions you have about various objects in the
object database by interacting with it.  So "yes, ls-tree can
already give us that information", while it is true, shoots at a
different direction from what I had in mind.

> The strategy so far has been making sure cat-file can efficiently take
> in the output of these other tools to further describe objects. But
> moving towards a unified output formatting model would be even better, I
> think. In the meantime, I think cat-file learning %(objectmode) makes
> sense for single names (rather than listing trees), and fortunately it
> uses the same (obvious) name that ls-tree does, so we won't have a
> problem unifying them later.

Yes, enriching the output format side is an orthogonal issue from
the input side, and the %(objectmode) thing that gives a piece of
information that is additionally available on top of the various
pieces of information about the object itself does make sense.

> The patch itself looked reasonable to me, modulo the comments you
> already made.
>
> -Peff

gitgitgadget · 2024-03-12T22:06:32Z

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Mar 12, 2024 at 12:28:48PM -0700, Junio C Hamano wrote:

> > Or if you wanted to mix-and-match with other cat-file placeholders, you
> > can do:
> >
> >   git ls-tree --format='%(objectname) %(objectmode)' HEAD |
> >   git cat-file --batch-check='%(objectname) %(deltabase) %(rest)'
> >
> > That is a little less efficient (we look up the object twice), but once
> > you are working with hex object ids it is not too bad (cat-file is
> > heavily optimized here). Of course in the long run I think we should
> > move to a future where the formatting code is shared, and you can just
> > ask ls-tree for deltabase if you want to.
> 
> I was imagining more about a use case "cat-file --batch" was
> originally designed for---having a long-running single process
> and ask any and all questions you have about various objects in the
> object database by interacting with it.  So "yes, ls-tree can
> already give us that information", while it is true, shoots at a
> different direction from what I had in mind.

Ah, yeah, that is one thing that cat-file does that no other part of the
system will. I do wonder in the long term if it is easier to teach
cat-file everything that all of the other commands can do, or to teach
all of the other commands some way of handling multiple requests in a
single process. ;)

(All obviously orthogonal to this patch series).

-Peff

dscho self-assigned this Mar 11, 2024

vdye and others added 3 commits March 11, 2024 11:56

dscho force-pushed the cat-file-vs-submodules branch from fd2f353 to 951f733 Compare March 11, 2024 10:58

dscho changed the title ~~cat-file: avoid verifying submodules' OIDs~~ cat-file: add %(objectmode) avoid verifying submodules' OIDs Mar 11, 2024

gitgitgadget bot reviewed Mar 11, 2024

gitgitgadget bot reviewed Mar 12, 2024

dscho mentioned this pull request Mar 22, 2024

Preserve user timezone gitgitgadget/gitgitgadget#1576

Closed
