Hi,
Your extension looks quite good. I did an extensive review of your
code over at
https://www.mediawiki.org/wiki/User:Bawolff/review/Git2Pages
. Well that page may look very long, most of the issues listed there
are minor nitpicks.
- GitRepository will do a sparse checkout on the
information, that is,
it will clone the repository but only keep the specified file (this
was implemented to save space)
Don't think that works the way you think it does. The entire repo is
still there (in fact, the entire history of the repo is there).
When I was first testing your extension, I tried to load a snippet
from mediawiki/core.git . The process of cloning the repo basically
made my web server unresponsive (eventually I got bored and restarted
apache. I'm not sure why wfShellExec didn't kick in before that
point). I think you need to do a shallow clone since you only need one
revision of the file.
- The repositories will be cloned into a folder that
is a md5 hash of
the url + branch to make sure that the program isn't cloning a ton of
copies of the same repository
Martijn wrote:
>Why hash it, and not just keep the url + branch
encoded to some charset
>that is a valid path, saving rare yet hairy collisions?
I actually like the hashing. The chance of a collision happening seems
so unlikely that its not worth worrying about (If it is a concern, one
could use sha256 encoded as base36 to make collisions even less
likely). It also prevents the somewhat unlikely (but more likely than
a collision imo) of a url+branch combination that is > 255 characters.
One thing I would recommend is prefix the directory names with
something like mw-git2page-<hash> so that if a system admin sees all
these entries, s/he would know where they are coming from.
Note, using directory names that can be pre-determined in a public
/tmp directory is a bit dangerous in a shared server. Another user
could make the directory, put something malicious in it (for example
an evil post-merge hook), and then have your script use the malicious
data. One way around that could be to add the $wgSecretKey (and some
salt) to the variables that generate the hash that becomes the
directory name.
This is my baseline program. It works (for me at least). I have a few
ideas of what to work on next, but I would really like to know if I'm
going in the right direction. Is this something you would use? How
does my code look, is the implementation up to the MediaWiki coding
standard? buttt You can find the progression of the code on
gerrit[3].
Your code actually looks quite good. Better then much of the code that
has come out of "mentorship" type projects in the past (imho). I put
quite extensive comments on the wiki page I linked above.
Here are some ideas of what I might want to implement while still on
the internship:
- Instead of a <pre> tag, encase it in a <syntaxhighlight lang> tag if
it's code, maybe add a flag for user to supply the language
That would be cool.
- Keep a database of all the repositories that a wiki
has (though not
sure how to handle deletions)
One possibility with that (although quite a bit of work) would be to
have a post commit hook in a git server that caused the wiki page to
be re-rendered whenever a snippet it includes changes.
As for deletion - Check out how LinksUpdate stuff works, and how
extensions like GlobalUsage that tie into it work. In particular the
various hooks in that file:
https://www.mediawiki.org/wiki/Category:MediaWiki_hooks_included_in_LinksUp…
If you meant page deletion (instead of snippet deletion) see the
ArticleDeleteComplete hook.
Here are some problems I might face:
- If I update the working tree each time a file from the same
repository is added, then the line numbers may not match the old file
- Should I be periodically updating the repositories or perhaps keep
multiple snapshots of the same repository
I think its expected that if somebody puts {{#snippet:...}} on a page,
and then someone commits an update to the file being transcluded by
{{#snippet}}, that the {{#snippet:...}} would eventually update (aka
on the next page parse)
Having an option to include a snippet from a specific commit -
{{#snippet:filename=foo|repository=bar|revision=bf75623112354}} would
be cool (or could the branch option be used to do this already?)
Also, having an option to start at some line based on some pattern
(Start at line that has word foo on it) might be cool.
- Cloning an entire repository and keeping only one
file does not seem
ideal, but I've yet to find a better solution, the more repositories
being used concurrently the bigger an issue this might be
Yes that's a problem, especially on a big repository. At the very
least it should be a shallow clone. Note when you clone the repo - you
keep the entire repo, even if only some files are shown.
- I'm also worried about security implications of
my program. Security
isn't my area of expertise, and I would definitely appreciate some
input from people with a security background
I've commented quite a bit about security on the wikipage I linked to.
However they key points in this area that I saw:
*Option to only allow a whitelist of repos (for example, only allow
wmf git repos)
*There's was a couple places where escaping was missing, but that's an
easy fix (and most of the time you did a good job with that).
*There's the issue about someone pre-creating a directory in /tmp with
malicious data that I mentioned above.
*URLs should be validated.
Of course, other people may see things I don't.
The last issue that I'm concerned about - is what if two instances of
this extension try and get the same snippet at the same time. Will the
git commands step over each other? Does some sort of filesystem based
lock have to be implemented?
Anyhow, good work on your extension. I hope that one day Manual:$wgFoo
pages will have {{#snippet}}'s on them listing the actual default
values directly from includes/DefaultSettings.php
Cheers,
bawolff
p.s. if any of that was confusing or unclear, please let me know.