Hi,
Your extension looks quite good. I did an extensive review of your code over at https://www.mediawiki.org/wiki/User:Bawolff/review/Git2Pages . Well that page may look very long, most of the issues listed there are minor nitpicks.
- GitRepository will do a sparse checkout on the information, that is,
it will clone the repository but only keep the specified file (this was implemented to save space)
Don't think that works the way you think it does. The entire repo is still there (in fact, the entire history of the repo is there).
When I was first testing your extension, I tried to load a snippet from mediawiki/core.git . The process of cloning the repo basically made my web server unresponsive (eventually I got bored and restarted apache. I'm not sure why wfShellExec didn't kick in before that point). I think you need to do a shallow clone since you only need one revision of the file.
- The repositories will be cloned into a folder that is a md5 hash of
the url + branch to make sure that the program isn't cloning a ton of copies of the same repository
Martijn wrote:
Why hash it, and not just keep the url + branch encoded to some charset that is a valid path, saving rare yet hairy collisions?
I actually like the hashing. The chance of a collision happening seems so unlikely that its not worth worrying about (If it is a concern, one could use sha256 encoded as base36 to make collisions even less likely). It also prevents the somewhat unlikely (but more likely than a collision imo) of a url+branch combination that is > 255 characters.
One thing I would recommend is prefix the directory names with something like mw-git2page-<hash> so that if a system admin sees all these entries, s/he would know where they are coming from.
Note, using directory names that can be pre-determined in a public /tmp directory is a bit dangerous in a shared server. Another user could make the directory, put something malicious in it (for example an evil post-merge hook), and then have your script use the malicious data. One way around that could be to add the $wgSecretKey (and some salt) to the variables that generate the hash that becomes the directory name.
This is my baseline program. It works (for me at least). I have a few ideas of what to work on next, but I would really like to know if I'm going in the right direction. Is this something you would use? How does my code look, is the implementation up to the MediaWiki coding standard? buttt You can find the progression of the code on gerrit[3].
Your code actually looks quite good. Better then much of the code that has come out of "mentorship" type projects in the past (imho). I put quite extensive comments on the wiki page I linked above.
Here are some ideas of what I might want to implement while still on the internship:
- Instead of a <pre> tag, encase it in a <syntaxhighlight lang> tag if
it's code, maybe add a flag for user to supply the language
That would be cool.
- Keep a database of all the repositories that a wiki has (though not
sure how to handle deletions)
One possibility with that (although quite a bit of work) would be to have a post commit hook in a git server that caused the wiki page to be re-rendered whenever a snippet it includes changes.
As for deletion - Check out how LinksUpdate stuff works, and how extensions like GlobalUsage that tie into it work. In particular the various hooks in that file: https://www.mediawiki.org/wiki/Category:MediaWiki_hooks_included_in_LinksUpd... If you meant page deletion (instead of snippet deletion) see the ArticleDeleteComplete hook.
Here are some problems I might face:
- If I update the working tree each time a file from the same
repository is added, then the line numbers may not match the old file
- Should I be periodically updating the repositories or perhaps keep
multiple snapshots of the same repository
I think its expected that if somebody puts {{#snippet:...}} on a page, and then someone commits an update to the file being transcluded by {{#snippet}}, that the {{#snippet:...}} would eventually update (aka on the next page parse)
Having an option to include a snippet from a specific commit - {{#snippet:filename=foo|repository=bar|revision=bf75623112354}} would be cool (or could the branch option be used to do this already?)
Also, having an option to start at some line based on some pattern (Start at line that has word foo on it) might be cool.
- Cloning an entire repository and keeping only one file does not seem
ideal, but I've yet to find a better solution, the more repositories being used concurrently the bigger an issue this might be
Yes that's a problem, especially on a big repository. At the very least it should be a shallow clone. Note when you clone the repo - you keep the entire repo, even if only some files are shown.
- I'm also worried about security implications of my program. Security
isn't my area of expertise, and I would definitely appreciate some input from people with a security background
I've commented quite a bit about security on the wikipage I linked to. However they key points in this area that I saw: *Option to only allow a whitelist of repos (for example, only allow wmf git repos) *There's was a couple places where escaping was missing, but that's an easy fix (and most of the time you did a good job with that). *There's the issue about someone pre-creating a directory in /tmp with malicious data that I mentioned above. *URLs should be validated.
Of course, other people may see things I don't.
The last issue that I'm concerned about - is what if two instances of this extension try and get the same snippet at the same time. Will the git commands step over each other? Does some sort of filesystem based lock have to be implemented?
Anyhow, good work on your extension. I hope that one day Manual:$wgFoo pages will have {{#snippet}}'s on them listing the actual default values directly from includes/DefaultSettings.php
Cheers, bawolff
p.s. if any of that was confusing or unclear, please let me know.