Hi everyone,
My name is Teresa (or terrrydactyl if you've seen me on IRC) and I've been interning at Wikimedia for the last few months through the Outreach Program for Women[1]. My project, Git2Pages[2], is an extension to pull snippets of code/text from a git repository. I've been working hard on learning PHP and the MediaWiki framework/development cycle. My internship is ending soon and I wanted to reach out to the community and ask for feedback.
Here's what the program currently does:
- User supplies a (git) url, filename, branch, startline, and endline using the #snippet tag
- Git2Pages.body.php will validate the information and then pass the inputs on to my library, GitRepository.php
- GitRepository will do a sparse checkout based on that information, that is, it will clone the repository but only keep the specified file (this was implemented to save space; there's a rough sketch of the underlying git commands after this list)
- The repositories will be cloned into a folder that is an md5 hash of the url + branch, to make sure that the program isn't cloning a ton of copies of the same repository
- If the repository already exists, the file will be added to the sparse-checkout file and the program will update the working tree
- Once the repo is cloned, the program will go and yank the lines that the user requested and return the text encased in a <pre> tag.
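For the curious, the sparse checkout step boils down to roughly the following (an untested sketch of the idea, not the actual GitRepository.php code; the function and variable names are placeholders):

    // Rough sketch of the sparse checkout flow, shelling out to git with
    // MediaWiki's helpers. $url, $branch, $file and $baseDir are placeholders.
    function sparseCheckout( $url, $branch, $file, $baseDir ) {
        $dir = $baseDir . '/' . md5( $url . $branch ); // one clone per url + branch
        if ( !is_dir( $dir ) ) {
            mkdir( $dir, 0700, true );
            wfShellExec( 'cd ' . wfEscapeShellArg( $dir ) .
                ' && git init' .
                ' && git remote add origin ' . wfEscapeShellArg( $url ) .
                ' && git config core.sparseCheckout true' );
        }
        // Add the requested file to the sparse-checkout list, then update the tree.
        file_put_contents( $dir . '/.git/info/sparse-checkout', $file . "\n", FILE_APPEND );
        wfShellExec( 'cd ' . wfEscapeShellArg( $dir ) .
            ' && git fetch origin ' . wfEscapeShellArg( $branch ) .
            ' && git checkout FETCH_HEAD' );
        return $dir;
    }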
This is my baseline program. It works (for me at least). I have a few ideas of what to work on next, but I would really like to know if I'm going in the right direction. Is this something you would use? How does my code look: is the implementation up to the MediaWiki coding standards? You can find the progression of the code on gerrit[3].
Here are some ideas of what I might want to implement while still on the internship:
- Instead of a <pre> tag, encase it in a <syntaxhighlight lang> tag if it's code, maybe with a flag for the user to supply the language (rough sketch after this list)
- Keep a database of all the repositories that a wiki has (though I'm not sure how to handle deletions)
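For the <syntaxhighlight> idea, I'm picturing something like this in the parser function (untested; $snippet and $lang are placeholder variable names):

    // If the user supplied a language, return wikitext that the SyntaxHighlight
    // extension will render; otherwise keep the current escaped <pre> output.
    if ( $lang !== '' ) {
        // Keep the language name sane; note that a snippet containing a literal
        // </syntaxhighlight> would still need special handling.
        $lang = preg_replace( '/[^a-zA-Z0-9_-]/', '', $lang );
        return array( "<syntaxhighlight lang=\"$lang\">\n$snippet\n</syntaxhighlight>",
            'noparse' => false );
    }
    return array( Html::element( 'pre', array(), $snippet ),
        'noparse' => true, 'isHTML' => true );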
Here are some problems I might face:
- If I update the working tree each time a file from the same repository is added, then the line numbers may not match the old file
- Should I be periodically updating the repositories, or perhaps keep multiple snapshots of the same repository?
- Cloning an entire repository and keeping only one file does not seem ideal, but I've yet to find a better solution; the more repositories being used concurrently, the bigger an issue this might be
- I'm also worried about security implications of my program. Security isn't my area of expertise, and I would definitely appreciate some input from people with a security background
Thanks for taking the time to read this and thanks in advance for any feedback, bug reports, etc.
Have a great day, Teresa http://www.mediawiki.org/wiki/User:Chot
[1] https://www.mediawiki.org/wiki/Outreach_Program_for_Women [2] http://www.mediawiki.org/wiki/Extension:Git2Pages [3] https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/Git2Pages,n,...
On Mar 27, 2013 1:31 AM, "Teresa Cho" tcho708@gmail.com wrote:
Hi everyone,
My name is Teresa (or terrrydactyl if you've seen me on IRC) and I've been interning at Wikimedia for the last few months through the Outreach Program for Women[1]. My project, Git2Pages[2], is an extension to pull snippets of code/text from a git repository. I've been working hard on learning PHP and the MediaWiki framework/development cycle. My internship is ending soon and I wanted to reach out to the community and ask for feedback.
Cool stuff!
Here's what the program currently does:
- User supplies (git) url, filename, branch, startline, endline using
the #snippet tag
- Git2Pages.body.php will validate the information and then pass on
the inputs into my library, GitRepository.php
- GitRepository will do a sparse checkout on the information, that is,
it will clone the repository but only keep the specified file (this was implemented to save space)
- The repositories will be cloned into a folder that is a md5 hash of
the url + branch to make sure that the program isn't cloning a ton of copies of the same repository
Why hash it, and not just keep the url + branch encoded to some charset that is a valid path, saving rare yet hairy collisions?
- If the repository already exists, the file will be added to the
sparse-checkout file and the program will update the working tree
Will there be a re-checkout for a duplicate request? Will the cache of files ever be cleaned?
Why hash it, and not just keep the url + branch encoded to some charset that is a valid path, saving rare yet hairy collisions?
Hmm, good point. Originally, my program did a plain git clone, which clones into a folder named after the repository, so two repositories named MyRepo would cause problems. We (my mentor Sebastien, aka Dereckson, and I) added the md5 hash as a way to combat that, but I suppose we didn't think of creating a folder from the url itself. I think this was back when we were still considering local repositories: my program can take urls like /some/path/to/repo, in which case you may get different local urls? Thinking about it, it doesn't really make sense. I think this is just one of those scenarios where we thought of a solution that worked, and the program evolved in a way where it wasn't necessary anymore. Changing the hash to a path-safe encoding of the url should be an easy fix.
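Something like this is what I have in mind (just an illustration, not actual code):

    // Illustration only: a path-safe directory name derived from url + branch
    // instead of an md5 hash. rawurlencode() never emits '/', so the result is
    // a single safe path component.
    $dirName = rawurlencode( $url . '|' . $branch );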
Speaking of local repositories, another idea was to disable the option to have a local repository as the git url as a safety precaution or something. Should I do away with the local repo feature entirely or should I set a flag and have it be off by default, but allow the user to choose to turn it on?
Will there be a re-checkout for a duplicate request? Will the cache of files ever be cleaned?
I did a little test, and it looks like if you duplicate the request, the original content doesn't change. So even if the file has been changed remotely, the local copy doesn't change. Additionally, when you add a file to the sparse-checkout, the `git read-tree -mu HEAD` command only seems to pull the added file but doesn't seem to change the contents of a file that's already in the repo. So right now the only way to get new content is to delete the repository and rerun it. This is a good thing to work on. I'm not sure what you mean by the cache of the files being cleaned. Can you elaborate on the scenario? Then I can give you a better answer.
Thanks for your feedback! :)
-- Teresa
Hi,
Your extension looks quite good. I did an extensive review of your code over at https://www.mediawiki.org/wiki/User:Bawolff/review/Git2Pages . While that page may look very long, most of the issues listed there are minor nitpicks.
- GitRepository will do a sparse checkout on the information, that is,
it will clone the repository but only keep the specified file (this was implemented to save space)
I don't think that works the way you think it does. The entire repo is still there (in fact, the entire history of the repo is there).
When I was first testing your extension, I tried to load a snippet from mediawiki/core.git . The process of cloning the repo basically made my web server unresponsive (eventually I got bored and restarted apache; I'm not sure why wfShellExec's limits didn't kick in before that point). I think you need to do a shallow clone, since you only need one revision of the file.
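For example, the fetch step could be limited to a single revision, along these lines (untested sketch; $dir and $branch are placeholders):

    // Fetch only the tip of the requested branch instead of the full history,
    // then check it out (still combined with the sparse-checkout settings).
    wfShellExec( 'cd ' . wfEscapeShellArg( $dir ) .
        ' && git fetch --depth 1 origin ' . wfEscapeShellArg( $branch ) .
        ' && git checkout FETCH_HEAD' );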
- The repositories will be cloned into a folder that is a md5 hash of
the url + branch to make sure that the program isn't cloning a ton of copies of the same repository
Martijn wrote:
Why hash it, and not just keep the url + branch encoded to some charset that is a valid path, saving rare yet hairy collisions?
I actually like the hashing. The chance of a collision happening seems so unlikely that it's not worth worrying about (if it is a concern, one could use sha256 encoded as base36 to make collisions even less likely). It also prevents the somewhat unlikely case (but more likely than a collision, imo) of a url + branch combination that is > 255 characters.
One thing I would recommend is to prefix the directory names with something like mw-git2page-<hash>, so that if a system admin sees all these entries, s/he would know where they are coming from.
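In other words, something like (illustration only):

    // Illustration: a prefixed, hashed directory name for a url + branch pair
    // (hex here; base36 would just make it a bit shorter).
    $dirName = 'mw-git2page-' . hash( 'sha256', $url . '|' . $branch );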
Note, using directory names that can be pre-determined in a public /tmp directory is a bit dangerous in a shared server. Another user could make the directory, put something malicious in it (for example an evil post-merge hook), and then have your script use the malicious data. One way around that could be to add the $wgSecretKey (and some salt) to the variables that generate the hash that becomes the directory name.
This is my baseline program. It works (for me at least). I have a few ideas of what to work on next, but I would really like to know if I'm going in the right direction. Is this something you would use? How does my code look: is the implementation up to the MediaWiki coding standards? You can find the progression of the code on gerrit[3].
Your code actually looks quite good. Better than much of the code that has come out of "mentorship" type projects in the past (imho). I put quite extensive comments on the wiki page I linked above.
Here are some ideas of what I might want to implement while still on the internship:
- Instead of a <pre> tag, encase it in a <syntaxhighlight lang> tag if
it's code, maybe add a flag for user to supply the language
That would be cool.
- Keep a database of all the repositories that a wiki has (though not
sure how to handle deletions)
One possibility with that (although quite a bit of work) would be to have a post-commit hook on the git server that causes the wiki page to be re-rendered whenever a snippet it includes changes.
As for deletion: check out how the LinksUpdate stuff works, and how extensions like GlobalUsage that tie into it work. In particular, look at the various hooks in that file: https://www.mediawiki.org/wiki/Category:MediaWiki_hooks_included_in_LinksUpd... If you meant page deletion (instead of snippet deletion), see the ArticleDeleteComplete hook.
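To make that concrete, the wiring would look roughly like this (only the hook names are real; the Git2PagesHooks handlers are made-up placeholders):

    // Hypothetical hook registrations for tracking usage and cleaning up.
    $wgHooks['LinksUpdateComplete'][] = 'Git2PagesHooks::onLinksUpdateComplete';
    $wgHooks['ArticleDeleteComplete'][] = 'Git2PagesHooks::onArticleDeleteComplete';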
Here are some problems I might face:
- If I update the working tree each time a file from the same
repository is added, then the line numbers may not match the old file
- Should I be periodically updating the repositories or perhaps keep
multiple snapshots of the same repository
I think it's expected that if somebody puts {{#snippet:...}} on a page, and then someone commits an update to the file being transcluded by {{#snippet}}, the {{#snippet:...}} would eventually update (i.e. on the next page parse).
Having an option to include a snippet from a specific commit - {{#snippet:filename=foo|repository=bar|revision=bf75623112354}} would be cool (or could the branch option be used to do this already?)
Also, having an option to start at some line based on some pattern (Start at line that has word foo on it) might be cool.
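Something as simple as this would probably do for a first pass (illustrative only; $lines and $pattern are placeholder names):

    // Find the 1-based number of the first line containing $pattern.
    $startLine = false;
    foreach ( $lines as $i => $line ) {
        if ( strpos( $line, $pattern ) !== false ) {
            $startLine = $i + 1;
            break;
        }
    }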
- Cloning an entire repository and keeping only one file does not seem
ideal, but I've yet to find a better solution, the more repositories being used concurrently the bigger an issue this might be
Yes, that's a problem, especially on a big repository. At the very least it should be a shallow clone. Note that when you clone the repo, you keep the entire repo, even if only some files are shown.
- I'm also worried about security implications of my program. Security
isn't my area of expertise, and I would definitely appreciate some input from people with a security background
I've commented quite a bit about security on the wiki page I linked to. However, the key points in this area that I saw:
* Option to only allow a whitelist of repos (for example, only allow wmf git repos)
* There were a couple of places where escaping was missing, but that's an easy fix (and most of the time you did a good job with that)
* There's the issue about someone pre-creating a directory in /tmp with malicious data that I mentioned above
* URLs should be validated
Of course, other people may see things I don't.
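For the whitelist and URL validation points, I'm imagining something along these lines (sketch only; $wgGit2PagesAllowedRepoPattern is a made-up config name):

    // Reject anything that isn't an http(s)/git URL (also blocks local paths).
    if ( !preg_match( '!^(https?|git)://!i', $url ) ) {
        return '<span class="error">Invalid repository URL</span>';
    }
    // Optional whitelist: a regex of allowed repositories, or false to allow all.
    global $wgGit2PagesAllowedRepoPattern;
    if ( $wgGit2PagesAllowedRepoPattern !== false &&
        !preg_match( $wgGit2PagesAllowedRepoPattern, $url )
    ) {
        return '<span class="error">Repository not on the allowed list</span>';
    }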
The last issue that I'm concerned about - is what if two instances of this extension try and get the same snippet at the same time. Will the git commands step over each other? Does some sort of filesystem based lock have to be implemented?
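If it comes to that, a simple per-repository lock file might be enough, e.g. (untested sketch; $dir is a placeholder for the clone directory):

    // Take an exclusive lock on a per-repository lock file so concurrent
    // parses don't run git in the same clone directory at the same time.
    $lock = fopen( $dir . '.lock', 'c' );
    if ( $lock && flock( $lock, LOCK_EX ) ) {
        // ... run the git commands here ...
        flock( $lock, LOCK_UN );
    }
    if ( $lock ) {
        fclose( $lock );
    }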
Anyhow, good work on your extension. I hope that one day Manual:$wgFoo pages will have {{#snippet}}'s on them listing the actual default values directly from includes/DefaultSettings.php
Cheers, bawolff
p.s. if any of that was confusing or unclear, please let me know.
On Wed, 27 Mar 2013 00:19:53 -0700, Brian Wolff bawolff@gmail.com wrote:
Note, using directory names that can be pre-determined in a public /tmp directory is a bit dangerous in a shared server. Another user could make the directory, put something malicious in it (for example an evil post-merge hook), and then have your script use the malicious data. One way around that could be to add the $wgSecretKey (and some salt) to the variables that generate the hash that becomes the directory name.
Please don't. I've been trying to slowly move us away from depending on $wgSecretKey's secrecy for security. Eventually I hope to try and eliminate dependence on it from extensions too. And in an ideal case, eventually stop setting it in the installer (unless you have an edge case where a little more entropy for CryptRand could be useful; or maybe not, I need to double check which case that was, but it might not even exist anymore with our version requirements).
I see people over and over asking for help and inadvertently handing that information which is supposed to remain secret right over in public.
Instead of trying to make the paths a secret, just don't put that data inside of public /tmp directories. I recommend setting your git directory config to false and, in an extension setup function, setting it to some path based on the upload directory. This is basically what we used to do with $wgTmpDirectory, which was used by CACHE_DBA.
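i.e. roughly this in the extension (the config variable and function names here are made up):

    // Placeholder names: $wgGit2PagesDirectory and efGit2PagesSetup are made up.
    $wgGit2PagesDirectory = false;
    $wgExtensionFunctions[] = 'efGit2PagesSetup';

    function efGit2PagesSetup() {
        global $wgGit2PagesDirectory, $wgUploadDirectory;
        if ( $wgGit2PagesDirectory === false ) {
            // Default to a directory under the upload directory rather than
            // a predictable path in the shared /tmp.
            $wgGit2PagesDirectory = "$wgUploadDirectory/git2pages";
        }
    }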
-- ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On Wed, Mar 27, 2013 at 9:31 AM, Daniel Friesen daniel@nadir-seen-fire.com wrote:
Getting slightly offtopic, but a world where people stop spamming us with $wgSecretKey would be nice ;)
However, you're still going to have $wgUpgradeKey, and $wgDBpass ... Perhaps it'd be cool to split LocalSettings.php into LocalSettings.php and PrivateSettings.php
I recommend setting your git directory config to false and in an extension setup function set it to some path based on the upload directory
Given that the upload directory is web accessible (and many people don't even turn off php_engine in that directory [speaking of which, why don't we add that to the default .htaccess for that directory]), having arbitrary git checkouts in such a directory seems kind of scary too.
--bawolff