As we've increased our use of sha1 hashes to identify unique content over
the past year, I occasionally see changesets or discussions about indexing
sha1s in MySQL. When indexing a text field, it's generally beneficial to
define the smallest index that still uniquely matches a high percentage of
rows. Search and insert performance both benefit from the space savings.
As a cryptographic hash function, sha1 has a very high degree of
uniformity. We can estimate the percent of partial index look-ups that
will match a unique result just by comparing the size of the table to the
space covered by the index.
sha1 hashes are 160 bits, which MediaWiki stores in MySQL with base36
encoding. base36(2^160) == "twj4yidkw7a8pn4g709kzmfoaol3x8g". Looking at
enwiki.revision.rev_sha1, the smallest current value is
000002xi72hkkhn1nvfdeffgp7e1w3s and the largest,
twj4yi9tgesxysgyi41bz16jdkwroha.
The number of combinations covered by indexing the top bits represented by
the left-most 4 through 10 characters:
sha1_index(4) = 1395184 (twj4)
sha1_index(5) = 50226658 (twj4y)
sha1_index(6) = 1808159706 (twj4yi)
sha1_index(7) = 65093749429 (twj4yid)
sha1_index(8) = 2343374979464 (twj4yidk)
sha1_index(9) = 84361499260736 (twj4yidkw)
sha1_index(10) = 3037013973386503 (twj4yidkw7)
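To double-check these numbers: the space covered by a k-character prefix
index is just the integer value of the left-most k base36 digits of
base36(2^160). A minimal Python sketch that reproduces the table above:

    MAX36 = "twj4yidkw7a8pn4g709kzmfoaol3x8g"  # base36(2^160), 31 digits

    def sha1_index(k):
        # Integer value of the left-most k digits: roughly the number of
        # distinct prefixes a k-character index can distinguish.
        return int(MAX36[:k], 36)

    for k in range(4, 11):
        print("sha1_index(%d) = %d (%s)" % (k, sha1_index(k), MAX36[:k]))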
percentage of unique matches in a table of 2B sha1s:
sha1_index(7) = 96.92%
sha1_index(8) = 99.91%
sha1_index(9) = 99.997%
sha1_index(10) = 99.9999%
percentage of unique matches in a table of 10B sha1s:
sha1_index(8) = 99.573%
sha1_index(9) = 99.988%
sha1_index(10) = 99.9996%
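These percentages follow from a simple approximation: with n hashes spread
uniformly over a space of size s, a given prefix look-up collides with
another row with probability of about n/s, so roughly 1 - n/s of look-ups
match a unique row. Reusing sha1_index() from the sketch above (expect
small rounding differences from the figures quoted):

    def pct_unique(n_rows, k):
        # Fraction of look-ups expected to match exactly one row, assuming
        # n_rows hashes spread uniformly over sha1_index(k) prefixes.
        return 100 * (1 - float(n_rows) / sha1_index(k))

    for n in (2 * 10**9, 10 * 10**9):
        for k in (7, 8, 9, 10):
            print("%d rows, sha1_index(%d): %.4f%%" % (n, k, pct_unique(n, k)))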
Given current table sizes and growth rates, an 8 character index on a sha1
column should be sufficient for many cases for years to come (e.g. media
files outside of Commons, revisions on projects outside of the top 10),
while a 10 character index still provides >99.99% coverage of 100 billion
sha1s.
Caveat: The rare but plausible worst case for a partial index is a table
with hundreds of rows containing the same sha1, perhaps revisions of a
page that had a crazy revert war. A lookup for that specific sha1 will
have to do secondary lookups for each match, as would lookups of any other
sha1 that happens to collide within the index space. If the index is large
enough to make the latter case quite unlikely, prudent use of caching can
address the former.
<tl;dr> Where an index is desired on a MySQL column of base36 encoded sha1
hashes, I recommend ADD INDEX (sha1column(10)). Shorter indexes will be
sufficient in many cases, but this still provides a >2/3 space savings
while covering a huge (2^51.43) space.
-Asher
Marcin Cieslak <saper(a)saper.info> writes:
> From the PostgreSQL side I see one problem with nesting - we are already
> using savepoints to emulate MySQL's "INSERT IGNORE" and friends.
> It might be difficult to abuse that feature for something more than this.
> There is a class "SavepointPostgres" which is used for that.
Well, that is more of a stop-gap measure anyway. Long-term, we need to
abstract out INSERT IGNORE properly into something that does a select
and conditional insert, or a delete + insert, rather than rely on
throwing, catching, and discarding an error.
In an ideal world, the MySQL code would do that too instead of relying
on INSERT IGNORE.
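For illustration, a minimal sketch of that select-and-conditional-insert
shape in Python DB-API style, with a hypothetical table and columns (this
is not MediaWiki's actual database abstraction layer); note it still needs
to run inside a transaction, or retry on a duplicate-key error, to be safe
under concurrent writers:

    def insert_if_absent(cursor, page_id, title):
        # Emulate INSERT IGNORE without throwing, catching, and
        # discarding an error: look for the row, insert only if missing.
        cursor.execute("SELECT 1 FROM page WHERE page_id = %s", (page_id,))
        if cursor.fetchone() is None:
            cursor.execute(
                "INSERT INTO page (page_id, page_title) VALUES (%s, %s)",
                (page_id, title),
            )
            return True   # inserted
        return False      # already present, i.e. "ignored"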
--
Greg Sabino Mullane greg(a)endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
Hello everyone,
It’s with great pleasure that I’m announcing that Mark Holmquist has joined the Wikimedia Foundation as a Feature Engineer.
Mark is a free software advocate and software engineer. Before joining us, he was getting his B.S. in Computer Science at the University of Redlands. You may remember him from his time as a student working on the WMF org chart before he graduated in June of this year, or as a contractor working on UploadWizard and assorted bug zapping. :-)
He has contributed to many free software projects in the past, including Minetest, Etherpad, and OpenHatch. Recently he has worked on the EtherEditor extension for MediaWiki[0]. BTW, Erik will be wanting to suck your 20% time into finishing out the outstanding bugs. ;-)
On the side, he enjoys playing Bridge[1] and the occasional jaunt for Parkour[2]. He lives in the Mission in foggy San Francisco. He also really enjoys reading and discussing court cases, either oral arguments or decisions. He's subscribed to the Cornell Supreme Court bulletin [3], so bug him about the cases that are in the news. :-)
His first official day was on September 17th, but I flaked out because of the all-staff and then decided to procrastinate some more on this announcement since you already know him from all his participation on IRC and wikitech-l. Until December, he will be working with Gabriel Wicke on Parsoid, the rewrite of the wikitext parser that works with the VisualEditor project (I assume James and Trevor have completed the necessary hazing on the VE Team and he has gotten his t-shirt). But more generally, he'll be on the Features team, working with Editor Engagement and on assorted front-end technical debt here at the WMF.
Please join me in welcoming Mark Holmquist to the Wikimedia Foundation. :-)
Take care,
Terry
[0] http://www.mediawiki.org/wiki/Extension:EtherEditor
[1] http://en.wikipedia.org/wiki/Contract_bridge
[2] https://en.wikipedia.org/wiki/Parkour
[3] http://www.law.cornell.edu/supct/cert/
dear all,
as some of you may already know, the GLAMwiki Toolset Project, http://outreach.wikimedia.org/wiki/GLAM/Toolset_project, is a collaboration between Wikimedia Nederland, Wikimedia UK, Wikimedia France and Europeana, with the goal of providing a set of tools to get materials from GLAM institutions onto Wikimedia Commons in a way that reuse can easily be tracked, and that Commons materials can easily be integrated back into the collection of the original GLAM or even other GLAMs.
as part of our initial goal of creating a GLAM Upload System, we are looking to gather Wikimedia community input on the proposed architecture and technologies. if you have time and interest, please take a look and let us know your thoughts, http://outreach.wikimedia.org/wiki/GLAM/Toolset_project/Request_for_Comment….
with kind regards,
dan
Hey,
So I'm working on https://gerrit.wikimedia.org/r/22167 (Uri class), and
it's failing a unit test. I know the failure is caused by wfUrlencode()
(it only started occurring when I moved wfUrlencode() over to the Uri
class). However, I cannot figure out why it's
failing, because the actual code hasn't really changed at all (in fact some
of wfUrlencode is even copied/pasted). Maybe somebody can offer some
insight?
--
Tyler Romeo
Stevens Institute of Technology, Class of 2015
Major in Computer Science
www.whizkidztech.com | tylerromeo(a)gmail.com
Hi folks,
I am happy to announce that the Wikimedia Foundation has just launched Page Curation, a new suite of tools for reviewing articles on Wikipedia.
Current page patrol tools like Special:NewPages and Twinkle can be hard to use quickly and accurately, and have led to frustration for some users. Page Curation aims to improve that page patrol experience by making it faster and easier to review new pages, using two integrated tools: the New Pages Feed and the Curation Toolbar.
Read the Page Curation announcement on our blog:
http://blog.wikimedia.org/2012/09/25/page-curation-launch/
To learn more, visit our introduction page:
https://en.wikipedia.org/wiki/Wikipedia:Page_Curation/Introduction
If you are an experienced editor, please give Page Curation a try:
https://en.wikipedia.org/wiki/Special:NewPagesFeed
We are also holding IRC office hours on Wednesday, September 26 at 4pm PT (23:00 UTC), during which we will be happy to answer any questions you may have. Please report any issues on our talk page or to our Community Liaison, Oliver Keyes <okeyes(a)wikimedia.org>.
A number of patrollers have already started using Page Curation, and we hope that more curators will adopt this new toolkit over time. A 'release version' was deployed on the English Wikipedia on September 20, 2012, and we plan to make it available to other projects in coming weeks.
This feature was created in close collaboration with editors. We would like to take this opportunity to thank all the community members who patiently guided our progress over the past few months. This includes folks like Athleek123, DGG, Dori, Fluffernutter, Logan, The Helpful One, Tom Morris, Utar and WereSpielChequers, to name but a few. We are deeply grateful for your generous contributions to this project!
We designed Page Curation to offer a better experience, by making it easier for curators to review new pages and by providing more feedback to creators so they can improve Wikipedia together.
We hope that you will find this new tool useful. Enjoy!
Fabrice Florin
Product Manager, Editor Engagement Team
Wikimedia Foundation
User:Fabrice Florin (WMF)
https://en.wikipedia.org/wiki/Wikipedia:Editor_Engagement
>"When a page reaches X level of quality, that version becomes the default." >When creating all the interface message for editing, viewing, and history,
>this is definitely not easy to get right and keep simple for new users. >Anyway, to be clear, you can make the "latest version" the default for all
>pages and manually make the "latest reviewed" version the default on a
>per-page basis already. You just can't use "quality versions" as the default version.
Hi, and thanks for the info. On an immediate practical level I will inform them of this at the wiki, and we will unfortunately be forced to turn off the last approved version as default so as not to put off new users.
In terms of the future, I respectfully disagree with you that there is any essential real-life problem for users if the extension were to be implemented in the way I described. Even for those who share your evaluation (which I don't), there should be no reason not to provide such basic functionality as at least an option. Furthermore, to manually make the "latest reviewed" version the default on a per-page basis is extremely cumbersome (even using bots) and not a reasonable solution to the problem.
To conclude, I personally think the extension you have developed (I was only recently reminded that you are the primary developer) is already one of the most important developments for the future of MediaWiki (and is often underappreciated as such). Please let me know if there is any positive way I can become involved in helping this functionality become part of the program.
Hi.
There's a fairly straightforward request to add an API/Developer/Developer
Hub link to the footer of Wikimedia wikis, near the "Disclaimers", "Mobile
view", etc. links: <https://bugzilla.wikimedia.org/show_bug.cgi?id=33464>.
This should be fairly trivial to implement, but I wanted to check if there
were any objections to such an idea. The basic rationale for including such
a link is that other sites such as Twitter, Facebook, and Wordnik have
similar links. This kind of link encourages viewers to become involved in
development for the project (e.g., Wikipedia), as opposed to the MediaWiki
logo, which serves a somewhat different purpose.
Assuming there aren't any objections to the idea, the only remaining
question seems to be where to point such a link. Brandon suggested
<https://www.mediawiki.org/wiki/Project:Help>. I'm wondering if there's an
"API basics" FAQ or something similar on mediawiki.org that could be
included in whatever landing page is chosen. Something that includes common
questions about the API such as "What formats are available?", "Do I need
an API key?" (which I think is a pretty major question to answer, given
how many of the other APIs I've come across require one), "How can I get
started developing for Wikimedia wikis?", etc. Does anyone know if such a
page exists already?
MZMcBride
On Sep 24, 2012 7:18 PM, "Mark A. Hershberger" <mah(a)everybody.org> wrote:
> On 09/23/2012 06:33 PM, K. Peachey wrote:
> > On Mon, Sep 24, 2012 at 4:03 AM, Mark A. Hershberger <mah(a)everybody.org>
wrote:
> >> On 09/23/2012 12:54 PM, Krinkle wrote:
> >>> https://bugzilla.wikimedia.org/[...]
> >>
> >> Link shortened: http://hexm.de/lp
> >
> > There is no need to shorten URLs in emails. Please don't.
>
> I've personally seen mail readers and MTAs that mangle long URLs. Since
> this sort of mangling happens, I see a need for a way to make URLs
> usable. I use a shortener for those URLs as a precautionary measure and
> to help me communicate with others.
I personally agree it's annoying and wish you didn't. But maybe there are
mangling examples I've not seen.
IMHO, it should usually be enough to either wrap the URL in angle brackets
or put it in a footnote on its own line with just the footnote number.
Anyway, I do certainly think we have bigger fish to fry.
-Jeremy