I have just uploaded a file, 1037c-output.txt
It contains the text from Federal Standard 1037C, after the following operations have been performed:
* stripping of HTML markup (which has left some nasties like 1012 for 10<sup>12</sup>)
* tracing of sources based on tags in text
* stripping of items that contain stuff from non-Federal Govt. sources like NATO or CCITT...
* stripping of bracketed abbreviations from titles
* generated an introductory phrase where possible
* added paragraph breaks in generally sensible places
* added a heading like this {{ title }}
* general wikification (term in bold, etc.)
* classified articles by adding tags like
  # REDIRECT [[whatever]]
  #ONELINER for one-line articles < 120 chars
  #NONTRIVIAL for articles > 200 chars
Approx breakdown (this might be a few versions old):
1241 REDIRECTs
411 ONELINERs
799 mid-length items
2783 NONTRIVIALs
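For anyone curious, the length-based tagging described above amounts to something like the rough Python sketch below. This is not the script that produced the file: the names classify and breakdown, the MIDLENGTH label, and the choice to measure length on the whitespace-stripped text are all assumptions made for illustration. Only the 120 and 200 character thresholds and the REDIRECT syntax come from the description above.

```python
import re
from collections import Counter

ONELINER_MAX = 120    # articles shorter than this get #ONELINER
NONTRIVIAL_MIN = 200  # articles longer than this get #NONTRIVIAL

# Matches "# REDIRECT [[whatever]]" or "#REDIRECT [[whatever]]".
REDIRECT_RE = re.compile(r"^\s*#\s*REDIRECT\s*\[\[(.+?)\]\]", re.IGNORECASE)


def classify(body: str) -> str:
    """Return a tag for one article body (hypothetical helper)."""
    text = body.strip()
    if REDIRECT_RE.match(text):
        return "REDIRECT"
    if len(text) < ONELINER_MAX:
        return "ONELINER"
    if len(text) > NONTRIVIAL_MIN:
        return "NONTRIVIAL"
    # Anything from 120 to 200 chars is a mid-length item (label assumed).
    return "MIDLENGTH"


def breakdown(articles: dict) -> Counter:
    """Count how many articles fall under each tag."""
    return Counter(classify(body) for body in articles.values())


if __name__ == "__main__":
    sample = {
        "whatever (alias)": "# REDIRECT [[whatever]]",
        "short term": "A brief one-line definition.",
        "long term": "x" * 250,
    }
    print(breakdown(sample))
```

Running breakdown over the whole set of articles should give a table like the one above, give or take whatever differs in how the real script measures length.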
It is now in a form where not too much more programming is needed to shove some or all of it into the Wikipedia.
Would this be useful?
Neil
Neil Harris wrote:
> I have just uploaded a file, 1037c-output.txt
> It contains the text from Federal Standard 1037C, after the following operations have been performed:
Just in case other people are as clueless as me, here is what this is: GLOSSARY OF TELECOMMUNICATION TERMS
Most of the entries appear to be short definitions of terms like "harmful interference": http://www.its.bldrdoc.gov/fs-1037/dir-017/_2541.htm
The copyright situation looks good: this is a publication of the U.S. Federal Government, as a standard, and therefore by law not subject to copyright.
As for the content, I'll leave the judgment to other people.
--Jimbo
wikipedia-l@lists.wikimedia.org