On Wed, Jan 28, 2009 at 2:52 PM, David Gerard dgerard@gmail.com wrote:
http://ebiquity.umbc.edu/blogger/2009/01/27/extracting-wikipedia-infoboxes-v...
Some infoboxes are designed for that sort of thing, some aren't. Some have footnotes, for example, and lots of flexibility, which makes the data harder, but not impossible, to parse. And some projects (for good reason) still virulently reject infoboxes, mainly because people who don't understand a particular subject try to force simplified statements (i.e. sentences, not words or numbers) inside an infobox, and lose nuance and context in the process, devaluing the article as a whole (reading the full text is ultimately more educational).
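To illustrate the footnote point, here is a minimal Python sketch of pulling parameters out of infobox wikitext while discarding <ref> footnotes first. It is only a sketch under stated assumptions: the function name parse_infobox and the sample template are made up for illustration, and it assumes one "| key = value" pair per line with no nested templates in the values, which real infoboxes often violate.

```python
import re

# Matches self-closing <ref ... /> tags and paired <ref ...>...</ref> footnotes.
REF_RE = re.compile(r"<ref\b[^>/]*/>|<ref\b[^>]*>.*?</ref>",
                    re.DOTALL | re.IGNORECASE)


def parse_infobox(wikitext):
    """Return {parameter: value} for a simple, un-nested infobox.

    Assumption: each parameter sits on its own line as "| key = value".
    """
    # Drop footnotes first so their contents (often cite templates
    # containing "|" and "=") cannot be mistaken for parameters.
    text = REF_RE.sub("", wikitext)
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        key, _, value = line[1:].partition("=")
        fields[key.strip()] = value.strip()
    return fields


# Hypothetical sample data, not taken from any real article.
sample = """{{Infobox settlement
| name = Springfield
| population_total = 30,720<ref name="census">{{cite web |url=http://example.org |title=Census}}</ref>
| elevation_m = 120 <ref name="usgs" />
}}"""

print(parse_infobox(sample))
```

The values come back as raw strings ("30,720", not a number); turning them into typed data is where the per-infobox flexibility starts to hurt.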
And not all such data is in infoboxes:
http://en.wikipedia.org/wiki/Wikipedia:Metadata
Something I tried to improve, which still needs expansion and TLC.
Some areas of data are in separate templates (not infobox templates) and some are in categories.
I'd like to add some of the data-heavy infoboxes to that list, like the ones in maths, physics, astronomy, geography, geology and chemistry, and the other 'hard' sciences. Are any of those infoboxes organised for the extraction of data the way the geographical co-ords templates are?
Carcharoth
I do think it's worth pointing out that literally every time I've mentioned dislike of infoboxes to non-WPians, the reply has been along the lines of "Why not? They're AWESOME!" I try to explain the objections, but usually the person is so set on the accessibility front that they can't see why anyone would want to avoid the boxes.
It's not just bots that want information in an easily parsed format.
-Luna
WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l
2009/1/28 David Gerard dgerard@gmail.com:
http://ebiquity.umbc.edu/blogger/2009/01/27/extracting-wikipedia-infoboxes-v...
- d.
I saw one of the talks about work in this area on YouTube. Other than the slight problem that they initially focused on rambot articles, their approach looked fairly solid. They also got their bot to the point where it could pretty much write Wikipedia articles by extracting info from the web. The other thing they found was that opening sentences are very standardised, which made them easy to extract information from.
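As a rough illustration of why standardised opening sentences are easy to mine, here is a small Python sketch. It is not the method from the talk; the pattern, the function name parse_lead and the field names are my own assumptions, and it only handles leads of the shape "X is/was a Y [in Z]."

```python
import re

# Assumed shape: "<subject> [(parenthetical)] is|was a|an|the <kind> [in <place>]."
LEAD_RE = re.compile(
    r"^(?P<subject>.+?)\s+"          # article subject
    r"(?:\([^)]*\)\s+)?"             # optional parenthetical, e.g. dates
    r"(?:is|was)\s+(?:a|an|the)\s+"  # copula and article
    r"(?P<kind>.+?)"                 # what the subject is
    r"(?:\s+in\s+(?P<place>.+?))?"   # optional location
    r"\.(?:\s|$)"                    # end of the first sentence
)


def parse_lead(text):
    """Return subject/kind/place from the first sentence, or None if no match."""
    match = LEAD_RE.match(text)
    return match.groupdict() if match else None


print(parse_lead("Springfield is a city in Greene County, Missouri, "
                 "United States. It is the county seat."))
print(parse_lead("Marie Curie (1867-1934) was a physicist and chemist."))
```

A single regular expression like this breaks quickly on abbreviations, nested parentheses and less formulaic leads, so treat it as a demonstration of the idea only.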
Ah here's the relevant video:
http://ca.youtube.com/watch?v=cqOHbihYbhE&NR=1
"David Gerard" dgerard@gmail.com wrote in message news:fbad4e140901280652k19382f7cw7d6b217f610218b3@mail.gmail.com...
http://ebiquity.umbc.edu/blogger/2009/01/27/extracting-wikipedia-infoboxes-v...
(...)
Maybe it can be taken as a request for information about bots already in use or in development. A med student wrote a tool for filling out citations, and jehochman wrote it into a browser applet. It's botISH, meaning that it doesn't write the quote, find which text it is relevant to, or submit the page. If I read the end of it correctly, they want to assign people to do it first, and then have programmers, and then machines, learn from that (which has long been the logical approach, if only to identify whether it is feasible).
If the sites that the bots access are vetted sources... There are a lot of things I do not know about bots, let me tell you that. _______ http://ecn.ab.ca/~brewhaha/finance/Manual_Spam_Control.htm
On Wed, Jan 28, 2009 at 6:52 AM, David Gerard dgerard@gmail.com wrote:
http://ebiquity.umbc.edu/blogger/2009/01/27/extracting-wikipedia-infoboxes-v...
For anyone interested in this topic, have a look at Freebase: http://www.freebase.com/
which has populated most of their database with parsed en:wp infoboxes. They have some interesting code & ideas, and are happy to talk about it (they hosted a WM-SF meetup last year).
-- phoebe
On 2/4/09, phoebe ayers phoebe.wiki@gmail.com wrote:
For anyone interested in this topic, have a look at Freebase: http://www.freebase.com/
For anyone interested in freebasing, please revert as vandalism any attempt to remove this tacky infobox:
http://en.wikipedia.org/w/index.php?title=Westroads_Mall_shooting&diff=p...
There are a few others like it, but most infoboxes are helpful.
—C.W.