(Following on from another thread)
I have a theory that Wikipedia makes only *part* of the Internet not suck. Wikipedians aggregate online knowledge (and offline as well, but let's stick to online here), thus making it easier to find information about something, especially when there are lots of ambiguous hits on a Google search and you don't know enough to refine the search. But the useful parts of the internet (i.e. not social media and similarly transient, information-deficient areas of the internet) didn't stop growing when Wikipedia came along.
In theory, if the growth of the information-dense parts of the internet has continued to outstrip the growth of Wikipedia and the ability of Wikipedians to aggregate that knowledge base, then large parts of it should still "suck" (to continue using that terminology) - i.e. be less amenable to searching due to absent or poorly organised information. I base this on many years of searching daily for information about topics ranging from the well-known to the borderline obscure to the outright obscure.
Over the years since Wikipedia started, the ability to find information online has changed beyond recognition. Around about 2004-5 (I need to check dates here), Wikipedia was rising rapidly up the search rankings, and now comes top or near the top on most searches. But there are still many, many topics on which no articles, or only redlinks, exist. I come across these daily when searching, and see that information on these topics is out there, scattered around if you search on Google, but hasn't been aggregated yet.
The question I have is whether the growth in the amount of unaggregated information (and I include other information-organising sites here, not just Wikipedia) will always outstrip the ability of various processes (including the growth of Wikipedia) to aggregate it into something more useful. If the long-term answer is yes, then information overload is inevitable (and search engines will gradually start to suck again). If the long-term answer is no, then at some point online aggregation (or the co-ordination of data to form information in the real sense) will start to overtake the flow of information from offline to online, and order will continue to emerge from the (relative) chaos.
The key seems to be the quality of the information put online. Well-organised and searchable sites and databases are good. Poorly organised information sources, less so: while they can in theory be found by search engines, they can be harder to distinguish from the background noise. Much also depends on how much information you start with when you set out to search for more.
To take a specific example, I very occasionally come across names of people or topics where it is next-to-impossible to find out anything meaningful about them because the name is identical to that of someone else. Sometimes it is a company that has named itself after something well-known, and any search is swamped by hits for that well-known namesake. Other times, it is someone more famous swamping a relatively obscure person - a recent example I found here is the physicist Otto Klemperer. Despite having the name and profession, it is remarkably difficult to find information about the physicist as opposed to the conductor. If I had a birth year, it would be much easier, of course.
Carcharoth
On 15 January 2011 04:41, Carcharoth carcharothwp@googlemail.com wrote:
> To take a specific example, I very occasionally come across names of people or topics where it is next-to-impossible to find out anything meaningful about them because the name is identical to that of someone else. Sometimes it is a company that has named itself after something well-known, and any search is swamped by hits for that well-known namesake. Other times, it is someone more famous swamping a relatively obscure person - a recent example I found here is the physicist Otto Klemperer. Despite having the name and profession, it is remarkably difficult to find information about the physicist as opposed to the conductor. If I had a birth year, it would be much easier, of course.
This effect happened long before Wikipedia. One example from my experience: it was only once Wikipedia came along that I could find out anything on the Internet about dazzle ships - the camouflage paintwork used on ships in World War I - as opposed to pages and pages about the OMD album.
Then there's the huge bias in Google hits toward computer-related uses of any term whatsoever. What's the first hit on "putty"? Not the construction filler - the whole first page of hits is for an obscure (though very good) piece of computer software.
The effect you describe is part of why search is a hard problem. At least we can say we've alleviated it slightly!
- d.
On 15/01/2011, Carcharoth carcharothwp@googlemail.com wrote:
> To take a specific example, I very occasionally come across names of people or topics where it is next-to-impossible to find out anything meaningful about them because the name is identical to that of someone else. Sometimes it is a company that has named itself after something well-known, and any search is swamped by hits for that well-known namesake. Other times, it is someone more famous swamping a relatively obscure person - a recent example I found here is the physicist Otto Klemperer. Despite having the name and profession, it is remarkably difficult to find information about the physicist as opposed to the conductor. If I had a birth year, it would be much easier, of course.
That's the primary advantage of an encyclopedia of course. It doesn't rely much on the vagaries of English.
On Sat, Jan 15, 2011 at 6:45 PM, Ian Woollard ian.woollard@gmail.com wrote:
> On 15/01/2011, Carcharoth carcharothwp@googlemail.com wrote:
>> To take a specific example, I very occasionally come across names of people or topics where it is next-to-impossible to find out anything meaningful about them because the name is identical to that of someone else. Sometimes it is a company that has named itself after something well-known, and any search is swamped by hits for that well-known namesake. Other times, it is someone more famous swamping a relatively obscure person - a recent example I found here is the physicist Otto Klemperer. Despite having the name and profession, it is remarkably difficult to find information about the physicist as opposed to the conductor. If I had a birth year, it would be much easier, of course.
> That's the primary advantage of an encyclopedia of course. It doesn't rely much on the vagaries of English.
Yeah, but it only helps if there is an entry on the person you are looking for information on! So far I have the date of his PhD (1923) in Berlin from the maths genealogy site:
http://www.genealogy.ams.org/id.php?id=62580
And that he worked with Hans Geiger and was the author of a paper in 1934 ('On the Radioactivity of Potassium and Rubidium'):
http://www.jstor.org/pss/96293
Plug in "Geiger-Klemperer ball counter" to a search engine, and you begin to get more details (there are a number of devices that are 'loosely' called Geiger counters, but are named for the people that developed them, such as Geiger-Muller, Geiger-Klemperer, and Rutherford-Geiger counters).
There is also a William Klemperer (a physicist who has an article on Wikipedia), who is apparently related to the Otto Klemperer who is the famous conductor - but I wonder whether he is in fact related to the Otto Klemperer who is a physicist, and people are confusing the two.
I also found a patent here for an electron lens:
http://www.google.co.uk/patents/about?id=0ClhAAAAEBAJ
Filed by an "Otto Ernst Heinrich Klemperer" on 31 Mar 1944, and issued in December 1946.
Probably the same Otto Klemperer who was the author of "Electron Optics", which is still in print:
http://www.amazon.co.uk/Electron-Optics-Cambridge-Monographs-Physics/dp/0521...
The patent I only just discovered, but that is all I have on this Otto Klemperer at the moment.
Carcharoth
On Sun, Jan 16, 2011 at 8:38 AM, Carcharoth carcharothwp@googlemail.com wrote:
> The patent I only just discovered, but that is all I have on this Otto Klemperer at the moment.
I was rushing out the door when I wrote that last e-mail, but plugging the full name (Otto Ernst Heinrich Klemperer) back into Google does yield a website with the birth and death years:
http://klemperer.co.uk/SophieKoner/index.htm
Otto Klemperer (1899-1987) - the physicist
Otto Klemperer (1885-1973) - the conductor
But rather embarrassingly, I missed that someone has kindly created an article on the physicist:
http://en.wikipedia.org/wiki/Otto_Klemperer_%28physicist%29
Prompted by my posting here.
Thanks! :-)
Carcharoth
Interesting thread and questions. A related question, though, is whether unfettered eternal searchability of the Internet is unambiguously a good thing. Take the types of BLP, privacy, etc. issues we deal with every day on Wikipedia, and extrapolate them to the rest of the 'net....
Newyorkbrad
I think the point is being missed. Wikipedia does not set out to manipulate search engine results, that's just a happy accident of its content being pretty good and many search engines weighting its content appropriately.
We make the internet not suck by putting the information on our website, maintaining it and permitting its re-use and modification subject to a reasonable licence. Our method of organization is thus an alternative to using a search engine. It's far more modest than Google because it's not trying to aggregate everything that's out there. People work on what they find interesting and use the resources they know about.
All anybody needs to know is: Wikipedia exists and it can be found by all decent search engines. Its content is indexed by the same search engines so it's easy to narrow down a search to prioritize content from Wikipedia.
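To make that concrete, a minimal sketch - assuming only the widely supported "site:" operator and Google's standard /search?q= URL form, neither of which is specific to Wikipedia - of scoping a query to English Wikipedia:

    # Build a search URL restricted to English Wikipedia via the "site:" operator.
    from urllib.parse import quote_plus

    def wikipedia_scoped_search_url(terms):
        # Appending site:en.wikipedia.org limits results to that domain.
        query = terms + " site:en.wikipedia.org"
        return "https://www.google.com/search?q=" + quote_plus(query)

    print(wikipedia_scoped_search_url("Otto Klemperer physicist"))
    # https://www.google.com/search?q=Otto+Klemperer+physicist+site%3Aen.wikipedia.org

The same "site:" trick works just as well typed straight into the search box.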
I remember talking to a TV journalist about 15 years ago, no stranger to online life. When I said how useful I found Altavista, the Google of its time, he lamented that it didn't work, because new websites were being created so quickly that there was no possibility it could ever keep its indexes up to date. Again: completely missing the point. We don't need to be able to find every single thing on the internet, only the useful stuff. A huge amount of the useful stuff is on Wikipedia.
On 16/01/2011 23:46, Tony Sidaway wrote:
> We don't need to be able to find every single thing on the internet, only the useful stuff. A huge amount of the useful stuff is on Wikipedia.
This is true, but not particularly "objective". The OP's question itself has merit. The long-term view surely must depend on whether [[Moore's law]] is with us or against us on this issue, for example. The "content" of the Web in particular is limited only by the number of hard drives that can be lashed onto it. The idea that a search engine company could download "all" webpages so as to have a local copy to work on may some day seem laughably naive (and I believe is already obsolescent).
Starting to play with ideas, and taking Carcharoth's main point to be about "cruft" (we have the concept, and do at least try to bear down on cruft on enWP), you get the distinctions cruft/non-cruft on utility, shallow/deep (Berners-Lee), structured data/non-structured. WP is very much shallow-Web, anti-cruft but not snobbishly so, and semi-structured (we have infoboxes even if some of us regard them as trojan horses for tendentiousness). That says a bit more objectively what "useful" might mean, at least.
Charles
On Mon, Jan 17, 2011 at 11:17 AM, Charles Matthews charles.r.matthews@ntlworld.com wrote:
> On 16/01/2011 23:46, Tony Sidaway wrote:
>> We don't need to be able to find every single thing on the internet, only the useful stuff. A huge amount of the useful stuff is on Wikipedia.
> This is true, but not particularly "objective". The OP's question itself has merit. The long-term view surely must depend on whether [[Moore's law]] is with us or against us on this issue, for example. The "content" of the Web in particular is limited only by the number of hard drives that can be lashed onto it. The idea that a search engine company could download "all" webpages so as to have a local copy to work on may some day seem laughably naive (and I believe is already obsolescent).
Yes. My point was not that we would want to find every single thing on the internet (I was careful to exclude large parts of it), but that the useful parts of the internet (which Wikipedia would want to draw upon when constructing new pages or updating existing ones) might be, or may always have been, expanding faster than it is possible to keep up with. Traditional encyclopedias, with a print deadline, didn't have this problem. My question is whether it is possible to attain the "sum of all human knowledge". Aristotle was said to be the "last person to know everything there was to be known in his own time". Retaining the caveats of "useful", can Wikipedia become a pseudo-Aristotle, or is that an unattainable goal?
Talking about the true size of the internet, I'm reminded of the concept of the Deep Web:
http://en.wikipedia.org/wiki/Deep_Web
Carcharoth
I suppose my problem here is understanding how the discussion goes from <the useful part of the web is expanding faster than we can keep up> to <there is a problem with this>.
On deep and semantic web, these are useful concepts that will help us to develop more capable data mining tools, but not essential for our task at hand, which is to present a particular subset of structured, organized human knowledge.
Knowledge is social. We evaluate data as part of a collaboration (Wikipedia merely provides a framework for exploiting this universal human activity). It is unavoidable and irreducible. There is nowhere online a hidden trove of knowledge that we can use without first exposing it to evaluation. And we already have far more potentially useful data than we can ever evaluate so it's a bit pointless worrying about the invisible net in general. Better to use top down methods to identify likely sources (some of which are currently invisible).
On 17/01/2011 15:30, Tony Sidaway wrote:
> I suppose my problem here is understanding how the discussion goes from <the useful part of the web is expanding faster than we can keep up> to <there is a problem with this>.
I believe the "mission statement" approach to WP would necessarily find troubles with this phenomenon. Of course we can take the "sum of all knowledge" (online and offline) with a pinch of salt; that's what mission statements are for. But notice that the built-in inclusionism of addressing the issue that way has the practical effect of forcing us to build up expertise and criteria. CSD and notability guidelines are there to solve (for example) the issue of "garage bands with a MySpace page aren't necessarily encyclopedic", but not that issue alone. Across broad areas some sifting goes on.
> On deep and semantic web, these are useful concepts that will help us to develop more capable data mining tools, but not essential for our task at hand, which is to present a particular subset of structured, organized human knowledge.
We must look both at the "blue sky research" approach, and the pragmatic business of presenting a properly edited and categorised piece of hypertext to the world, in real time. If we treat the mining options as essentially irrelevant, we are planning our own obsolescence.
> Knowledge is social. We evaluate data as part of a collaboration (Wikipedia merely provides a framework for exploiting this universal human activity). It is unavoidable and irreducible. There is nowhere online a hidden trove of knowledge that we can use without first exposing it to evaluation. And we already have far more potentially useful data than we can ever evaluate so it's a bit pointless worrying about the invisible net in general. Better to use top down methods to identify likely sources (some of which are currently invisible).
Well, I agree with the last part, since it fits in with my approach as of today. WP can still usefully gobble down existing old reference material, and if that is done by making it visible on Wikisource on the way, so much the better. Given the reactions of others to this concept, I think you'd be wise to admit that "evaluation" is pluralistic in nature.
Charles
On 18 January 2011 10:56, Charles Matthews charles.r.matthews@ntlworld.com wrote:
> On 17/01/2011 15:30, Tony Sidaway wrote:
>> I suppose my problem here is understanding how the discussion goes from <the useful part of the web is expanding faster than we can keep up> to <there is a problem with this>.
> I believe the "mission statement" approach to WP would necessarily find troubles with this phenomenon. Of course we can take the "sum of all knowledge" (online and offline) with a pinch of salt; that's what mission statements are for. But notice that the built-in inclusionism of addressing the issue that way has the practical effect of forcing us to build up expertise and criteria. CSD and notability guidelines are there to solve (for example) the issue of "garage bands with a MySpace page aren't necessarily encyclopedic", but not that issue alone. Across broad areas some sifting goes on.
Well, I think you answered the implicit question: naive "mission statements" involving terms like "sum of all knowledge" aren't of much practical use.
>> On deep and semantic web, these are useful concepts that will help us to develop more capable data mining tools, but not essential for our task at hand, which is to present a particular subset of structured, organized human knowledge.
> We must look both at the "blue sky research" approach, and the pragmatic business of presenting a properly edited and categorised piece of hypertext to the world, in real time. If we treat the mining options as essentially irrelevant, we are planning our own obsolescence.
No complaints there. We can continue writing an old fashioned encyclopedia or (one day) become more semantically oriented, or whatever else comes along.
My issue with the semantic option as it stands at present is that it's incompatible with our current goal. It would be fine to write a search engine to wander off and aggregate all cricket statistics to produce the ultimate cricket encyclopedia, but we can't do that without a reliable free source. The free semantic infrastructure doesn't exist, and that's even before we work out how we assess the reliability of the information from various sources. It's neither essential for the task in hand, nor is it clear to me that it will ever be something this project can do. Perhaps in ten years' time my qualms will appear laughable, but for now the semantic web hasn't yet encountered its equivalent of the Codd revolution that has made modern databases such a doddle, and we're talking about much more ambitious processes than those described by relational calculus.
>> Knowledge is social. We evaluate data as part of a collaboration (Wikipedia merely provides a framework for exploiting this universal human activity). It is unavoidable and irreducible. There is nowhere online a hidden trove of knowledge that we can use without first exposing it to evaluation. And we already have far more potentially useful data than we can ever evaluate so it's a bit pointless worrying about the invisible net in general. Better to use top down methods to identify likely sources (some of which are currently invisible).
> Well, I agree with the last part, since it fits in with my approach as of today. WP can still usefully gobble down existing old reference material, and if that is done by making it visible on Wikisource on the way, so much the better. Given the reactions of others to this concept, I think you'd be wise to admit that "evaluation" is pluralistic in nature.
Pluralistic as in social, or pluralistic as in multi-faceted? Either way, no argument there.
On 19/01/2011 00:05, Tony Sidaway wrote:
> On 18 January 2011 10:56, Charles Matthews charles.r.matthews@ntlworld.com wrote:
>> On 17/01/2011 15:30, Tony Sidaway wrote:
>>> I suppose my problem here is understanding how the discussion goes from <the useful part of the web is expanding faster than we can keep up> to <there is a problem with this>.
>> I believe the "mission statement" approach to WP would necessarily find troubles with this phenomenon. Of course we can take the "sum of all knowledge" (online and offline) with a pinch of salt; that's what mission statements are for. But notice that the built-in inclusionism of addressing the issue that way has the practical effect of forcing us to build up expertise and criteria. CSD and notability guidelines are there to solve (for example) the issue of "garage bands with a MySpace page aren't necessarily encyclopedic", but not that issue alone. Across broad areas some sifting goes on.
> Well, I think you answered the implicit question: naive "mission statements" involving terms like "sum of all knowledge" aren't of much practical use.
Questionable, really. WP's success has a lot to do with combining the naive outlook with the practical: making the idealistic outline into comprehensible activities people can actually go and do right now. Remember the site has got this far without any management to speak of (as far as content is concerned) ...
Charles