Hello Ziko,
Thanks for your mail! I responded inline below.
On 6 April 2018 at 03:04, Ziko van Dijk zvandijk@gmail.com wrote:
Hello,
A most interesting thread, as it touches the topic from different angles. I agree that it actually needs a study among readers about their preferences.
As I mentioned to Leila, the ESWC paper does work with editors, but I agree, more thought and work should be done on actual Wikipedia readers.
Personally, I have some doubt whether creating sentences from the data improves an ArticlePlaceholder (as was done in the geographical "articles" created by bots). The data itself is best suited to databases, to be looked up in a table. Reading "Berlin has 3,500,000 inhabitants" is not really an improvement over "Berlin / inhabitants: 3,500,000".
Sentences have the most power when they combine information into knowledge, as in "Berlin's population, currently 3,500,000, was very different during the Cold War because of the declining attractiveness for businesses".
In general, I would advise against one-sentence summaries; a reader might be disappointed when they come via Google to a website and then find only one sentence.
Just to clarify: the summaries do generate information from multiple triples. This means the sentences are a bit more complex than just verbalizing one triple per sentence. However, even with a neural network, there is a limit to how much context we can produce for each sentence. Therefore, we integrated the question of how editors work with the data, as we see it as an important aspect of the workflow. ArticlePlaceholder can be a better option than no information at all, but the ideal would still be an actual editor picking up a topic and writing and maintaining a full article. Furthermore, in our current (theoretical) design we still keep all the information available from Wikidata in the form of triples. So we don't replace any information; we just add a sentence that is more reader-friendly and gives a first overview before the reader looks at the pure triples.
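As a toy illustration of the difference between showing triples as table rows and combining several triples into one sentence, here is a minimal sketch. It is not the neural model from the paper: the triples, the property names, and the hand-written template in `as_summary` are all made up for illustration.

```python
# Toy illustration only: hypothetical triples and a hand-written template,
# not the neural model described in the paper.
triples = [
    ("Berlin", "instance of", "city"),
    ("Berlin", "country", "Germany"),
    ("Berlin", "inhabitants", "3,500,000"),
]


def as_table_rows(triples):
    """Render each triple on its own as 'subject / property: value'."""
    return [f"{s} / {p}: {o}" for s, p, o in triples]


def as_summary(triples):
    """Combine several triples about one subject into a single sentence."""
    subject = triples[0][0]
    facts = {p: o for _, p, o in triples}
    return (
        f"{subject} is a {facts['instance of']} in {facts['country']} "
        f"with {facts['inhabitants']} inhabitants."
    )


print("\n".join(as_table_rows(triples)))
print(as_summary(triples))
```

The table rows stay available unchanged; the sentence is only an additional first overview built from the same data.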
(I hope I understood the question well; I cannot follow the math in your article. Is there an example of your "summaries" to read anywhere?)
The summaries are learned from the first sentences of Wikipedia articles, so they have the same kind of structure and content. If you're able to read Arabic or Esperanto, generated sentences can be found here: https://github.com/pvougiou/Mind-the-Language-Gap/tree/master/Results/Our%20...
Cheers, Lucie
2018-04-05 22:50 GMT+02:00 Leila Zia leila@wikimedia.org:
Hi Lucie-Aimée,
Nice to see work in this direction is progressing. Some comments in-line.
On Wed, Apr 4, 2018 at 7:49 AM, Lucie-Aimée Kaffee kaffee@soton.ac.uk wrote:
Therefore, we worked on producing sentences from the information on Wikidata in the given language. We trained a neural network model; the details can be found in the preprint of the NAACL paper here: https://arxiv.org/abs/1803.07116
It would be good to do human (both readers and editors, and perhaps both sets) evaluations for this research, too, to better understand how well the model is doing from the perspective of the experienced editors in some of the smaller languages as well as their readers. (I acknowledge that finding experienced editors when you go to small languages can become hard.)
Furthermore, we would love to hear your input: Do you believe one-sentence summaries are enough, or can we serve the communities' needs better with more than one sentence?
This is a hard question to answer. :) The answer may rely on many factors including the language you want to implement such a system in and the expectation the users of the language have in terms of online content available to them in their language.
Is this still true if longer abstracts would be of lower text quality?
Same as above. You are signing yourself up for more experiments. ;)
I would be interested to know:
- What is the perception of the readers of a given language about Wikipedia if a lot of the articles they go to in their language have one sentence (accurate to a good extent), a few sentences but with some errors, or more sentences with more errors, versus not finding the article they're interested in at all?
- Related to the above: what is the error threshold beyond which brand perception turns negative (to be defined, maybe by measuring whether the user returns in the coming week or month)? This may well differ across languages and cultures.
- Depending on the result of the above, we may want to look at offering the user the option to access that information, but outside of Wikipedia, or inside Wikipedia but very clearly labeled as machine-generated, as you do to some extent in these projects.
What other interesting use cases for such a technology in the Wikimedia world can you imagine?
The technology itself can have a variety of use-cases, including providing captions or summaries of photos even without layers of image processing applied to them.
Best, Leila
[1] https://www.mediawiki.org/wiki/Extension:ArticlePlaceholder and https://commons.wikimedia.org/wiki/File:Generating_Article_Placeholders_from_Wikidata_for_Wikipedia_-_Increasing_Access_to_Free_and_Open_Knowledge.pdf
[2] https://eprints.soton.ac.uk/413433/1/Open_Sym_Short_Paper_Wikidata_Multilingual.pdf
--
Lucie-Aimée Kaffee
Web and Internet Science Group
School of Electronics and Computer Science
University of Southampton
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Thank you, Lucie, for making the effort to answer in detail. As I said, I am afraid I cannot really understand your paper, as I come from the humanities. And of course, a study of reader expectations was not part of your paper and research. Personally, I would start there, and I know that Wikipedia research has always paid more attention to contributors than to readers.
You are actually raising a new issue: what is useful for readers is one thing. The other is: does an ArticlePlaceholder help an editor to improve an article? I would suppose that it is best to start the article on your own, but that may depend on the topic of the article.
I do speak Esperanto, by chance. :-) https://eo.wikipedia.org/wiki/Uzanto:Ziko
Kind regards, Ziko
In reply to Lucie-Aimée Kaffee kaffee@soton.ac.uk, Sat, 7 Apr 2018, 16:24: