On these topics, I also found articles on the related subjects of “language change” [7] and “historical linguistics” [8].
In these regards, understanding languages to be organic and dynamic, I think that we scientists and scholars should, over time, measure and evaluate the outputs of large-scale NLG systems, e.g., Abstract Wikipedia, comparing these generated corpora to other human- and machine-generated corpora. For instance, how might various measurements of the English generated by Abstract Wikipedia compare to measurements of the English Wikipedia as a corpus?
It is interesting to consider which varieties of measurements, evaluations, and analytics will be useful for NLG development teams – measuring objectivity and subjectivity is but one topic of many – and how those teams will best utilize these measurements when versioning documents, software logic, and data to hone and fine-tune system outputs.
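As a rough sketch of what such corpus-level measurements might look like, here is a minimal Python example. The two sample strings are placeholders for real corpora, and the statistics shown (type-token ratio, mean sentence length) are illustrative choices rather than a fixed methodology:

import re

def corpus_measurements(text: str) -> dict:
    # A few simple descriptive statistics for a text corpus.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "tokens": len(tokens),
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
        "mean_sentence_length": len(tokens) / max(len(sentences), 1),
    }

generated = "The city was founded in 1850. It grew quickly."      # stand-in for NLG output
reference = "Founded in 1850, the city grew quickly thereafter."  # stand-in for human text

for name, corpus in (("generated", generated), ("reference", reference)):
    print(name, corpus_measurements(corpus))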
Best regards,
Adam
[7] https://en.wikipedia.org/wiki/Language_change
[8] https://en.wikipedia.org/wiki/Historical_linguistics
From: Adam Sobieski
Sent: Tuesday, October 5, 2021 3:37 PM
To: General public mailing list for the discussion of Abstract Wikipedia and Wikifunctions
Subject: [Abstract-wikipedia] Re: Objectivity and Subjectivity in Computational Historical Narration
Doug,
Thank you. Technologies like BERT, GPT-3, and LaMDA [1] are, indeed, impressive, and it is also interesting to consider what might be on the horizon in AI.
The thoughts on language that you shared remind me of “social constructionism” [2] and “lexical entrainment” [3]. You also describe language as dynamic and “relative in time and location”, which reminds me of philology [4] and cognitive philology [5].
On the topics of computational historical narration – generating objective history-educational documents from structured (wiki)data – my opinion is that we can measure subjectivity and objectivity increasingly well, e.g., with frame analysis and sentiment analysis, and that these measurements and evaluations can be of use for training AI, for implementing NLG algorithms, and for implementing NLU technologies, with applications of NLU including word-processing-related tools, products, and services (see also: [6]).
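As one concrete example of such measurement, the TextBlob library’s pattern-based sentiment analyzer reports a subjectivity score between 0.0 (objective) and 1.0 (subjective) alongside polarity. A minimal sketch, with hypothetical example sentences:

from textblob import TextBlob  # pip install textblob

sentences = [
    "The treaty was signed in 1648.",                       # reads as objective
    "The treaty was a disastrous, humiliating surrender.",  # reads as subjective
]
for sentence in sentences:
    s = TextBlob(sentence).sentiment  # namedtuple: (polarity, subjectivity)
    print(f"subjectivity={s.subjectivity:.2f}  polarity={s.polarity:+.2f}  {sentence}")

Scores like these are rough signals rather than ground truth, but they illustrate the kind of per-sentence measurement that could feed into corpus-level evaluations.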
Best regards,
Adam
[1] https://blog.google/technology/ai/lamda/
[2] https://en.wikipedia.org/wiki/Social_constructionism
[3] https://en.wikipedia.org/wiki/Lexical_entrainment
[4] https://en.wikipedia.org/wiki/Philology
[5] https://en.wikipedia.org/wiki/Cognitive_philology
[6] https://github.com/w3c/document-services
From: Douglas Clark
Sent: Tuesday, October 5, 2021 11:52 AM
To: General public mailing list for the discussion of Abstract Wikipedia and Wikifunctions
Subject: [Abstract-wikipedia] Re: Objectivity and Subjectivity in Computational Historical Narration
Adam,
I like to think of language as a field [1]. Each discrete meaning occupies a position within the field. Each discrete meaning is a concept (not a word) that can be conveyed with any word grouping as long as the meaning is the same. A concept that appears the same in word structure but differs with context is a separate meaning. “Bring me the server” can refer to a restaurant setting as well as an IT setting. They are two very separate meanings. Even at the most abstracted level, one is bringing an object, and the other bringing a person. With use, meanings accrete influence following Zipf’s inverse power law [2]. Context is a binding force within the field. A language field is inert and completely objective until acted upon by an observer. Each communication updates the field meanings and modifies the contextual binding force. Fields nest, from a person’s language field, to a family’s, to a field of legal jargon, to fields of slang, to the field of all human communication. They all follow Zipf’s Law, and they all use context to limit meaning choices to drive understanding.
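To make the Zipfian rank-frequency pattern concrete, here is a minimal Python sketch that counts word frequencies in a corpus and prints frequency times rank, which stays roughly constant when the distribution is Zipfian. The corpus file name is a placeholder:

from collections import Counter
import re

text = open("corpus.txt").read()  # placeholder: any sizable text corpus
counts = Counter(re.findall(r"[a-z']+", text.lower()))
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    # Under Zipf's law, freq * rank is approximately constant.
    print(f"{rank:>2}  {word:<12}  freq={freq:<8}  freq*rank={freq * rank}")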
When observer A wants to communicate their subjective reality, the context they set within the shared language field constrains the possible direction of the communication. For centuries, pizza probably had zero contextual relationship to pedophilia, yet now has a fairly strong contextual binding. A discussion about taking a walk has massive contextual binding to meanings with outdoor settings. Conversely, a conversation with an initial context of making baked beans is not very likely to transition next to the topic of SpaceX’s upcoming launch. It is much more likely to include topics such as firing up the BBQ and getting the cornbread out of the oven.
So while the initial observer’s selection of context is driven by a subjective input, the attraction (I think of it as meaning gravity) of high-use meanings (Zipf’s Law) and contextual constraints for follow-on topics morph communication toward objectivity among the participating observers. For any outside observer, the communication would still be perceived as subjective, since they did not participate in the communication by setting their own contexts. However, if a linguist came along 100 years later, the communication would appear as objective documentation of the communication event. So, language use is also relative in time and location.
So that’s a bunch of stuff to say that using machine learning to generate language will be as objective as the field(s) ingested for both training and building the model. If the ML uses teen slang for training, it will not perform well for aerospace uses, but it will objectively represent the current state of teen slang. Observers constrained to the contexts of teen slang would find it exceedingly difficult to use the ML products for microbiology. Further, if the field were 1950s teen slang, today’s teens would find little use in any ML product.
Most training datasets in use today, including GPT-3’s, use huge corpora of words and word n-grams. Context is set by word distances. Those rules do not represent the language field I have described. The rules are abstracted and artificial. I do not choose language by paying attention to word spacing. Observers do not communicate with words; they communicate with concepts in contexts. That is why word choice often does not matter, if the concept and the context are the same. Shakespeare got it:
“What's in a name? That which we call a rose
By any other word would smell as sweet..."
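To make the n-gram point concrete, here is a minimal Python sketch of that representation, reusing the “server” example from above; note that nothing in it distinguishes the restaurant meaning from the IT meaning:

def ngrams(tokens, n):
    # Word n-grams: "context" is literally fixed-width adjacency.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "bring me the server".split()
print(ngrams(tokens, 2))
# [('bring', 'me'), ('me', 'the'), ('the', 'server')]
# The bigrams are identical whether "server" means a person or a machine.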
That’s why I advocate for the building of the universal language field. The only way to objectively generate language for all observers is to build a 99.9999% near-real-time complete field of human communication and then let context metadata resolve the language to the appropriate subfield(s). We have language fields going back to at least 3100 BCE. ML is great at using categories (metadata) to parse data into meaningful models. For example, there are indications that communication that includes long distances between concepts (with respect to the field), or communication with low context bindings, may indicate the presence of misinformation. We would see that instantly if we had ML searching for communication meeting those criteria. It could also signal a breakthrough in some human endeavor, as connecting seemingly unrelated things is a hallmark of innovation. We could use algorithms like Polya’s urn to help identify innovating communication. Lastly, we have dictionaries, we have encyclopedias, we have thesauri, but we do not have a reference of human communication. We couldn’t before, because the curation was beyond a lifetime of effort. We can now.
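For reference, the basic Polya urn process is only a few lines of Python: draw a ball at random, then return it along with one more of the same color, so that high-use meanings keep accreting influence. Models of innovation typically use variants that occasionally add balls of brand-new colors; this sketch shows only the base rich-get-richer dynamic:

import random

def polya_urn(steps, colors=("a", "b"), seed=0):
    # Each draw reinforces the drawn color: rich-get-richer dynamics.
    rng = random.Random(seed)
    urn = list(colors)           # one ball of each color to start
    for _ in range(steps):
        ball = rng.choice(urn)   # draw uniformly at random
        urn.append(ball)         # return it with one extra of the same color
    return {c: urn.count(c) for c in colors}

print(polya_urn(1000))  # heavily skewed splits are typical; exact counts depend on the seed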
Doug
On Wed, Sep 29, 2021 at 7:09 PM Adam Sobieski <adamsobieski@hotmail.com> wrote:
Wikidata,
Abstract Wikipedia,
Hello. I have recently been thinking about objectivity and subjectivity with respect to natural language generation, in particular in the context of story generation using historical data [1][2].
In the near future, digital humanities scholars – in particular historians – could modify collections of data and fine-tune generation-related parameters, watching as the resultant multimodal historical narratives emerge and vary. In these regards, we can envision both computer-aided and automated historical narrative generation tools and technologies.
Could AI be a long-sought objective narrator for historians? Is all narration, or all language use, inherently subjective? What might the nature of “generation-related parameters” and “fine-tuning” be for style and subjectivity [3][4][5][6][7][8] when generating natural language and multimodal historical narratives from historical data [1][2]?
Thank you. Hopefully, these topics are interesting.
Best regards,
Adam Sobieski
[1] Metilli, Daniele, Valentina Bartalesi, and Carlo Meghini. "A Wikidata-based tool for building and visualising narratives." International Journal on Digital Libraries 20, no. 4 (2019): 417-432.
[2] Metilli, Daniele, Valentina Bartalesi, Carlo Meghini, and Nicola Aloia. "Populating narratives using Wikidata events: An initial experiment." In Italian Research Conference on Digital Libraries, pp. 159-166. Springer, Cham, 2019.
[3] https://en.wikipedia.org/wiki/Subjectivity
[4] https://en.wikipedia.org/wiki/Objectivity_(philosophy)
[5] https://en.wikipedia.org/wiki/Political_subjectivity
[6] https://en.wikipedia.org/wiki/Framing_(social_sciences)
[7] https://en.wikipedia.org/wiki/Focalisation
[8] https://en.wikipedia.org/wiki/Point_of_view_(philosophy)