Why is it not a good idea to generate paragraphs from a Twitter stream about a selected topic?
I have been working on this problem for quite a long time. As a writer and author, I am reasonably well versed in copyright issues, and as a Twitter bot maker of some years, I am also familiar with the vicissitudes of the Twitter gods. In terms of plagiarism, I am reminded of what my father once said, tongue in cheek: “Everyone steals; it’s called research.”
Perhaps rightfully, Twitter takes a dim view of ripping off other people’s tweets verbatim, and will shut this kind of automation down sooner or later. Personally, I don’t consider reusing text of less than one sentence to be copyright infringement, and tweets are rarely complete sentences. More often than not, tweets are incoherent in terms of proper English, in which case text normalization is the order of the day.
I have spent a good deal of effort trying to normalize tweets into proper, complete sentences (in order to feed them into chatbots or dialog systems). The only success I’ve had with this has been with elaborate regex arrays; for instance, I lost some 2000 such cloud pipelines when Yahoo! Pipes shut down recently. Regex normalization would qualify as a form of traditional rule-based AI. However, what I discovered was that this sort of normalization is very close to natural language generation.
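To make the idea of a "regex array" concrete, here is a minimal sketch in Python. It is not a reconstruction of my old pipelines; the rule list, the `normalize` function, and the sample tweet are all hypothetical, chosen only to show how an ordered list of pattern and replacement pairs can push a tweet toward a complete sentence.

```python
import re

# Hypothetical rule array: (pattern, replacement) pairs applied in order.
# Real-world rule sets tend to be far longer than this.
RULES = [
    (re.compile(r"https?://\S+"), ""),       # drop URLs
    (re.compile(r"^RT\s+@\w+:?\s*"), ""),    # drop a leading retweet marker
    (re.compile(r"@\w+"), ""),               # drop remaining @mentions
    (re.compile(r"#(\w+)"), r"\1"),          # keep the hashtag word, drop '#'
    (re.compile(r"\s+"), " "),               # collapse runs of whitespace
]


def normalize(tweet: str) -> str:
    """Apply each rule in order, then tidy capitalization and end punctuation."""
    text = tweet
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    text = text.strip()
    if not text:
        return ""
    # Capitalize the first character and make sure the sentence is terminated.
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text


print(normalize("RT @bob: loving the new #python release https://t.co/abc"))
# prints: Loving the new python release.
```

The order of the rules matters: the retweet marker has to be removed before the generic @mention rule, otherwise a stray "RT :" would be left behind.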
In short, especially when using n-gram based Twitter data, I do not think there are any ethical issues in repurposing what is essentially junk into something more coherent, topic-based or not. Also, in my experience, there are no practical issues with feeding differentiated amalgamations back into Twitter.
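For readers unfamiliar with n-gram based generation, the following is a rough sketch, assuming the simplest case of bigrams (n = 2). The function names `build_bigram_model` and `generate`, and the toy input sentences, are illustrative assumptions rather than anything I have actually run in production.

```python
import random
from collections import defaultdict


def build_bigram_model(sentences):
    """Map each word to the list of words observed directly after it."""
    model = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model


def generate(model, start, max_words=20, seed=None):
    """Walk the bigram model from `start`, picking a random follower each step."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(max_words - 1):
        followers = model.get(word)
        if not followers:
            break  # dead end: this word was never seen with a follower
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Toy corpus standing in for normalized tweets on one topic (hypothetical data).
corpus = ["the cat sat", "the dog sat"]
model = build_bigram_model(corpus)
print(generate(model, "the", seed=1))
```

Because each next word is drawn from fragments of many different tweets, the output is an amalgamation rather than a verbatim copy of any single one, which is the point being made above.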