V.P.Fomenko, T.G.Fomenko, A.T.Fomenko
AUTHOR'S INVARIANT FOR RUSSIAN LITERARY TEXTS

9. NUMERICAL EXAMPLES

Since, as we found, the graphs for samples of 16000 words are of the greatest interest, we consider only this case.

Let us present the table of several tested parameters for I.S.Turgenev and L.N.Tolstoi', namely for:

No.3 = frequency (percentage) of all form-words,
No.1 = number of words in a sentence,
No.2 = number of syllables in a word,
No.9 = number of form-words in a sentence,
No.7 = frequency (percentage) of the preposition "v" (English: "in"),
No.8 = frequency (percentage) of the particle "ne" (English: "not").

It is clear that parameters Nos.3 and 2 have the smallest deviations, namely 0,016 and 0,023 for Turgenev, and 0,020 and 0,08 for Tolstoi'. But parameter No.2 cannot be considered an author's invariant, because its values are practically the same ("coincide") for almost all writers on our list. For example, it equals 2,17 for Turgenev and 2,16 for Tolstoi'. This means that, from the point of view of parameter No.2, all the authors "are glued into one author": we cannot distinguish them.

It turns out that parameter No.3, the frequency of all form-words, is not only an author's invariant but also distinguishes many authors. For example, it equals 22,24 for Turgenev and 23,62 for Tolstoi'. The difference is 1,38, which is more than the fluctuations of this parameter within the texts of Turgenev and of Tolstoi'.

For the collection of writers analyzed in our experiment, the values of parameter No.3 vary from 19,4% to 27,5%, i.e. its range is considerably larger than the mean fluctuation of this parameter within the texts of any of these authors (except one).

Let us present the table of values of parameters Nos.3, 7, 8 for Gogol', Gerzen, Dostoevskii', Leonov and Fadeev.

Let us present the table of values of parameters Nos.3, 1, 2, 9 for Goncharov and Leskov.

The texts of Gor'kii' are characterized by high stability of parameter No.3, namely: 22,02, 22,21, 22,20, 22,17, et cetera. The mean value is 22,15 and the deviation is 0,009.

Because parameter No.3, the frequency of form-words, is characterized by remarkable stability and by its "separating property for different authors", it is very interesting to analyze its fluctuation as a function of the sample size. Let us present the table showing how the deviation (from the mean value) depends on the sample size.

As we see from the Table, the stabilization of parameter No.3 sometimes begins at a sample size of less than 16000 words. This is especially true for the Russian writers of the XVIII century. For example, for Karamzin the author's invariant stabilizes starting from a sample size of 8000 words, and for Fonvizin also from 8000 words. This fact possibly indicates a greater stability of style among Russian authors of the XVIII century, especially in comparison with the authors of the XIX and XX centuries.

This fact, the early stabilization of the invariant, shows that in some cases we can use the author's invariant (frequency of all form-words) for texts of small volume as well. But for large-scale research it is necessary to use samples of 16000 words, because only at this sample size does parameter No.3 stabilize simultaneously for all writers under investigation.
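The computation of parameter No.3 over samples of a fixed size can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually used in the experiment: the set FORM_WORDS is a tiny hypothetical list of transliterated prepositions, conjunctions and particles (the full list is not given here), the function names are ours, and the text is assumed to be already split into words.

```python
from typing import List, Sequence

# Hypothetical, deliberately tiny list of transliterated Russian form-words
# (prepositions, conjunctions, particles). A real study needs the full list.
FORM_WORDS = {"v", "na", "i", "no", "ne", "zhe"}


def form_word_percentage(words: Sequence[str]) -> float:
    """Percentage of form-words among all words of one sample."""
    if not words:
        raise ValueError("empty sample")
    hits = sum(1 for w in words if w.lower() in FORM_WORDS)
    return 100.0 * hits / len(words)


def sample_percentages(words: Sequence[str], sample_size: int = 16000) -> List[float]:
    """Split the text into consecutive full samples and compute the
    form-word percentage of each one. An incomplete tail is dropped."""
    full_samples = len(words) // sample_size
    return [
        form_word_percentage(words[k * sample_size:(k + 1) * sample_size])
        for k in range(full_samples)
    ]
```

Calling sample_percentages with sample_size=8000 and then with sample_size=16000 on the same text gives two sequences of values whose spread around the mean can be compared. How the "deviation" in the tables is defined is not restated in this section, so the sketch stops at the per-sample percentages.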

After finding the author's invariant for the 22 writers listed above, we extended our experiment to the texts of five more writers: A.N.Ostrovskii', A.K.Tolstoi', V.A.Zhukovskii', A.S.Pushkin and A.P.Chehov. We selected their prose texts of large volume. This extended experiment confirmed the high stability of parameter No.3 for the sample size of 16000 words, and confirmed its ability to distinguish different groups of authors. Thus, the list of writers for whom parameter No.3 turns out to be stable was extended from 22 to 27.

10. HOW CAN WE APPLY THE DISCOVERED AUTHOR'S INVARIANT?

One possible application of this author's invariant is the detection of plagiarism and the attribution of texts. We can suggest the following natural method. If the values of parameter No.3 (frequency of form-words) for two texts under investigation differ by more than 1%, then there is a serious argument for attributing these texts to two different authors. The larger the difference, the stronger the suspicion.

On the other hand (as in the problem of establishing paternity), close values of the author's invariant do not mean that the analyzed texts were written by one and the same author. As we have noted, there are writers with similar values of the invariant. For example, for Leonov and Fadeev the values of the invariant are 23,08 and 23,40 respectively.
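The one-sided character of this rule can be made explicit in a short Python sketch. It is only an illustration: the function name and the returned labels are ours, and the threshold of "1%" is read here as a difference of 1,0 in the percentage values themselves, which is an assumption.

```python
def compare_invariants(freq_a: float, freq_b: float, threshold: float = 1.0) -> str:
    """Compare two form-word percentages (parameter No.3).

    A large difference is an argument for different authors; a small
    difference proves nothing, so the answer is then "inconclusive".
    The threshold of 1.0 is an assumed reading of the "1%" in the text.
    """
    difference = abs(freq_a - freq_b)
    if difference > threshold:
        return "likely different authors"
    return "inconclusive"


# Values quoted in the text.
turgenev, tolstoi = 22.24, 23.62
leonov, fadeev = 23.08, 23.40

print(compare_invariants(turgenev, tolstoi))
print(compare_invariants(leonov, fadeev))
```

The function deliberately never returns "same author": close values, as for Leonov and Fadeev, leave the question open.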

Besides that, we need to be very careful in applying this method to texts of small volume. Let us illustrate the difficulties that arise in this situation with the example of the large and small texts of A.P.Chehov. We calculated his parameter No.3 (frequency of all form-words) over all his texts in the complete edition of his writings published in Moscow in 1960-1964. It turned out that this parameter behaves as follows.

The difference between the values of parameter No.3 for Chehov's early short stories (volumes I-V) and the large novels of the latest period of his activity (volumes VI-VIII) is quite large (Fig.8).

Fig.8

In his early texts the frequency of form-words is lower. Moreover, the fluctuations of this frequency are larger there than in the large, latest texts. In other words, Chehov's large (latest) writings are characterized by higher stability of the author's invariant. Let us recall once more that the same is true for the large texts of all the other 26 writers on our list. In this sense Chehov is no exception: parameter No.3 "works very well" for all his large writings.

In conclusion, let us note an interesting fact. It turns out that the percentage of all form-words is most stable (for sample sizes of 8000 and 16000 words) for prose texts, and is less stable for poetry. It would be very interesting to analyze this effect more precisely.

The discovery of the author's invariant for Russian literary texts supports the conjecture that analogous author's invariants exist for other languages. Of course, they may differ from the frequency of all form-words. Author's invariants for the Greek and Latin languages are of special interest. They would help with the problem of authorship for some ancient texts.

11. STATISTICAL ANALYSIS OF SHOLOHOV'S WRITINGS

The reader has possibly noticed that we have not yet discussed one writer from the list of 28 Russian authors. This is Mihail Alexandrovich Sholohov, whose writings we analyze now. All our conclusions are based on the analysis of all his writings, published in 8 volumes in Moscow in 1962. Let us note that we do not claim any final statements here; we publish our results in the hope that they will be useful for scholars analysing Sholohov's writings.

It is well known that M.A.Sholohov attained a distinguished position in Russian and world literature. The Nobel Prize in 1965 also confirms his international recognition.

Nevertheless, for several decades, in Russia and in the West, some experts have expressed doubts about the following question: is Sholohov the real author of the novel the Quiet Don? Some experts claim that the real author was the Cossack writer Fedor Dmitrievich Krukov (1870-1920), who served in the White Army and died of typhus in 1920. Let us underline that we do not intend here either to support these experts or to agree with their opponents. We only want to present the results of our statistical analysis, and we hope that they will be of interest to all experts working with Sholohov's and Krukov's texts.

Let us briefly recall the essence of the discussion. It is well known that during the First World War and the civil war in Russia Fedor Krukov wrote many stories about the Don Cossacks. According to the author writing under the pseudonym "D" (whose analysis of the Quiet Don [11] was published in 1974), after Krukov's death his manuscript of the Quiet Don, and possibly some other materials, were taken by Sholohov, who made some corrections, toned down the Cossack nationalism of Krukov's original in favor of more pro-Soviet views, and then published the novel under his own name.

Further, "D" claimed that the language and style of Krukov's texts are considerably similar to the language and style of the Quiet Don. In the opinion of "D", about 95% of books I and II of the Quiet Don, and 68-70% of books III and IV, belong to Krukov, and Sholohov can be considered, at best, only a coauthor. According to "D", it should also be mentioned that Krukov was certainly a Cossack writer and consequently knew the life and history of the Cossacks very well.

In the Introduction to the book of "D", A.Solzhenizyn wrote: "From its appearance in 1928, the Quiet Don set off a sequence of riddles which remain unexplained to this day. A situation without precedent in world literature arose in the literary community. A 23-year-old debutant created a work based on material that far exceeds his own life experience and his educational level (four classes only). A young commissar, then a Moscow laborer and a functionary in the house-manager's office on Krasnaya Presnya, he published a work which could have been prepared only after long contact with many layers of the pre-revolutionary Don community. The novel startled its readers by its deep understanding of the everyday life and psychology of these layers".

The statements of "D" were sharply criticized by Ermolaev [15], [16]. On the other hand, the conclusions of "D" were supported by A.Solzhenizyn and R.Medvedev. Incidentally, as the authors of the book [18] note, in May 1990 N.N.Struve, the editor of the book "Stremya Tyhogo Dona", revealed who was behind the pseudonym "D". It turned out that the author "D" is the well-known literary scholar I.N.Medvedeva-Tomashevskaya ([18], p.7).

In 1991 the book of A.G.Makarov and S.E.Makarova "Zvetok-tatarnik. (Flower-Tartar. To the source of The Quiet Don)" [18] was published. The authors analyzed the language of the novel and its historical and chronological contents, and compared the novel with the writings of other authors. A.G.Makarov and S.E.Makarova conclude that Sholohov adapted and published under his own name the work of another author. It should be mentioned that Sholohov was first accused of plagiarism in 1928, when he published the first two books of the Quiet Don. The claim that the real author is Krukov was also made by Krukov's relatives, but their claims were rejected for lack of direct proof.

It is clear that rumors and suspicions cannot play the role of proof. We need real facts and a special statistical analysis. In order to clarify the situation, two Swedish and two Norwegian scientists analyzed Sholohov's texts by computer methods [10], [13], [14]. See details in the book [10], published in 1984, and in Russian translation in 1989. On the basis of an analysis of different frequency characteristics (the length of sentences, the length of words, et cetera), they came to the conclusion that all parts of the Quiet Don belong to one and the same author.

But, as we have checked in our experiments, these (and all analogous) parameters either are not stable or are insufficiently sensitive for recognizing authors. This effect is easy to see when one compares the length of sentences and of words throughout all of Sholohov's writings. We used the 8-volume edition of 1962.

We see that the mean number of words per sentence fluctuates considerably (thus it is no invariant). On the other hand, the mean number of syllables per word is more or less constant, and one could draw a conclusion "in favor of Sholohov". But such a conclusion would be too hasty and wrong because, as we already know, THIS PARAMETER IS NOT AN AUTHOR'S INVARIANT.

Let us note that the scientists mentioned above (see [10]) did not find our invariant (the percentage of all form-words). Nor did they find any other real author's invariant whose efficiency had been checked by statistical analysis of a large number of different Russian writers.

Because we have a sufficiently effective author's invariant, it is natural to apply our method to the case of Sholohov. After analyzing all the works on this subject that were accessible to us, we realized that experts usually tried only to compare individual writings of Sholohov and Krukov on the basis of different frequency characteristics. Then they formulated arguments in favor of either Sholohov or Krukov (or some other candidates).

But these experts usually did not pose the basic question: are the frequency characteristics they used real author's invariants? It is senseless to analyze a plagiarism problem without such an invariant. As a first step we need to find a real author's invariant by analyzing several dozen different writers (as we did in our experiment). Only after this can we apply the stable frequency characteristics so found to the solution of plagiarism problems. In other words, first we need to construct the "tools", and only then apply them in practice.

This is our method of analysis. Because we have found a stable author's invariant, we can apply it to the Quiet Don problem. The result obtained is very interesting. The percentage of form-words in Sholohov's writings turns out to be so DIFFERENT that we were forced to introduce "two Sholohovs", in other words, to separate his writings into two groups, which were possibly written by two authors. We conditionally named these authors "Sholohov I" and "suspected Sholohov II". The exact result is given in Fig.5 and in the following Table.

Fig.5

Sholohov's writings                                                      Frequency of form-words (%)

Early stories                                                            22,46
Quiet Don, books I and II and beginning of book III
  (parts 1-5 and beginning of part 6)                                    19,55
Quiet Don, continuation of book III and whole book IV
  (continuation of part 6 and parts 7-8)                                 22,69
Podnyataya Celina, books I and II                                        23,07
Latest stories and novels                                                24,37
Essays, papers, speeches                                                 23,35

See the more detailed Table at the end of the paper. From the Table we can draw three important conclusions.

1) We can attribute to Sholohov I the following texts:
a) his early stories,
b) the last piece of part 6 and the final parts 7 and 8 of the Quiet Don,
c) all his subsequent writings, i.e. Podnyataya Celina and the late novels and stories.

2) WE SHOULD ATTRIBUTE TO SUSPECTED SHOLOHOV II PARTS 1,2,3,4,5 AND THE BEGINNING OF PART 6 OF THE QUIET DON.

3) Part 6 of the Quiet Don occupies a "middle position" between the writings of Sholohov I and suspected Sholohov II. Its first pages (about 100 pages) evidently belong, from the point of view of form-word frequency, to suspected Sholohov II. But the other pages of part 6 belong to Sholohov I.

It is completely clear from the Table and Fig.9 that the language style of Sholohov's EARLY stories (1924-1927) practically coincides (from the point of view of form-word frequency) with the style of the LAST parts 7 and 8 of the Quiet Don and of all Sholohov's other writings, which were written and published later.

Fig.9

In parts 1,2,3,4,5 and in the beginning of part 6 of the Quiet Don, the percentage of form-words is about 19,55. But in all Sholohov's other writings, early and late, this characteristic equals 23,03%.

The difference of 3,48% between the values of the author's invariant for Sholohov I and suspected Sholohov II (Fig.9) IS SO LARGE that it is impossible to neglect it. IT IS VERY HARD TO ATTRIBUTE THESE TEXTS TO ONE AND THE SAME AUTHOR.
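The arithmetic behind this comparison can be checked with a short Python sketch. The group values are copied from the Table above and the reference value 23,03 is taken from the text as given; how it was averaged over the groups is not stated in this section, so the sketch does not recompute it. The variable names and the shortened group labels are ours.

```python
# Form-word percentages from the Table (decimal commas written as points).
GROUPS = {
    "Early stories": 22.46,
    "Quiet Don, parts 1-5 and beginning of part 6": 19.55,
    "Quiet Don, rest of part 6 and parts 7-8": 22.69,
    "Podnyataya Celina, books I and II": 23.07,
    "Latest stories and novels": 24.37,
    "Essays, papers, speeches": 23.35,
}

# Value quoted in the text for all writings outside parts 1-5 and the
# beginning of part 6 (taken as given, not recomputed here).
SHOLOHOV_I_VALUE = 23.03

# Signed gap of every group from the quoted Sholohov I value.
gaps = {name: round(value - SHOLOHOV_I_VALUE, 2) for name, value in GROUPS.items()}

# The group lying farthest from that value.
most_distant = max(gaps, key=lambda name: abs(gaps[name]))

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f}")
print("Most distant group:", most_distant)
```

The gap printed for parts 1-5 and the beginning of part 6 reproduces the difference of 3,48 quoted above.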

OUR CONCLUSION

THE STATISTICAL RESULTS OBTAINED ON THE BASIS OF THE AUTHOR'S INVARIANT CONFIRM THE CONJECTURE THAT PARTS 1,2,3,4,5 AND A CONSIDERABLE PIECE OF PART 6 OF THE QUIET DON WERE NOT WRITTEN BY M.A.SHOLOHOV.

One may perhaps object that Sholohov suddenly changed his style when he wrote parts 1-5 of the Quiet Don. Perhaps he started his career as a writer with one value of the author's invariant, then "changed style" in parts 1-5 of the Quiet Don, and then returned to his PREVIOUS manner of writing.

Perhaps. But in this case we are forced to conclude that Sholohov is a remarkably unique and strange phenomenon in the whole of Russian literature. And this mysterious phenomenon is so striking that it requires special analysis, because Sholohov "succeeded" in an effect which turned out to be impossible for all the Russian writers of the XVII-XX centuries who were analysed in our experiment. He drastically changed his own author's invariant!

Indeed, as we saw, all 27 Russian writers with LARGE novels, from different centuries and different literary schools, did not change their style, namely their author's invariant (frequency of form-words), during their whole lives. But Sholohov, within a period of one or two years, suddenly (and very seriously) "changed his style". Moreover, he was able to keep this seriously "changed value" of the author's invariant throughout the huge parts 1,2,3,4,5 of the Quiet Don. But we noted before that the percentage of form-words is an integral, massive parameter, and consequently is (to a large extent) an "unconscious" parameter of the personality. As our numerical experiment has demonstrated, it is practically impossible to control this parameter on the "conscious level". At least, it was impossible for all 27 Russian writers under consideration.

The example of Chehov's change of style (see above) is of a quite different nature, because in that case we see the difference between Chehov's SMALL stories and his LARGE writings. But in the case of Sholohov we remain within the range of his LARGE writings.

Let us decompose the total number of form-words in Sholohov's writings into the "sum" of prepositions, conjunctions and particles, considered separately. It then turns out that Sholohov I has only slightly fewer PREPOSITIONS than suspected Sholohov II. But Sholohov I has considerably more CONJUNCTIONS and PARTICLES than suspected Sholohov II. See the Table.

This fact again demonstrates the DEEP DIFFERENCE between the texts of Sholohov I and the texts of suspected Sholohov II.
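The decomposition of the form-word frequency into its three components can be sketched in Python as follows. This is a minimal illustration, not the procedure used for the Table: the dictionary CATEGORY is a tiny hypothetical lexicon of transliterated form-words, the function name is ours, and the text is assumed to be already split into words.

```python
from collections import Counter
from typing import Dict, Sequence

# Hypothetical miniature lexicon: form-word -> grammatical category.
CATEGORY = {
    "v": "preposition",
    "na": "preposition",
    "i": "conjunction",
    "no": "conjunction",
    "ne": "particle",
    "zhe": "particle",
}


def category_percentages(words: Sequence[str]) -> Dict[str, float]:
    """Percentage of prepositions, conjunctions and particles among all
    words of the sample. The three values add up to the total form-word
    percentage, because every form-word has exactly one category here."""
    if not words:
        raise ValueError("empty sample")
    counts = Counter(CATEGORY[w.lower()] for w in words if w.lower() in CATEGORY)
    return {
        category: 100.0 * counts[category] / len(words)
        for category in ("preposition", "conjunction", "particle")
    }
```

Applied to two groups of texts, the three percentages can be compared component by component, which is the kind of comparison made above between Sholohov I and suspected Sholohov II.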

We should mention the good correspondence between our result and the independent conclusion reached by the expert "D" on the basis of quite different arguments, namely that books I and II and the beginning of book III were not written by Sholohov. On the other hand, "D" claimed that about 70% of books III and IV were also not written by Sholohov. But we have found that book IV and a considerable part of book III are nevertheless characterized by Sholohov's value of the author's invariant.