V.P.Fomenko, T.G.Fomenko, A.T.Fomenko
AUTHOR'S INVARIANT FOR RUSSIAN LITERARY TEXTS

4. NUMERICAL EXPERIMENT. LIST OF ANALYZED PARAMETERS.

To find the author's invariant, the "unconscious parameter" which is outside the author's control, we have analyzed the following quantitative characteristics of the texts.

1) LENGTH OF THE SENTENCES, i.e. the mean number of words in a sentence (calculated for each consecutive sample).

2) LENGTH OF THE WORDS, i.e. the mean number of syllables in a word (calculated for each consecutive sample).

3) GENERAL FREQUENCY OF USE OF THE FORM-WORDS, NAMELY PREPOSITIONS, CONJUNCTIONS, PARTICLES, i.e. the percentage of all the mentioned form-words in each consecutive sample.

4) FREQUENCY OF NOUNS, i.e. their percentage in each consecutive sample.

5) FREQUENCY OF VERBS, i.e. their percentage in each consecutive sample.

6) FREQUENCY OF ADJECTIVES, i.e. their percentage in each consecutive sample.

7) FREQUENCY OF PREPOSITION "V" (engl. in), i.e. its percentage in each consecutive sample.

8) FREQUENCY OF PARTICLE "NE" (engl. not), i.e. its percentage in each consecutive sample.

9) NUMBER OF FORM-WORDS IN A SENTENCE, i.e. the mean number of prepositions, conjunctions, particles per sentence.
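As an illustration of how parameters No.1 and No.2 could be computed by machine, here is a minimal Python sketch. It is not the procedure used in the experiment, which was carried out by hand. It assumes that the text is given in Cyrillic, that a sentence ends at ".", "!" or "?", and that every vowel letter of a Russian word forms one syllable. The function names are hypothetical. The part-of-speech parameters Nos.4-6 would need a morphological analyzer and are not shown.

```python
import re

# Assumption: Cyrillic input; one vowel letter is counted as one syllable.
RUSSIAN_VOWELS = set("аеёиоуыэюя")


def split_sentences(text):
    """Naive splitter: a sentence ends at '.', '!' or '?'."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def split_words(sentence):
    """Words are runs of letters, with hyphens allowed inside a word."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", sentence.lower())


def mean_sentence_length(text):
    """Parameter No.1: mean number of words per sentence (non-empty text)."""
    sentences = [split_words(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]
    return sum(len(s) for s in sentences) / len(sentences)


def mean_word_length(text):
    """Parameter No.2: mean number of syllables per word (non-empty text)."""
    words = [w for s in split_sentences(text) for w in split_words(s)]
    syllables = sum(sum(ch in RUSSIAN_VOWELS for ch in w) for w in words)
    return syllables / len(words)
```

In practice each function would be applied to one sample at a time, giving one point of the frequency graph per sample.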

Some of these parameters were analyzed earlier by other researchers. But parameter No.3 - the frequency of all form-words - is new, in our opinion. We have suggested it on the basis of the general idea of volume functions for narrative texts.

These parameters Nos.1-9 are quite different. Our new parameter No.3 is characterized by its "integral property". It is "massive", because we calculate the total frequency (percentage) of ALL FORM-WORDS in the samples! Let us recall that the total number of form-words is sufficiently large. This is the reason why it is very hard (in our opinion practically impossible) for the author to control parameter No.3 "on the conscious level". The writer (author) can more or less control, for example, the length of his sentences in the book. But it is very hard to imagine a writer who can keep in his mind the frequency of all the form-words which he uses.

The parameters No.7 - frequency of the preposition "v" (engl. in) - and No.8 - frequency of the particle "ne" (engl. not) - describe the distribution of individual form-words. They are not so "massive" as the general, integral parameter No.3. We have included the parameters Nos.7,8 in our list to check whether they are stable and whether they can be used as author's invariants. As we have found, the answer is negative.

The parameter No.9 - the mean number of form-words in a sentence - is in some sense integral, "massive", but the numerical experiments show that this parameter depends on the length of the sentences. And this length is very unstable and can vary within wide limits, without any stabilization.

We based our experiment on the method of sampling from a general population. The value of the step, i.e. the interval between two neighboring samples, was varied from 10 to 60 standard book pages for books of large volume. The size of the sample was also varied. In the previous papers of many researchers, a sample size of about 1000 words was usually chosen. We started from a size of 2000 words. Then we increased the size of the sample step by step: 4000, 8000, 16000 words.

Our numerical experiment showed that a further increase of the size of the sample is not necessary, because the author's invariant was found for a sample size of 16000 words.

Of course, when we analyzed texts of relatively small volume, the value of the step and the size of the samples were also decreased. By the way, we have found that the value of the step has practically no influence on the final results. On the other hand, the size of the sample is really very important.
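The sampling scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the step is measured here in words rather than in book pages, and it is interpreted as the gap between the end of one sample and the start of the next. The function name `consecutive_samples` is hypothetical.

```python
def consecutive_samples(words, sample_size, step):
    """Yield consecutive samples of `sample_size` words.

    Assumption: `step` is the number of words skipped between the end of
    one sample and the start of the next one.
    """
    start = 0
    while start + sample_size <= len(words):
        yield words[start:start + sample_size]
        start += sample_size + step


# Sample sizes used in the experiment, in words.
SAMPLE_SIZES = [2000, 4000, 8000, 16000]
```

Each parameter is then calculated on every sample, and the whole procedure is repeated for each sample size in `SAMPLE_SIZES`.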

We have chosen the following principle as a simple criterion for stabilization. We increased the size of the sample until we found a parameter whose mean deviation from its mean value within the texts of each author becomes considerably smaller than the amplitude of the oscillations of this parameter between different authors.

In other words, we calculated for each author the deviation of the parameter from its mean value. Then we calculated the mean size of these deviations over all authors included in the experiment. We tried to find a parameter whose "mean size of deviations" is considerably smaller than the difference between the maximal and minimal values of the parameter on the set of authors under consideration.
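The criterion can be written down as a short Python sketch. Several details are assumptions, since the text does not fix them: the deviation is taken as the mean absolute deviation, the maximal and minimal values are taken over the authors' mean values, and "considerably smaller" is expressed by a hypothetical threshold `ratio`. All function names are illustrative.

```python
from statistics import mean


def mean_within_author_deviation(values_by_author):
    """Mean, over authors, of the mean absolute deviation of the parameter
    from that author's own mean value (assumed form of the deviation)."""
    deviations = []
    for values in values_by_author.values():
        m = mean(values)
        deviations.append(mean([abs(v - m) for v in values]))
    return mean(deviations)


def between_author_range(values_by_author):
    """Difference between the largest and the smallest author mean."""
    means = [mean(values) for values in values_by_author.values()]
    return max(means) - min(means)


def separates_authors(values_by_author, ratio=0.2):
    """True if the within-author deviation is below `ratio` times the
    between-author range. The threshold 0.2 is a hypothetical choice."""
    within = mean_within_author_deviation(values_by_author)
    return within < ratio * between_author_range(values_by_author)
```

Here `values_by_author` maps each author to the list of parameter values obtained on his consecutive samples of one fixed size.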

5. LIST OF RUSSIAN AUTHORS AND THEIR TEXTS ANALYZED IN OUR EXPERIMENT

We have chosen (in a more or less random way) 23 Russian writers of the XVIII, XIX and XX centuries, who wrote in Russian and who left texts of large size. See the list below. For each of these writers we have analyzed more or less all of his main books. It turned out that the results obtained do not depend on the volume of the texts, provided that these texts are sufficiently large. Now let us list the texts analyzed in our experiment. We include in the list the dates of the writer's life and the date when the text was written (sometimes we mention the date of the first publication of the text). Then we mention the place and the date of the modern publication of the text which we have used.

RUSSIAN AUTHORS OF XVIII CENTURY

1) CHULKOV M.D. (1743-1792) - novel "Prigojaya povariha" (1770), Moscow, 1971.

2) NOVIKOV N.I. (1744-1818) - satiric magazine "Jivopisec", published in 1772-1773, Moscow, 1971.

3) FONVIZIN D.I. (1745-1792) - "Zapiski pervogo puteshestviya" (written in 1777-1778), novel "Povestvovanie gluhogo i nemogo" (published in 1783), novel "Kalisfen" (published in 1786), novel "Drug chestnych ludei ili starodum" (published in 1830), memoir "Chistoserdechnoe priznanie v delah i pomyshleniyach" (published in 1830), Moscow, 1971.

4) RADISCHEV A.N. (1749-1802) - novel "Puteshestvie iz Peterburga v Moskvu" (published in 1790), Moscow, 1971.

5) KARAMZIN N.M. (1766-1826) - "Istoria Gosudarstva Rossii'sckogo" (written in 1816-1826), novel "Bednaya Liza" (published in 1792), novel "Ostrov Bernholm" (published in 1794), novel "Marfa Posadskaya" (published in 1803), Moscow, 1971.

6) KRYLOV I.A. (1769-1844) - novel "Kaib" (published in 1792), "Pohval'naya rech'" (published in 1792), Moscow, 1971.

RUSSIAN AUTHORS OF XIX CENTURY

7) GOGOL' N.V. (1809-1852) - novels: "Vechera na chotore bliz Dikan'ki", "Sorochinskaya yarmarka", "Vecher nakanune Ivana Kupala", "Mai'skaya noch' ili utoplenniza", "Propavshaya gramota", "Noch' pered Rojdestvom", "Strashnaya mest'", "Ivan Ivanovich i ego tetushka", "Zakoldovannoe mesto" (published in 1831-1832), novels: "Mirgorod", "Starosvetskie pomeschiki", "Taras Bul'ba", "Vii'", "Povest' o tom, kak possorilis' Ivan Ivanovich s Ivanom Nikiforovichem" (published in 1835), "Povesti" (Peterburgskie): "Nevskii' Prospekt", "Nos", "Portnoi'", "Shinel'", "Kolyaska", "Zapiski sumashedshego", "Rim" (published in 1833-1842), poem "Mertvye dushi" (published in 1840), Moscow, 1959, 1971.

8) GERZEN A.I. (1812-1870) - memoir "Byloe i dumy" (published in 1852-1868), Moscow, 1969.

9) GONCHAROV I.A. (1812-1891) - novel "Obyknovennaya istoria" (published in 1847), novel "Oblomov" (published in 1859), novel "Obryv" (published in 1869), Moscow, 1959.

10) TURGENEV I.S. (1818-1883) - "Zapiski ohotnika" (written in 1855-1856), novel "Rudin" (written in 1855-1856), novel "Dvoryanskoe gnezdo" (written in 1859), novel "Nakanune" (written in 1860), novel "Otzy i deti" (written in 1862), Moscow, 1961.

11) MEL'NIKOV-PECHERSKII' P.I. (1818-1883) - "Krasil'nikovy" (travelling story, 1852), story "Dedushka Polikarp" (written in 1857), story "Poyarkov" (written in 1857), story "Starye gody" (written in 1857), novel "V lesah" (written in 1871-1875), Moscow, 1963.

12) DOSTOEVSKII' F.M. (1821-1881) - novel "Prestuplenie i nakazanie" (written in 1866), novel "Brat'ya Karamazovy" (written in 1879-1880), Moscow, 1970-1973.

13) SALTYKOV-SCHEDRIN M.E. (1826-1889) - "Istoria odnogo goroda" (written in 1869-1870), novel "Gospoda Golovlevy" (written in 1875-1880), Moscow, 1975.

14) LESKOV N.S. (1831-1895) - novel "Ledi Makbet Mzenskogo uezda" (written in 1864), novel "Voitel'niza" (written in 1866), novel "Zapechatlennyi' angel" (written in 1873), novel "Ocharovannyi' strannik" (written in 1873), story "Zheleznaya volya" (written in 1876), story "Odnodum" (written in 1879), story "Nesmertel'nyi' golovan" (written in 1880), story "Levsha" (written in 1881), story "Tupei'nyi' hudozhnik" (written in 1883), story "Chelovek na chasah" (written in 1889), story "Zimnii' den'" (written in 1894), Moscow, 1973.

15) TOLSTOI' L.N. (1828-1910) - novel "Detstvo" (written in 1852), novel "Otrochestvo" (written in 1854), novel "Yunost'" (written in 1856), story "Nabeg" (written in 1852), novel "Utro pomeschika" (written in 1856), novel "Kazaki" (written in 1863-1869), novel "Anna Karenina" (written in 1873-1877), novel "Voskresenie" (written in 1899), Moscow, 1960-1964.

RUSSIAN AUTHORS OF XX CENTURY

16) GOR'KII' A.M. (1868-1936) - story "Makar Chudra" (written in 1892), story "Ded Archip i Len'ka" (written in 1894), story "Staruha Izergil'" (written in 1894-1895), story "Oshibka" (written in 1895), story "Odnazhdy noch'iu" (written in 1895), story "Ozornik" (written in 1896), story "Tovarischi" (written in 1897), story "Suprugi Orlovy" (written in 1897), story "Byvshie ludi" (written in 1897), story "Mal'va" (written in 1897), story "Skuki radi" (written in 1897), story "Varen'ka Olesova" (written in 1898), story "Druzhki" (written in 1898), story "Chitatel'" (written in 1898), Moscow, 1939. Then: novel "Detstvo" (written in 1912-1913), novel "V ludyah" (written in 1923), novel "Delo Artamonovyh" (written in 1925), Moscow, 1967.

17) BUNIN I.A. (1870-1953) - story "Antonovskie yabloki" (written in 1900), novel "Derevnya" (written in 1909-1910), novel "Suhodol" (written in 1911), story "Zahar Vorobjev" (written in 1911-1912), story "Brat'ya" (written in 1916), story "Gospodin iz San-Francisco" (written in 1915), story "Bogj'e derevo" (written in 1913), story "Natali" (written in 1941), story "Chistyi' ponedel'nik" (written in 1944), Moscow, 1973.

18) NOVIKOV-PRIBOI' A.S. (1877-1944) - story "Po-temnomu" (written in 1911), story "Boi'nya" (written in 1906), story "Poshutili" (written in 1913), story "Porchennyi'" (written in 1912), novel "More zovet" (written in 1919), novel "Kapitan pervogo ranga" (written in 1936-1944), novel "Zusima" (written in 1905-1941), Moscow, 1963.

19) FEDIN K.A. (1892-1977) - novel "Goroda i gody" (written in 1924), novel "Brat'ya" (written in 1928), Moscow, 1974.

20) LEONOV L.M. (born 1899) - novel "Russkii' les" (written in 1953), Moscow, 1974.

21) SHISHKOV V.Ya. (1873-1945) - novel "Tai'ga" (written in 1916), novel "Pei'nus-ozero" (written in 1931), novel "Ugrum-reka" (written in 1918-1932), Moscow, 1960.

22) FADEEV A.A. (1901-1956) - novel "Razgrom" (written in 1926), novel "Molodaya gvardiya" (written in 1945).

23) SHOLOHOV M.A. (1905-1984) - complete collection of the writings in 8 volumes, Moscow, 1962: early stories - volume 1, novel "Tihii' Don" (Quiet Don) - volumes 2-5, novel "Podnyataya zelina" - volumes 6-7, stories - volume 8.

6. NUMERICAL EXPERIMENT

For each of the writers listed above, we have analyzed all the mentioned texts. Namely, through all these books and writings we have calculated the values of all nine numerical linguistic parameters mentioned above. In 1974-1977 we constructed a huge set of frequency graphs for all consecutive samples of the size 2000, 4000, 8000, 16000 words. This large work was done "by hand", without a computer, because we did not have electronic versions of these Russian writings. We are not sure that today there are suitable electronic versions of these Russian texts. Of course, for the purpose of future investigations, such electronic texts will be very helpful.

The principle of the construction of the frequency graphs was as follows. We marked along the horizontal line the numbers of consecutive samples, and along the vertical line - the values of corresponding linguistic parameters. As a result, we obtain some curve on the plane. The fluctuations of the parameters, their deviations from the mean value, were estimated by the following formula:

d = (H.max - H.min) / H.mean

where H.max, H.min, H.mean - maximal, minimal and mean values of the parameter.
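For readers who wish to repeat the calculation by machine, the fluctuation measure d can be computed by a few lines of Python. This is a direct transcription of the formula. The function name `relative_fluctuation` is hypothetical, and the input is assumed to be the non-empty list of values of one parameter on the consecutive samples of one author.

```python
def relative_fluctuation(values):
    """d = (H.max - H.min) / H.mean for one curve of parameter values."""
    h_max = max(values)
    h_min = min(values)
    h_mean = sum(values) / len(values)
    return (h_max - h_min) / h_mean
```

A flat curve gives d = 0. The larger the swings of the parameter relative to its mean value, the larger d becomes.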

7. RESULTS OF EXPERIMENT

It turned out that all the parameters listed above, except for parameter No.3, demonstrate the following behavior when we increase the size of the samples. Either they do not stabilize, or there is stabilization, but for all the authors this "stable value" is practically the same. In other words, in this last case, "all the authors are glued", i.e. it is impossible to separate them from one another on the basis of such a parameter. In such a case this "common stable value" is a characteristic of the Russian language, but not of the authors.

The typical example of the first situation (when we do not obtain any stabilization as the size of the sample grows) is the evolution of the parameter No.1 - the number of words in sentences (Fig.2). It is clearly seen that even for a sample size of 16000 words, all the curves are still chaotic, completely random, and their fluctuations are too large.

Fig.2

The typical example of the second situation (when we obtain "the gluing of all authors") is the evolution of the parameter No.2, namely the number of syllables in the words (Fig.3). Here, for a sample size of 16000 words, we see some stabilization. The curves become more or less flat, similar to "straight lines", but all these curves are practically identical, they amalgamate into one curve. Thus, we cannot distinguish different authors on the basis of this parameter.

Fig.3

The analogous picture appears also for parameters Nos.4,5,6,7,8,9. For example, the curves for parameter No.8 do not stabilize and are mixed. The behavior of the parameter No.8 is similar to that of the parameter No.2. In other words, here we got stabilization (for a large size of the samples), but the curves for all different authors became so close that they oscillate near one and the same value. Consequently, this "limit value" is a characteristic of the Russian language itself, but not of the authors. The individual specifics of different authors do not influence this parameter after its stabilization.

Summary: the parameters Nos.1,2,4,5,6,7,8,9 do not give any real basis for the solution of the problem of finding the author's invariant.

8. FREQUENCY OF FORM-WORDS IS THE AUTHOR'S INVARIANT

A remarkable exception is the parameter No.3 - the frequency of all form-words: prepositions, conjunctions, particles. The evolution of this parameter with respect to the growth of the size of the samples is demonstrated in Figs.4,5,6,7. In our first experiment we included the following words in the list of form-words.
Fig.4
Fig.5
Fig.6
Fig.7

PREPOSITIONS: "v" (engl. in), "na" (engl. on), "s" (engl. with), "za" (engl. after, behind), "k" (engl. to), "po" (engl. around), "iz" (engl. from), "u" (engl. by), "ot" (engl. against, from, of), "dlya" (engl. for, behalf), "vo" (engl. in), "bez" (engl. without), "do" (engl. before), "o" (engl. about, of, on), "cherez" (engl. across, via), "so" (engl. with, from, against, off), "pri" (engl. at, by, under), "pro" (engl. about, for), "ob" (engl. about, of), "ko" (engl. to, toward, near, unto), "nad" (engl. above, over), "iz-za" (engl. for), "iz-pod" (engl. from under), "pod" (engl. under, below, bottom).

CONJUNCTIONS: "i" (engl. and), "chto" (engl. what, how), "no" (engl. but), "a" (engl. while, and, but, yet, if), "da" (engl. and, but), "chotya" (engl. though), "kogda" (engl. when), "chtoby" (engl. that, in order), "esli" (engl. if), "tozhe" (engl. also), "ili" (engl. or), "to est'" (engl. that is), "zato" (engl. but, on the other hand), "budto" (engl. as if, as though).

PARTICLES: "ne" (engl. not), "kak" (engl. so), "zhe" (engl. and, as for, the same), "dazhe" (engl. even), "by" (engl. who/ever, what/ever, when/ever), "li" (engl. whether, if), "tol'ko" (engl. only), "vot" (engl. there, here), "to" (engl. just, precisely), "ni" (engl. not a), "lish'" (engl. only, as soon as), "ved'" (engl. you see, you know), "von" (engl. there, over there), "to-est'" (engl. that is), "nibud'" (for example, "kak-nibud'" = engl. some, some kind of, about), "uzhe" (engl. yet), "libo" (engl. either... or).

In total, we have considered 55 different form-words. Though this list is incomplete, it turned out to be enough to distinguish authors.
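With the list above, the parameter No.3 for one sample can be computed as in the following Python sketch. It is an illustration, not the original hand procedure. It assumes that the sample is already split into words and transliterated in the same way as the list. For simplicity it matches only single tokens, so the two-word conjunction "to est'" and the suffix "nibud'" are left out and would need separate handling. The names `FORM_WORDS` and `form_word_frequency` are hypothetical.

```python
# Single-token form-words from the list above (transliterated).
PREPOSITIONS = {
    "v", "na", "s", "za", "k", "po", "iz", "u", "ot", "dlya", "vo", "bez",
    "do", "o", "cherez", "so", "pri", "pro", "ob", "ko", "nad", "iz-za",
    "iz-pod", "pod",
}
CONJUNCTIONS = {
    "i", "chto", "no", "a", "da", "chotya", "kogda", "chtoby", "esli",
    "tozhe", "ili", "zato", "budto",
}
PARTICLES = {
    "ne", "kak", "zhe", "dazhe", "by", "li", "tol'ko", "vot", "to", "ni",
    "lish'", "ved'", "von", "to-est'", "uzhe", "libo",
}
FORM_WORDS = PREPOSITIONS | CONJUNCTIONS | PARTICLES


def form_word_frequency(words):
    """Parameter No.3: percentage of form-words among the words of a sample.

    Assumption: `words` is a non-empty list of transliterated tokens.
    """
    hits = sum(1 for w in words if w.lower() in FORM_WORDS)
    return 100.0 * hits / len(words)
```

For example, for the eleven tokens of "ya ne znayu chto i kak on skazal v tot den'" the function counts the five form-words "ne", "chto", "i", "kak", "v". A real sample would contain 2000 to 16000 words.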

IMPORTANT EXPERIMENTAL FACT

1) For all authors listed above (except one writer, whose texts will be discussed later), the frequency of form-words, i.e. the parameter No.3, becomes stable for a sample size equal to 16000 words. This means that the corresponding curves become practically "straight lines". This is true for 22 (out of 23) Russian writers with sufficiently large texts, chosen in a random way (Fig.7). In other words, the dynamic parameter No.3 becomes practically constant through the texts of each author.

2) The difference between the maximal and minimal values of the parameter No.3 is considerably larger than the amplitude of its fluctuations inside the texts of one writer. Here we consider the minimum and maximum over all 23 writers under consideration. More concretely, the maximum value is equal to 27.5%, the minimum value is equal to 19% (Fig.7). It follows from this fact that the parameter No.3 really distinguishes many authors.

This is the reason why we can now call the parameter No.3 the AUTHOR'S INVARIANT.

This parameter can be used to check plagiarism, or for the attribution of unknown texts. But, of course, we need to be especially careful in the latter case, because there are writers with very close values of the author's invariant. For example, Fonvizin D.I. and Tolstoi' L.N. (see below). Besides this, for correct conclusions we need texts of sufficiently large volume.
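The caution about attribution can be illustrated by a small Python sketch. It lists every author whose invariant lies within a chosen tolerance of the value measured on an unknown text. If more than one author is returned, the invariant alone does not decide the authorship. The function name, the tolerance and all numerical values in the usage example are hypothetical and are not taken from the experiment.

```python
def candidate_authors(invariant_by_author, observed, tolerance):
    """Return the authors whose invariant (in percent) differs from
    `observed` by at most `tolerance`, closest first."""
    close = [
        (abs(value - observed), name)
        for name, value in invariant_by_author.items()
        if abs(value - observed) <= tolerance
    ]
    return [name for _, name in sorted(close)]


# Hypothetical values, for illustration only.
example = {
    "author_A": 19.0,
    "author_B": 23.4,
    "author_C": 23.6,
    "author_D": 27.5,
}
```

With these placeholder values, `candidate_authors(example, 23.45, 0.5)` returns both "author_B" and "author_C", so such a text could not be attributed on the basis of the invariant alone.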

The main corollary is the nontrivial statement about the existence of at least one author's invariant for Russian literary texts. It would be very interesting to continue these experiments in order to find other author's invariants.

Let us underline that such conclusions can be formulated only after large numerical experiments. Only when we have proved by experiment that some dynamic parameter really becomes stable inside the texts of a large collection of writers can we claim that we have discovered an author's invariant. Let us note that the list of the analyzed writers should be sufficiently large - several dozen authors.

It is interesting that our author's invariant (the general frequency of all form-words) "works" for writers of several centuries: our list includes authors from the XVIII century through the XX century.