Abstract
We examine stylochronometry, the problem of measuring change in linguistic style over time within an authorial canon, in relation to change in the language in general use over the same period. We take the works of two prolific authors from the 19th/20th century, Henry James and Mark Twain, and identify variables that change for them over time. We present a method of analysis that applies regression to linguistic variables in order to predict a temporal variable. To identify individual authors’ effects on the model, we compare the model based on the novelists’ works to a model based on a 19th/20th century American English reference set. We evaluate using \(R^2\) and root mean square error (RMSE), which indicates the average error in predicting the year. On the two-author data, we achieve an RMSE of \(\pm 7.2\) years on unseen data (baseline: \(\pm 13.2\)); on the larger reference set, our model obtains an RMSE of \(\pm 4\) years on unseen data (baseline: \(\pm 17\)).
Notes
- 1.
- 2.
Early-Wittgenstein may be stylistically as well as conceptually distinct from Late-Wittgenstein.
- 3.
Analysis of variance (ANOVA) is a collection of methods developed by R.A. Fisher to analyze differences within and between groups. Principal component analysis (PCA) is an unsupervised statistical technique that converts a set of possibly correlated variables into a new set of uncorrelated variables, the principal components.
- 4.
The coefficient of determination \(R^2\) indicates how well a model fits the observed data. It ranges from 0 to 1, with 0 indicating a poor fit and 1 a perfect one. When predictions are evaluated against held-out outcomes (test set), values can also be negative, in which case the mean of the data provides a better fit than the model.
- 5.
Distinctiveness Ratio: Measure of variability defined by the rate of occurrence of a word in a text divided by its rate of occurrence in another.
- 6.
Here, we only include the main works/novels for reasons of text length and genre homogeneity.
- 7.
- 8.
- 9.
The separate entries are created using the POS tags assigned by the tagger to the individual word entity in its context.
- 10.
This is without loss of generality to the bag-of-words analysis of texts in which sentence structures are not used subsequent to POS tagging.
- 11.
Lexical features are continuous here because we use relative frequencies.
- 12.
This applies if it is meaningful to count instances of the variable, as it is for token n-grams: such relativization does not apply, for example, to average word lengths.
- 13.
This is not to argue that complementary categories (e.g. relativized counts of features that are not shared between both authors over the entire duration or features that are never shared by the authors over the duration, etc.) are uninteresting. However, for this work we are addressing change in language shared by the two authors and relative to change in background language of their time, thinking that this provides an interesting perspective on their distinctiveness from each other and everyone else.
- 14.
The t-value measures the size of the difference between an observed sample statistic and its hypothesized population parameter, relative to the variation in the sample data. The further the t-value falls into either tail of the t-distribution, the stronger the evidence against the null hypothesis that there is no difference between the hypothesized and the observed value.
- 15.
In this case, the Akaike information criterion (AIC) is used to evaluate the model: \(\mathrm{AIC} = -2 \log L + 2k\), where \(L\) is the likelihood and \(k\) the number of estimated parameters in the model. Thus, AIC rewards goodness-of-fit, but penalizes the number of parameters in the model.
- 16.
All models reported here had \(\bar{R}^2\) values that were significant at \(p < 0.0001\), so we dispense with reporting this in each individual case.
- 17.
\(\mathrm{RMSE} = \sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}}\), where \(\hat{y}_t\) is the predicted and \(y_t\) the observed value for item \(t\).
- 18.
This might not be an entirely realistic scenario in that most predictors, even randomly selected ones, will bear some kind of relation with the response. However, in the case of the test set, the wrong predictors can also have a worse effect than the null-model, so this might be an acceptable approximation.
- 19.
The system reports estimates and predictions as decimals; we dispense with reporting these here, as texts were only ordered according to year rather than exact month, which renders those numbers meaningless. \(R^2\) and RMSE are on the basis of rounded versions of predictions.
- 20.
This can be tested using the variance inflation factor (VIF), which measures how much the variance of the estimated regression coefficients is inflated compared to the case where the predictors are not linearly related; a value of 1–4 indicates low correlation and 5–10 high correlation.
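The quantities defined in notes 4, 5, 15, 17 and 20 can be illustrated with a minimal sketch. It is written in Python with NumPy purely for illustration (the reference list points to R packages, so this is not the tooling used for the chapter). All function names, the toy years, and the mean-of-years baseline are assumptions introduced here and do not reproduce the chapter's data, models, or baseline.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values (note 17)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when the mean of y_true fits better (note 4)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def aic(log_likelihood, k):
    """Akaike information criterion: -2 log L + 2k (note 15)."""
    return -2.0 * log_likelihood + 2.0 * k


def distinctiveness_ratio(count_a, tokens_a, count_b, tokens_b):
    """Rate of a word in text A divided by its rate in text B (note 5)."""
    return (count_a / tokens_a) / (count_b / tokens_b)


def vif(X):
    """Variance inflation factor per column: 1 / (1 - R_j^2) (note 20).

    R_j^2 comes from regressing column j on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        factors.append(1.0 / (1.0 - r_squared(target, others @ coef)))
    return factors


# Toy example (hypothetical publication years, not taken from the chapter).
years = [1880, 1890, 1900, 1910]
predicted = [1882, 1888, 1903, 1909]
# Assumed reference point: always predict the mean year.
mean_baseline = [float(np.mean(years))] * len(years)

print("RMSE model / mean baseline:", rmse(years, predicted), rmse(years, mean_baseline))
print("R^2 model / constant guess:", r_squared(years, predicted), r_squared(years, [1900] * 4))
```

On the toy data, the constant guess of 1900 yields a negative \(R^2\), which is the situation note 4 describes: the mean of the observed years fits better than the prediction.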
References
Ayres, A.: The Wit and Wisdom of Mark Twain. Harper Collins, New York (2010)
Beach, J.W.: The Method of Henry James. Yale University Press, New Haven (1918)
Brooks, V.W.: The Ordeal of Mark Twain. William Heinemann, London (1922)
Can, F., Patton, J.M.: Change of writing style with time. Comput. Humanit. 38(1), 61–82 (2004)
Canby, H.S.: Turn West, Turn East: Mark Twain and Henry James. Biblo & Tannen Publishers, New York (1951)
Coleman, C.: The Treatise Lorenzo Valla on the Donation of Constantine: Text and Translation. First published 1922. Russell & Russell, New York (1971)
Daelemans, W.: Explanation in computational stylometry. In: Computational Linguistics and Intelligent Text Processing, pp. 451–462. Springer, New York (2013)
Davies, M.: The Corpus of Historical American English: 400 million words, 1810–2009. http://corpus.byu.edu/coha/, vol. 24, p. 2011 (2010). Accessed 24 Aug 2015
Frontini, F., Lynch, G., Vogel, C.: Revisiting the Donation of Constantine. In: Kibble, R., Rauchas, S. (eds.) Artificial Intelligence and Simulation of Behavior—Symposium: Style in Text, vol. 2008, pp. 1–9 (2008)
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
Hoover, D.L.: Corpus stylistics, stylometry, and the styles of Henry James. Style 41(2) (2007)
Jed Wing, M.K.C., et al.: caret: Classification and Regression Training. R package version 6.0–30, http://CRAN.R-project.org/package=caret (2014). Accessed 24 Aug 2015
Kemper, S., et al.: Language decline across the life span: findings from the Nun Study. Psychology and Aging 16(2), 227 (2001)
Makridakis, S., Wheelwright, S.C., Hyndman, R.J.: Forecasting Methods and Applications. Wiley, New York (2008)
Michalke, M.: koRpus: An R Package for Text Analysis. Version 0.04-40. http://reaktanz.de/?c=hacking&s=koRpus (2013). Accessed 24 Aug 2015
Pena, E.A., Slate, E.H.: gvlma: Global Validation of Linear Models Assumptions. R package version 1.0.0.2. http://CRAN.R-project.org/package=gvlma (2004). Accessed 24 Aug 2015
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. http://www.r-project (2014). Accessed 24 Aug 2015
Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, vol. 12, pp. 44–49. Manchester (1994)
Stamou, C.: Stylochronometry: stylistic development, sequence of composition, and relative dating. Lit. Linguist. Comput. 23(2), 181–199 (2008)
Acknowledgments
This research is supported by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Trinity College Dublin.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Klaussner, C., Vogel, C. (2015). Stylochronometry: Timeline Prediction in Stylometric Analysis. In: Bramer, M., Petridis, M. (eds) Research and Development in Intelligent Systems XXXII. SGAI 2015. Springer, Cham. https://doi.org/10.1007/978-3-319-25032-8_6
DOI: https://doi.org/10.1007/978-3-319-25032-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25030-4
Online ISBN: 978-3-319-25032-8
eBook Packages: Computer Science, Computer Science (R0)