Text as Data

Extracting Variables Using LSTM Networks

Authors

  • Hendrik Erz Institute for Analytical Sociology (IAS), Linköping University
  • Anastasia Menshikova Institute for Analytical Sociology (IAS), Linköping University

Keywords:

Text analysis, Natural Language Processing, LSTM, Gender, Artificial Intelligence, Machine Learning

Abstract

This article presents a new method that converts text directly into quantitative variables. Using neural networks of the "Long Short-Term Memory" type (LSTM, Hochreiter and Schmidhuber 1997), following Komninos and Manandhar (2016), it transforms latent information in text into nominal-scale variables. We test the method in a case study that examines whether persons read as female act less often than persons read as male (cf. also Garg et al. 2018). We show that persons read as female appear as actors less often overall than persons read as male. These first results indicate that the approach can convert text into data automatically and thereby open new avenues for quantitative social research.

References

Baur, Nina, and Jörg Blasius, eds. 2019. Handbuch Methoden der empirischen Sozialforschung. Wiesbaden: Springer Fachmedien Wiesbaden.

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major and Margaret Mitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. Virtual Event Canada: ACM https://dl.acm.org/doi/10.1145/3442188.3445922 (Accessed: 20 Apr. 2021).

Blei, David M., Andrew Y. Ng and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3:993–1022.

Bonikowski, Bart, Yuchen Luo and Oscar Stuhler. 2022. Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in U.S. Presidential Campaigns (1952–2020) with Neural Language Models. Sociological Methods & Research 51:1721–1787.

Breiman, Leo. 2001. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science 16:199–231.

Brown, Tom B. et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].

Chang, Kent K., and Simon DeDeo. 2020. Divergence and the Complexity of Difference in Text and Culture. Journal of Cultural Analytics 4(11):1–36.

Däubler, Thomas, and Kenneth Benoit. 2022. Scaling hand-coded political texts to learn more about left-right policy content. Party Politics 28:834–844.

Do, Salomé, Étienne Ollion and Rubing Shen. 2022. The Augmented Social Scientist: Using Sequential Transfer Learning to Annotate Millions of Texts with Human-Level Accuracy. Sociological Methods & Research doi: 00491241221134526.

Eisenstein, Jacob. 2018. Natural Language Processing.

Franzosi, Roberto. 1989. From Words to Numbers: A Generalized and Linguistics-Based Coding Procedure for Collecting Textual Data. Sociological Methodology 19:263–298.

Garg, Nikhil, Londa Schiebinger, Dan Jurafsky and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences 115:E3635–E3644.

Gentzkow, Matthew, Jesse M. Shapiro and Matt Taddy. 2019. Measuring Group Differences in High‐Dimensional Choices: Method and Application to Congressional Speech. Econometrica 87:1307–1340.

Gerow, Aaron, Yuening Hu, Jordan Boyd-Graber, David M. Blei and James A. Evans. 2018. Measuring discursive influence across scholarship. Proceedings of the National Academy of Sciences 115:3308–3313.

Grimmer, Justin, Margaret E. Roberts and Brandon M. Stewart. 2022. Text as data: a new framework for machine learning and the social sciences. Princeton, New Jersey; Oxford: Princeton University Press.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9:1735–1780.

Jurafsky, Daniel, and James H. Martin. 2020. Speech and Language Processing (DRAFT). https://web.stanford.edu/~jurafsky/slp3/ (Accessed: 1 Aug. 2023).

Knight, Carly. 2022. When Corporations Are People: Agent Talk and the Development of Organizational Actorhood, 1890–1934. Sociological Methods & Research doi: 00491241221122528.

Komninos, Alexandros, and Suresh Manandhar. 2016. Dependency Based Embeddings for Sentence Classification Tasks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1490–1500. San Diego, California: Association for Computational Linguistics https://www.aclweb.org/anthology/N16-1175 (Accessed: 13 Feb. 2021).

Kozlowski, Austin C., Matt Taddy and James A. Evans. 2019. The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings. American Sociological Review 84:905–949.

Lankford, Seamus, Haithem Alfi and Andy Way. 2021. Transformers for Low-Resource Languages: Is Féidir Linn! In: Proceedings of Machine Translation Summit XVIII: Research Track, 48–60. Virtual: Association for Machine Translation in the Americas https://aclanthology.org/2021.mtsummit-research.5 (Accessed: 1 Feb. 2023).

Lazer, David, and Jason Radford. 2017. Data ex Machina: Introduction to Big Data. Annual Review of Sociology 43:19–39.

Lebret, Rémi, David Grangier and Michael Auli. 2016. Neural Text Generation from Structured Data with Application to the Biography Domain. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1203–1213. Austin, Texas: Association for Computational Linguistics https://aclanthology.org/D16-1128 (Accessed: 31 Jan. 2022).

Levy, Omer, and Yoav Goldberg. 2014. Dependency-Based Word Embeddings. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 302–308. Baltimore, Maryland: Association for Computational Linguistics https://www.aclweb.org/anthology/P14-2050 (Accessed: 13 Feb. 2021).

Metz, Cade. 2016. An Infusion of AI Makes Google Translate More Powerful Than Ever. Wired, September https://www.wired.com/2016/09/google-claims-ai-breakthrough-machine-translation/ (Accessed: 1 Feb. 2023).

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26:3111–3119.

Mohr, John W. et al. 2020. Measuring Culture. New York: Columbia University Press.

Mohr, John W. 1998. Measuring Meaning Structures. Annual Review of Sociology 24:345–370.

Mohr, John W., Robin Wagner-Pacifici, Ronald L. Breiger and Petko Bogdanov. 2013. Graphing the grammar of motives in National Security Strategies: Cultural interpretation, automated text analysis and the drama of global politics. Poetics 41:670–700.

Mosteller, Frederick, and David L. Wallace. 1963. Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers. Journal of the American Statistical Association 58:275–309.

Nelson, Laura K. 2021. Leveraging the alignment between machine learning and intersectionality: Using word embeddings to measure intersectional experiences of the nineteenth century U.S. South. Poetics 88:1–14.

Nelson, Laura K., Derek Burk, Marcel Knudsen and Leslie McCall. 2018. The Future of Coding: A Comparison of Hand-Coding and Three Types of Computer-Assisted Text Analysis Methods. Sociological Methods & Research doi: 0049124118769114.

Pennington, Jeffrey, Richard Socher and Christopher Manning. 2014. Glove: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Doha, Qatar: Association for Computational Linguistics http://aclweb.org/anthology/D14-1162 (Accessed: 18 May 2022).

Perry, Patrick O., and Kenneth Benoit. 2017. Scaling Text with the Class Affinity Model. http://arxiv.org/abs/1710.08963 (Accessed: 28 Nov. 2022).

Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. arXiv:2003.07082 [cs].

Roberts, Margaret E. et al. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science 58:1064–1082.

Sandhaus, Evan. 2008. The New York Times Annotated Corpus. 3250585 KB. https://catalog.ldc.upenn.edu/LDC2008T19 (Accessed: 31 Jan. 2022).

Stuhler, Oscar. 2022. Who Does What to Whom? Making Text Parsers Work for Sociological Inquiry. Sociological Methods & Research doi: 00491241221099551.

Trübner, Miriam, and Andreas Mühlichen. 2019. Big Data. In: Handbuch Methoden der empirischen Sozialforschung, vol. 1, ed. Nina Baur and Jörg Blasius, 143–158. Wiesbaden: Springer.

Vaswani, Ashish et al. 2017. Attention Is All You Need. arXiv:1706.03762 [cs].

Whittaker, Meredith. 2021. The steep cost of capture. Interactions 28:50–55.

Published

2023-09-29

Section

Section on Methods of Empirical Social Research: Current Topics in Empirical Social Research