21st Century Linguistics

Security on the web has as much to do with the programmers writing code as it does with firewalls and virus protection. Linguistics Associate Professor Raúl Aranovich studies language structure and theory, and is working on a project for the National Science Foundation that could identify programmers most likely to write vulnerable code.

Last year, Aranovich won funding to lead a collaboration with UC Davis computer scientists P. T. Devanbu and V. Filkov on a project called "Language, Computation and Cybersecurity." It is an Early Concept Grants for Exploratory Research (EAGER) project funded by the National Science Foundation.

Aranovich is also part of a research team awarded funding in January from the Institute for Social Sciences in the UC Davis College of Letters and Science to expand our understanding of memory using music and brain scans.

Here, he describes his work on leading-edge linguistics at UC Davis.

Tell me about your NSF project.

We are looking at open-source software communities where developers collaborate online. Because all collaboration is online, there’s a lot of language involved, and also a lot of code that’s being exchanged. We’re trying to see how the social dynamics among programmers relate to their coding style and their linguistic style.

It’s really unusual work. Nobody has done this kind of thing before. The interesting thing is that it has an application to cybersecurity. We think that once we identify these linguistic profiles within the group and understand the group dynamics, we can find which programmers are more prone to writing vulnerable code.

There’s this big debate over whether an author leaves a quantitative fingerprint on his or her work. It could come from things like average sentence length or how many adverbs you include in your writing or your speech. These measures have been used to identify the authors of anonymous texts, to check authors who claim to have written a piece of text that is not even theirs, and even for forensics. A lot of people who are involved in this try to find signatures to unmask anonymous, dangerous users online.
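
As a rough illustration of the kind of measures mentioned above, here is a minimal Python sketch that computes an average sentence length and a crude adverb rate for a piece of text. It is not the method used in the NSF project. The function name `stylometric_features`, the punctuation-based sentence splitting, and the use of an "-ly" suffix as a stand-in for adverbs are all assumptions made for illustration; a real study would use a proper tokenizer and part-of-speech tagger.

```python
import re


def stylometric_features(text):
    """Return two simple style measures for a text (illustrative only)."""
    # Assumption: a sentence ends at one or more of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    # Crude proxy for adverbs: words longer than three letters ending in "ly".
    # This misses adverbs such as "often" and wrongly counts words like "family".
    ly_words = [w for w in words if len(w) > 3 and w.lower().endswith("ly")]
    return {
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "ly_word_ratio": len(ly_words) / len(words) if words else 0.0,
    }


sample = "She spoke quickly. He answered slowly and carefully. They left."
features = stylometric_features(sample)
print(features)
```

Comparing such numbers across texts is one simple way to look for a "fingerprint," although, as the interview notes later, measures like these often fail to hold up across different samples.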

What motivated you to pursue this project?

This turn toward more quantitative applications of linguistics, I see it as a trend happening in my discipline right now and I want to stay on top of things. I want to be doing 21st century linguistics and not 20th century linguistics. Also, there’s a lot of demand for this kind of knowledge and expertise in linguistics from other fields, from computer science, from psychology, from neuroscience, etc.

I believe in collaboration. It makes us all more productive. It’s more interesting to do research with your colleagues than alone. It helps us find new areas of inquiry. All of these possibilities, plus my shifting interests, have been fueling my move toward this kind of work.

What is the most surprising recent finding?

How little we know. Even though people claim to have found quantitative measures of similarity or identity for authors, they don’t pan out. It always depends on the sample, on who you’re comparing, whether the author has a style or not. Trying to find what measures actually work has become very interesting to me.

The other thing is studying things like sentence length. The statistical properties of sentence length are kind of weird. They don’t match any known distribution for natural phenomena. You cannot make them fit. I think we still do not understand the stochastic processes behind them, the probabilistic processes that are there for vocabulary selection, the content that you want to express, memory limitations, and the complexity of the sentence structure itself. We don’t know how these features vary from situation to situation.

What is the biggest research opportunity in your field?

The biggest opportunity is in applying all of these new ideas to languages that are relatively understudied. A lot of people are going after that, from both psychology and computer science, creating these collections of texts and then running them through statistical natural language processing tools.

With English this is easy. With Spanish, yes. We have lots of resources. Finding and building resources for some of the languages I’m interested in — like Shona, the language from Zimbabwe, or Fijian, or some of the more obscure dialects from the Balkans — that’s what I would really like to be doing in the future. That’s what I would like my students to be doing in the future.

What are the opportunities for interdisciplinary collaborations in the future?

They are enormous. Language is everywhere. Everything touches on language in one way or another. The old ways of doing linguistics, which were very detailed, model-building, drawing a tree diagram for a few sentences, explaining some core problems like subject and object, question formation, these traditional areas of inquiry for linguistics, that didn’t generate much excitement outside of linguistics, except for maybe people doing psycholinguistics.

Once you move into this open-minded application of quantitative tools and statistical and analytical techniques to language, then a lot of other people become excited.

At the [Institute for Social Sciences] conference, a colleague in sociology told me about doing work in counting keywords in political texts. I can talk about what to do next, how to move it to the next step. That’s what I’ve been doing.

It will happen more and more, where people with these kinds of skills will be able to help in other fields. In the tech industry, there is more and more demand for people who can do text retrieval and text classification.

What other UC Davis researchers’ work are you most excited about?

My collaborator Prem Devanbu in computer science has been proposing a thesis about the naturalness of software, that computer languages are very similar to natural languages in their usage. Vladimir Filkov in computer science is also working with me and he does machine learning.

Duncan Temple Lang in statistics. Anything about data science, scraping data off the web for building corpora and getting more resources out of available huge amounts of data for linguistics.

Petr Janata for the way he looks at memory and language and the connection for what happens in the brain.

Also Fernanda Ferreira from psychology. She was just hired and she does very interesting work on syntax processing in psychology.

We have probably the largest community of people doing interdisciplinary work in linguistics of any of the UCs. We need to leverage that. That’s something we’re hoping to do in the next two to five years. We could become the world center for this kind of linguistics.

— Alex Russell