We present a study of the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 users of Twitter. Prior quantitative work on gender often treats this social variable as a binary; we argue for a more nuanced approach. By clustering Twitter feeds, we find a range of styles and interests that reflects the multifaceted interaction between gender and language. Some styles mirror the aggregated language-gender statistics, while others contradict them. Next, we investigate individuals whose language better matches the other gender. We find that such individuals have social networks that include significantly more individuals from the other gender, and that in general, social network homophily is correlated with the use of same-gender language markers. Pairing computational methods and social theory thus offers a new perspective on how gender emerges as individuals position themselves relative to audiences, topics, and mainstream gender norms.
First, previous quantitative work has focused on the words that distinguish women and men solely on the basis of their gender, attempting to find markers that uniquely characterize each group. This disregards strong theoretical arguments and qualitative evidence that gender can be enacted through a diversity of styles and stances. By clustering the authors in our dataset, we identify a range of different styles and topical interests. Many of these clusters have strong gender orientations, but their use of linguistic resources sometimes directly conflicts with the aggregated language-gender statistics. We find that linguistic tendencies that have previously been attributed to women or men as undifferentiated social groups often describe only a subset of individuals; there are strongly gendered styles that use language resources in ways that are at odds with the overall aggregated statistics.
Second, previous corpus-based work has had little to say about individuals whose linguistic styles defy aggregated language-gender patterns. To find these individuals, we build a classifier capable of determining the gender of microblog authors from their writing style, with an accuracy of 88%. We focus on the individuals that the classifier gets wrong, and examine their language in the context of their online social networks. We find a significant correlation between the use of mainstream gendered language, as represented by classifier confidence, and social network gender homophily (how much a social network is made up of same-sex individuals). Individuals whose gender is classified incorrectly have social networks that are much less homophilous than those of the individuals that the classifier gets right. While the average social network in our corpus displays significant homophily (63% of connections are same-gender), social network features provide no marginal improvement in the classifier performance. That is, social network gender homophily and the use of mainstream gendered linguistic features are closely linked, even after controlling for author gender, suggesting a root cause in the individual's relationship to mainstream gender norms and roles. We see these individuals not as statistical outliers, but as people who are coherently “doing” gender in a way that influences both their linguistic choices and their social behavior.
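To make the two quantities in this comparison concrete, the following is a minimal sketch, not the procedure used in the study. It assumes that homophily is measured as the fraction of an author's connections who share the author's gender, and that classifier confidence is the probability the classifier assigns to the author's actual gender. The function names, the toy records, and the use of a plain Pearson correlation are all illustrative assumptions; the sketch also does not control for author gender.

```python
from math import sqrt

def homophily(author_gender, neighbor_genders):
    """Fraction of an author's connections that share the author's gender."""
    if not neighbor_genders:
        raise ValueError("author has no connections")
    same = sum(1 for g in neighbor_genders if g == author_gender)
    return same / len(neighbor_genders)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented toy records: (author gender, genders of connections,
# hypothetical classifier probability assigned to the author's actual gender).
authors = [
    ("F", ["F", "F", "F", "M"], 0.95),
    ("F", ["F", "M", "M", "M"], 0.30),
    ("M", ["M", "M", "F", "M"], 0.90),
    ("M", ["F", "F", "M", "F"], 0.40),
]

homophily_scores = [homophily(gender, neighbors) for gender, neighbors, _ in authors]
confidences = [confidence for _, _, confidence in authors]

# A positive value means that authors whose language the classifier finds
# more typical of their gender also have more same-gender connections.
r = pearson(confidences, homophily_scores)
print(f"correlation between confidence and homophily: {r:.3f}")
```

In this toy data the two authors with low confidence, who would be misclassified at a 0.5 threshold, are also the two with the least homophilous networks, which is the pattern the paragraph above describes.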