Machine learning reveals that news coverage of people in creative industries such as design and art is shaped by gender. Can it guide us toward parity?

How long would it take you to review half a million articles? Not just to read them, but to tally particular keywords, such as “he” and “she,” and the words that immediately follow them? Well, let’s just say you’d have to quit your day job.

Undeterred, the Creative Industries Policy and Evidence Centre, which provides independent research and policy recommendations for the U.K.’s creative industry, in partnership with the innovation foundation Nesta, made it their day job. They had some help: AI.


Using open-source data from the Guardian between 2000 and 2018, the Centre assessed how frequently keywords such as “he” and “she” appeared in sections of that publication that related to the creative fields: fashion, stage, media, books, and games. The study also looked at how men and women were represented, by evaluating the words that immediately followed those pronouns. What the researchers found was that the representation of gender in industry coverage is, for the most part, proportionate to the industry itself, showing both how far creative fields have come and how far they have to go.
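For readers curious what such a tally looks like in practice, here is a minimal Python sketch. It is not the Centre's actual code: the function names `tally_pronouns` and `she_share` are hypothetical, and the sketch assumes each article is already available as a plain-text string. It counts only a “he” or “she” that is directly followed by whitespace and another word, so a pronoun that ends a sentence is skipped.

```python
import re
from collections import Counter

# Matches "he" or "she" as a whole word, followed by whitespace and the next word.
# Assumption: simple regex tokenization is good enough for an illustration.
PRONOUN_PATTERN = re.compile(r"\b(he|she)\s+([a-z']+)", re.IGNORECASE)


def tally_pronouns(articles):
    """Count gendered pronouns and the words that immediately follow them."""
    pronoun_counts = Counter()
    followers = {"he": Counter(), "she": Counter()}
    for text in articles:
        for pronoun, next_word in PRONOUN_PATTERN.findall(text):
            pronoun = pronoun.lower()
            pronoun_counts[pronoun] += 1
            followers[pronoun][next_word.lower()] += 1
    return pronoun_counts, followers


def she_share(pronoun_counts):
    """Fraction of counted gendered pronouns that are "she" (0.0 if none)."""
    total = pronoun_counts["he"] + pronoun_counts["she"]
    return pronoun_counts["she"] / total if total else 0.0
```

Calling `tally_pronouns` on a list of article texts returns the overall pronoun counts plus, for each pronoun, a counter of the words that came next. Those are the two ingredients the study's headline figures rest on.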



Between 2000 and 2013, female pronouns made up less than a third of all gendered pronouns in the creative sections. Over the next four years, their share rose 10 percentage points, reaching 40% in 2018, 3 percentage points more than the proportion of women in the U.K. creative industry itself. And while nonbinary “they/them” pronouns were not a part of the study, because they could not be reliably distinguished from third-person plural pronouns in the data set, the use of the term “nonbinary” also saw a spike. While it was used only 100 times over the 18-year period, 50 of those mentions were in 2018.

In 2000, about a quarter of direct quotes in those sections were attributed to women. As of 2016, the proportion of direct quotes attributed to women is slightly higher than the percentage of female pronouns in those sections.



According to the study, words such as “directed,” “performed,” “painted,” and “designed,” as well as “managed,” “founded,” and “launched,” which indicate achievement and leadership, were more likely than other words to refer to men. The share of these words that followed “she” was about five percentage points lower than the share of “she” among “he” and “she” taken together.

Words such as “sings,” “sang,” “dances,” and “danced” were all more likely to refer to women than men, indicating that certain creative activities are more strongly associated with one gender than another. The researchers saw this same trend play out with entire sections as well: The “she” pronoun was represented the most in fashion, at 52%, and the least in tech and games (26% and 25%, respectively). So while every creative section has made more space for women, the pace of growth depends on the topic itself.
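The word-level comparison described above can be sketched in a few lines of Python. This reflects one reading of the study's measure, not its published method: for a chosen group of words, compute the share of occurrences that follow “she” rather than “he,” then subtract the overall “she” share to get a gap in percentage points. The function names and the input shape (a dictionary with one `Counter` of following words per pronoun) are assumptions made for illustration.

```python
from collections import Counter


def she_share_for_words(followers, words):
    """Share of the given words' occurrences that follow "she" rather than "he"."""
    she = sum(followers["she"][w] for w in words)
    he = sum(followers["he"][w] for w in words)
    total = she + he
    return she / total if total else 0.0


def gap_in_points(followers, words):
    """Percentage-point gap between the words' "she" share and the overall "she" share."""
    she_total = sum(followers["she"].values())
    all_total = she_total + sum(followers["he"].values())
    baseline = she_total / all_total if all_total else 0.0
    return 100 * (she_share_for_words(followers, words) - baseline)
```

A negative gap for a group such as “managed,” “founded,” and “launched” would mean those words follow “she” less often than the overall pronoun mix predicts; a positive gap would mean the opposite.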

In a time when the words “big data” strike fear in the hearts of many, this large quantity of data—all 500,000 articles—was combed through thanks to AI. The organization saw this information, which is newly available because of machine learning capabilities, as a reason to appreciate AI, not fear it. The visualizations created from the findings are now being considered for the Kantar Information Is Beautiful Awards.

And while the Centre was limited to the Guardian because it’s one of the few publications that makes its data publicly available through an open-access API, Dr. Cath Sleeman, who created the data visualizations and conducted the research, wrote that the limitation highlighted her feeling that “open data is the key ingredient to enabling more in-depth analysis of diversity,” and that we can “use big data and machine learning to generate more meaningful insights on gender inequality.”

But like anything else, there’s good and bad. Amazon used AI to screen candidates for job opportunities based on keywords in their résumés. However, the program was ultimately sunsetted after the company realized that the data set of résumés used to train the AI was largely made up of men, so the program learned to treat keywords commonly used by men as positive traits. Women with different keywords were left behind, and so too was this machine learning experiment.

“While big data studies can enrich diversity measures, there are two important sources of potential bias. First, we’re almost always inferring gender,” says Sleeman, “from a face, a first name, or a single pronoun—and so we may get a person’s gender wrong. Second, these inference methods typically only detect ‘male’ and ‘female,’ excluding or misclassifying anyone who identifies with a nonbinary gender.” For these reasons, Sleeman felt big-data methods still shouldn’t replace surveys, which allow people to self-identify or completely opt out.

Bias is an ever-present danger when automating new processes based on data sets of the past. But perhaps when AI is used in this context, to gather data that reveal bias patterns of days gone by, we can use it to leave those behaviors where they belong, in the past, rather than carrying them forward as instruction for the future.

Via Fastcompany.com