Lyrics! Trends, Analysis, and Graphics.

(Note: always click on the plots to see them properly)

People are good at understanding the meaning of complex (or simple) texts. They are also good at programming computers to do something else: to count and calculate. That’s what I did with artist lyrics.

I looked at 53 artists (mostly some of my favourites, some requests), 202 albums, 2428 songs, totalling 10386 minutes of music, and extracted their lyrics to try to develop ways of visualizing the data, and to find broad word usage patterns that would otherwise be hard to detect. It’s not a systematic study, it’s an exploration, and it’s based on examples and artists I like. Enjoy.

 

Word Density:

Let’s start by looking at which artists are more verbose. To measure that, we can count the number of words and divide by the minutes of music.

[Plot: word density (words per minute of music) by artist]

First: there’s a huge variety in word density. There is about a 10-fold difference between Lil Wayne or Miley Cyrus (more than 100 words per minute) and Bonobo or Opeth (only about 10 words per minute). The general trend makes sense: hip hop and pop on the verbose side, and electronica and prog-rock/metal on the other side.
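
For the curious, the word density measure can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `word_density` and the simple regex tokenizer are made up here, and the toy lyric is not taken from the real dataset.

```python
import re

def word_density(lyrics: str, minutes: float) -> float:
    """Return words per minute of music for one song, album or artist."""
    # Assumed tokenizer: lowercase runs of letters and straight apostrophes.
    words = re.findall(r"[a-z']+", lyrics.lower())
    return len(words) / minutes

# Hypothetical toy input: 8 words over 2 minutes of music.
density = word_density("Here comes the sun, here comes the sun", 2.0)
```

How words are split (contractions, hyphens, ad-libs) changes the counts a little, so any real comparison depends on the tokenizer being the same for every artist.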

But what words are they using? This is the main focus of all that follows.

Word frequency:

What’s interesting now is to start comparing content. To do that, the main concept is going to be the frequency of a word relative to the total number of words used. A word-X frequency of 0.05 would mean that the word represents 5% of the lyrics analyzed. For example, here is the frequency of the word “sun”.

[Plot: frequency of the word “sun” by artist]

The Swedish duo The Knife is out in front in using “sun”, and The Beatles are also frequent mentioners of the glorious star (and it’s not just in their song “Here Comes the Sun” – although that is the most sun-dense song).
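
The frequency measure described above can be sketched like this. It is a minimal illustration, assuming a simple regex tokenizer; `word_frequencies` is a hypothetical name and the lyric is toy input.

```python
import re
from collections import Counter

def word_frequencies(lyrics: str) -> dict:
    """Map each word to its share of all the words in the lyrics."""
    # Assumed tokenizer: lowercase runs of letters and straight apostrophes.
    words = re.findall(r"[a-z']+", lyrics.lower())
    total = len(words)
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy input: 10 words, one of which is "sun".
freqs = word_frequencies("here comes the sun and i say it's all right")
sun_frequency = freqs["sun"]
```

In this toy line “sun” is one word out of ten, so its frequency is 0.1, meaning 10% of the lyrics.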

But one word at a time is not going to get us very far, is it?

Word Groups:

Perhaps more interesting than individual words are word groups. The groups allow for a better understanding of general themes and include more information for each artist. Here’s an interesting question: who mentions more body parts?

[Plot: frequency of body-part words by artist]

Here the total height of the bar is the total frequency of the whole group, and the different colors in each bar are the proportions of each body part. Mastodon, Florence and the Machine, Moby, Bonobo, Opeth, and The Knife are the 6 artists who most frequently mention body parts (words include both the singular and plural forms). But they mention them with different balances. Moby for example repeats “body” very frequently. Florence on the other hand (…) mentions a wide diversity of body parts! Some artists mention specific body parts much more than any other: Jose Gonzalez for example uses “hands” very frequently (interestingly mostly when covering The Knife’s “Heartbeats”).
Jeff Buckley connoisseurs will not be surprised to see “lip/s” more strongly represented in his lyrics than in those of any other artist – except for Nirvana, who repeatedly scream “kiss, kiss molly’s lips” in a Vaselines cover.
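
The stacked bars described above can be sketched as follows: one frequency per body part (the colored segments), summed to get the bar height. This is an illustration under assumptions. The `BODY_PARTS` mapping, the function name `group_breakdown` and the toy lyric are all hypothetical, and the real group surely contains more words.

```python
import re
from collections import Counter

# Hypothetical group: each label maps to the singular and plural forms counted under it.
BODY_PARTS = {
    "hand": {"hand", "hands"},
    "lip": {"lip", "lips"},
    "body": {"body", "bodies"},
}

def group_breakdown(lyrics: str, group: dict) -> dict:
    """Return the frequency of each label in the group (one bar segment each)."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    total = len(words)
    if total == 0:
        return {label: 0.0 for label in group}
    counts = Counter(words)
    return {
        label: sum(counts[form] for form in forms) / total
        for label, forms in group.items()
    }

# Toy lyric: 8 words, of which 2 are "hands", 1 is "body" and 1 is "lips".
breakdown = group_breakdown("my hands my body your lips your hands", BODY_PARTS)
bar_height = sum(breakdown.values())  # total frequency of the whole group
```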

Another interesting group of words to look at is colors!

[Plot: frequency of color words by artist]

 

Jeff Buckley shows up as the most colorful artist. Mostly because of how much he says “black, black, beauty” in “Mojo Pin”. Amy Winehouse is, perhaps unsurprisingly, mostly black and blue (as is Moby). Take a look at The Beatles, and you’ll find their yellow submarine; Martina Topley Bird’s “Baby blue” is also present.  Portishead, in 3 albums, only mention the color white, and Cat Power almost only mentions black (with a little spot of red).

Any George Carlin fans? In the 1970s he had a famous bit about seven dirty words you can’t say on TV. I decided to look for them in our artists’ lyrics. I had to remove some of them because they appeared nowhere. But you get the idea.

[Plot: frequency of George Carlin’s dirty words by artist]

Is it any surprise that rap (Lil Wayne, Das Racist) and heavy bands (Rage Against the Machine, Pantera) appear at the top? But do not generalize! Opeth are a metal band and they don’t swear.

I’ll end with Water Themes:

[Plot: frequency of water-themed words by artist]

Puscifer and Morphine take the lead here. Two of my favourites.

I could go on with plots like these (and in fact I have many in reserve and will do some for you if you ask), but for now, something different.

 

Word Relations

Word relations might be interesting to look at as well. Are there words that tend to be mentioned together, or to exclude each other? Given the variability of vocabulary and the fluidity of language, it’s going to be very hard to find nice straight linear relations like the ones you get in basic physical laws, but we still get significant associations! Here are some examples I’ve chosen to show.

Apparently the word “heart” is inversely correlated with the word “head”.

[Plot: “head” vs “heart” frequency by artist, Spearman correlation]

It seems like the more you mention your head, the less likely you are to mention your heart (and vice versa). Pain of Salvation and Das Racist mention neither very much.
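
The plot names suggest these relations are measured with a Spearman rank correlation, which compares the ordering of artists on each word rather than the raw frequencies. Here is a minimal sketch of that idea using only the standard library. The function names and the numbers are hypothetical, not values from the plots.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical frequencies for four artists: more "head" goes with less "heart".
head = [0.001, 0.002, 0.003, 0.004]
heart = [0.009, 0.007, 0.004, 0.002]
rho = spearman(head, heart)
```

With these toy numbers the two orderings are exactly reversed, so the correlation comes out at -1. Real lyrics give much noisier values.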

There are several other relations between words. But there is a particular trio that I like a lot. First: skin and flesh.

[Plot: “flesh” vs “skin” frequency by artist, Spearman correlation]

Not many artists mention flesh (that’s why you have a pile on the left side of the graph). But the general trend is clear: the more skin, the more flesh. And interestingly, it’s the metal bands – Opeth, Pain of Salvation, Mastodon, Nine Inch Nails, Pantera – who take the lead. But also Jeff Buckley – not surprisingly for anyone knowing his corporeal themes.

What’s interesting also is that flesh, for those who mention it, seems to incite cries:

[Plot: “flesh” vs “cries” frequency by artist, Spearman correlation]

It’s Pain of Salvation, and Jeff Buckley there at the top right with a lot of dramatic language. The flesh is a powerful thing.

 

 “Me” & “You”

One thing that combines the last two analyses is to look at the relationships not just between words, but between word groups. I picked out my favourite example: the relation between the group {me, mine, myself, my, i} and the group {you, your, yours, yourself}. This way we’ll find not only the artists who are most self-centered or other-centered, but we’ll also learn about the relation between the two tendencies. Here’s the result:

[Plot: {me, mine, myself, my, i} vs {you, your, yours, yourself} frequency by artist, Spearman correlation]

So there is a general trend whereby talking more about oneself, takes away from talking about another: the more me, the less you. Jose Gonzalez for example almost doesn’t mention himself, but is quite high on the “you” scale. An opposite example might be Moby whose lyrics apparently include a lot of “me” and very little of “you”.

Looking at one axis at a time is also instructive. For example, we can see that Fiona Apple, Nirvana and Amy Winehouse seem to take the lead in mentioning themselves. But when it comes to mentioning another, the champions are Martina Topley Bird, Dillinger Escape Plan, the Deftones and Cat Power. The two women are in the good company of some heavy bands.

Thinking about the bottom left and the top right corners is also interesting. On the top right, one-on-one personal relations abound, and on the bottom left are artists who use very little personal vocabulary. No surprise to find Rage Against the Machine there – it’s about the people as a collective comrade!

 

Comparing artist groups

Instead of doing something obvious (like comparing pop vs rock artist lyrics) I decided to ask a simple and more personal question: are there words that occur more frequently in the artists I most like? To answer it, I simply divided the 53 artists into categories (“Favourite” and “Less Favourite”), and then calculated statistically significant differences in word frequencies between the two groups (yes, I did a t-test). Here are 2 words that artists I like tend to use more:

[Boxplots: frequency of “truth” and “embrace” by artist group]

And two words that my favourite artists use less.

[Boxplots: frequency of “wife” and “tv” by artist group]

 

It appears my artists (and me?) are not much into married life and sitting around watching tv, but instead care about truth and embraces. I agree – and it reminds me of this!
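
The group comparison in this section rests on a t-test. The post does not say which variant was used, so the sketch below assumes Welch’s unequal-variance t statistic. The function name `welch_t` and the numbers (occurrences per 10,000 words for three artists in each group) are made up for illustration.

```python
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    # Sample variances, each scaled by its own group size.
    standard_error = (
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    ) ** 0.5
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical frequencies of one word, per 10,000 words, for each artist.
favourite = [4.0, 5.0, 6.0]
less_favourite = [1.0, 2.0, 3.0]
t_statistic = welch_t(favourite, less_favourite)
```

A positive statistic means the word is more frequent in the first group. Turning it into a p-value needs the t distribution, which the standard library does not provide, so that step is left out here.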

 Artist focus

Finally I wanted to know what makes each artist special and different from the others. We could do the traditional word clouds. Here are two examples.

Tool

[Word cloud: Tool]

and Radiohead

[Word cloud: Radiohead]

This is not exactly what I want. Word clouds give you a feeling for which words are more common in each artist’s albums. What they do not do, however, is compare the artists with each other. To understand what makes an artist different we need to compare the frequency of word-X in that artist with the frequency of the same word-X in all the other artists. That’s what I did. (For the analytical minds out there: I computed the z-score distribution for each word frequency and extracted anything greater than 5). So here are words that Tool and Radiohead use more frequently than the others:

[Plots: words with a z-score above 5 for Tool and for Radiohead]
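
The z-score idea can be sketched like this. It is an illustration under assumptions: each word’s mean and standard deviation are taken across all artists (including the one being examined), the function name `distinctive_words` is hypothetical, and the frequencies are toy values.

```python
from statistics import mean, pstdev

def distinctive_words(freqs_by_artist, artist, threshold=5.0):
    """Return {word: z-score} for words the artist uses unusually often."""
    vocabulary = set().union(*(f.keys() for f in freqs_by_artist.values()))
    result = {}
    for word in vocabulary:
        # Frequency of this word in every artist; absent words count as 0.
        values = [f.get(word, 0.0) for f in freqs_by_artist.values()]
        spread = pstdev(values)
        if spread == 0:
            continue  # every artist uses the word equally often
        z = (freqs_by_artist[artist].get(word, 0.0) - mean(values)) / spread
        if z > threshold:
            result[word] = z
    return result

# Hypothetical word frequencies for five artists.
toy = {
    "Tool": {"spiral": 0.004, "love": 0.010},
    "Artist A": {"love": 0.012},
    "Artist B": {"love": 0.008},
    "Artist C": {"love": 0.011},
    "Artist D": {"love": 0.009},
}
# With only five artists a z-score of 5 is out of reach, so the toy run uses 1.5.
special = distinctive_words(toy, "Tool", threshold=1.5)
```

In the toy data only “spiral” stands out for Tool, because nobody else uses it, while “love” is used about equally by everyone.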

 

Let’s also see what words make Pink Floyd and Cat Power special:

[Plots: words with a z-score above 5 for Pink Floyd and for Cat Power]

 

General patterns are hard to spot, but it’s interesting to see all the words with mathematical connotations in Tool: calculated, forty, third, divide, spiral, union. There’s also a prevalence of vulnerable or darker language: insecure, satan, withering, drags, worthless, widow, crawled.

Cat Power seems geared towards affective language that others dare not mention: marry, romance, kissing. But also travels and sights: manhattan, rome, daydream, mexico, wilderness, waterfall.

Other than broader patterns you can clearly identify specific songs or themes: for example there’s Pink Floyd’s crazy diamond, their suicidal “Waiting for the Worms”, and the Wall (as well as bricks) is clearly uniquely represented. I’ll leave Radiohead up to you.


Thank you to Luisa for some suggestions. If you want to see specific plots let me know, I can easily do them. If you want more artists included, let’s find the lyrics and they’ll be here in version 2.

Peace.

A request from my friend Ana (quoted here): who says more “la” (as in “la la la” or other similar expressions – lah, laaa, laaah, etc – all condensed into one)? Admittedly this really depends on the transcribing of the lyrics. But we’ll still learn from the plot:

[Plot: frequency of “la” by artist]
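
Condensing the different spellings into one could be done with a small pattern like the one below. This is only a guess at the mechanics: the regular expression and the name `normalize_la` are assumptions, and other transcriptions (for example “lalala” written as one word) would need a different rule.

```python
import re

# Assumed pattern: an "l", one or more "a", then any number of "h" (la, lah, laaa, laaah).
LA_VARIANT = re.compile(r"la+h*")

def normalize_la(word: str) -> str:
    """Collapse every spelling variant of "la" into the single token "la"."""
    return "la" if LA_VARIANT.fullmatch(word) else word

tokens = "la lah laaa laaah land lake".split()
normalized = [normalize_la(token) for token in tokens]
```

Using a full match keeps ordinary words such as “land” or “lake” out of the count.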

Oh brit-pop … 😉

Malcolm Gladwell’s fundamental attribution error – On external factors of academic/professional success.

In a recent talk at Google, Malcolm Gladwell severely over-interprets data about post-academic achievement, proposing palpably unjustified conclusions.

The data and trend Gladwell picked up on is interesting. In summary: after leaving university, very capable students who were in the middle of their class in top world universities had worse professional outcomes than less capable students who were first in their class in non-top universities. In other words: there is some disadvantage in being the less good among the best possible group, when compared to being the best in a mediocre group.

Gladwell proceeds to propose that the reason is that top-of-class students become more motivated and self-confident even when the class is overall bad, while middle-of-class students lose motivation and self-confidence. On this account, the reason for the disparity in professional outcomes is intrinsic to the student: motivation and self-confidence result in more hard work. Gladwell goes so far as to suggest that it would be irrational to hire based on any absolute ranking, and that hiring should instead be based on relative rankings. After all, the top-in-class will be more motivated and perform better.

Left out of Gladwell’s account are factors extrinsic to the students. In omitting them he commits a form of the fundamental attribution error: a well-known bias that leads observers to explain people’s behaviour by their attitudes and dispositions as opposed to circumstances and external factors. Another plausible explanation for the disparity in outcomes is that top-of-class students (of non-top universities), when compared to middle-of-class students (of top universities), are: given more local awards, honours and distinctions; given more access to preferential treatment by those who can open doors for them (professors, friends); given more opportunities to express themselves and showcase their work to the outside; given stronger, more supportive recommendation letters; perceived by the outside as more capable because of their local rank; etc. All of these are perfectly plausible explanations for the data Gladwell shows. (Indeed, more plausible in my opinion.) And they have very different consequences if true.

Just imagine an example: a top-of-class student at “Bad University” will get the best recommendation letter that his/her professor wrote that year, while a middle-of-class student at Harvard will get nothing much from any of his/her professors. These are extrinsic reasons, and they say nothing about the student’s actual capacity, while they will seriously affect his/her career. In other words, Gladwell has no reason to exclude the simple possibility that a middle-of-class student at “Top University”, if given the same opportunities and support, would do better than a top-of-class student at “Bad University”.

To conclude with a more general observation: structural and institutional factors are often disregarded as explanatory reasons for professional success (or personal success for that matter). Recognizing this leads to very different conclusions, and it puts a very different weight on us to change – as opposed to the “motivation” of students or workers. If we are all wasting the talent of very capable people just because they don’t stand out in their very capable group, then we need to find ways of paying less attention to rankings, and more attention to the actual capacity of each individual.