Emeline Liu



Analyzing Fashion Trends with R

Background

The goal of this project at the outset was to read the consumer’s mind, in a sense, and predict changes in global trends, using the twitteR package to pull tweets. However, the scope changed over the semester to also include web scraping of blogs and retail websites. At this point, I have pulled information from Vogue Magazine’s Twitter account, the blog Girl with Curves, and the retailers Forever 21, Asos, and Land’s End.

Getting Data and Cleaning

Twitter

I pulled 1500 tweets from Vogue Magazine’s account for this round of visualization. The advice given at the last iteration of this project was to focus on a smaller amount of data and clean it more effectively. While my initial thought was that casting a wide net would give more information and thus more insight, that was not quite true. More information on Twitter seems to just give more noise rather than more useful information.

I decided on @voguemagazine because it’s a verified account and so would hopefully have less spam. For my presentation earlier, I used @fashionweekNYC’s Twitter, but I then realized that that is not a verified account and has no provable relationship with New York’s Fashion Week. In addition, because I’m looking at clothing stores rather than focusing on time-related fashion events, I figured that a fashion magazine might be more relevant.

Others

I decided to also look at other types of websites to see if there was any more interesting information that I could find. To do so, I used Hadley Wickham’s rvest package to do some web scraping. To make sure I was pulling the right information, I used the selectorgadget extension for Chrome to determine which CSS selectors I needed to use. I decided to take a look at one of my favorite blogs, Girl with Curves, along with Forever 21 and Asos. I was also considering some other stores, but either selectorgadget didn’t work or they block web scraping. As mentioned in class, I wanted to look at Talbots’ website, but their URLs are inconveniently structured, and I couldn’t use selectorgadget to pull out the CSS selectors. For some variety, I decided to look at Land’s End, which has a more mature audience than Forever 21 and Asos.
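
A minimal sketch of that scraping step, assuming rvest is installed. The HTML snippet and the .product-title class are invented stand-ins for a retailer's new arrivals page; on a real site the selector would be whatever selectorgadget reports, and read_html would be given the page URL rather than a string.

```r
library(rvest)

# Hypothetical markup standing in for a "new arrivals" page.
# The class names are invented for illustration only.
page <- read_html('
  <div class="product"><a class="product-title">Floral Print Maxi Dress</a></div>
  <div class="product"><a class="product-title">Striped Crop Top</a></div>
  <div class="product"><a class="product-title">Tribal Print Romper</a></div>
')

# Select every node matching the CSS selector and keep only its text.
titles <- html_text(html_nodes(page, ".product-title"))
titles
```

Each element of titles is one item title, which is the unit that later becomes a document in the term-document matrix.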

For the stores, I specifically looked at their new arrivals sections. I also applied a different stop word list to each set of data, as preliminary analysis showed that different insignificant words popped up the most in each source.
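
The idea of per-source stop word lists can be sketched in base R as follows. The titles, the list contents, and the clean_tokens helper are all hypothetical; the actual lists used for each source are not shown in this post.

```r
# Hypothetical item titles (not real scraped data).
titles <- c("Vero Moda Floral Dress With Lace Hem",
            "The Cold Shoulder Top And Skirt")

# One stop word list per source; the contents are assumptions.
stop_words <- list(
  asos  = c("with", "the", "and"),
  vogue = c("rt", "via", "the")
)

# Lowercase, split on anything that is not a letter, drop stop words.
clean_tokens <- function(text, stops) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nzchar(words)]
  words[!words %in% stops]
}

cleaned <- lapply(titles, clean_tokens, stops = stop_words$asos)
cleaned
```

Swapping stop_words$asos for another source's list changes which filler words survive, which is why a single shared list can leave different junk behind in each data set.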

Databook

As this is all text analysis, the predominant data is text. For the tweets, the text of all of the pulled tweets gets combined into a single term-document matrix. For the web scraping of online stores, rvest pulls each specific item title/description and holds it as a discrete instance.

I’ve included term-document matrix information for each of the information sources below:

## [1] "Vogue"
## <<TermDocumentMatrix (terms: 1800, documents: 1500)>>
## Non-/sparse entries: 10030/2689970
## Sparsity           : 100%
## Maximal term length: 47
## Weighting          : term frequency (tf)
## [1] "Girl with Curves"
## <<TermDocumentMatrix (terms: 321, documents: 36)>>
## Non-/sparse entries: 708/10848
## Sparsity           : 94%
## Maximal term length: 22
## Weighting          : term frequency (tf)
## [1] "F21 Women"
## <<TermDocumentMatrix (terms: 1052, documents: 3636)>>
## Non-/sparse entries: 10236/3814836
## Sparsity           : 100%
## Maximal term length: 23
## Weighting          : term frequency (tf)
## [1] "F21 Women+"
## <<TermDocumentMatrix (terms: 223, documents: 345)>>
## Non-/sparse entries: 820/76115
## Sparsity           : 99%
## Maximal term length: 23
## Weighting          : term frequency (tf)
## [1] "F21 Men"
## <<TermDocumentMatrix (terms: 15, documents: 4)>>
## Non-/sparse entries: 15/45
## Sparsity           : 75%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## [1] "Asos"
## <<TermDocumentMatrix (terms: 133, documents: 360)>>
## Non-/sparse entries: 1990/45890
## Sparsity           : 96%
## Maximal term length: 16
## Weighting          : term frequency (tf)
## [1] "Land's End"
## <<TermDocumentMatrix (terms: 182, documents: 120)>>
## Non-/sparse entries: 637/21203
## Sparsity           : 97%
## Maximal term length: 11
## Weighting          : term frequency (tf)
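
To make the summaries above easier to read, here is a small base R sketch of how a term-document matrix and its sparsity figure come about. The three documents are invented; the last line only re-does the arithmetic on the Vogue counts printed above.

```r
# Hypothetical item titles; each one is a "document".
docs <- c("floral print maxi dress",
          "striped crop top",
          "tribal print dress")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are term counts.
tdm <- vapply(tokens,
              function(tk) as.integer(table(factor(tk, levels = terms))),
              integer(length(terms)))
dimnames(tdm) <- list(terms, paste0("doc", seq_along(docs)))

non_sparse <- sum(tdm > 0)              # cells with at least one occurrence
sparse     <- length(tdm) - non_sparse  # cells that are zero
sparsity   <- round(100 * sparse / length(tdm))

# Same arithmetic on the Vogue summary: 2689970 zero cells out of 1800 x 1500.
vogue_sparsity <- round(100 * 2689970 / (1800 * 1500))
```

The Vogue matrix is about 99.6% zeros, which rounds to the 100% shown in its summary. That is expected for short texts: each tweet or item title uses only a handful of the 1800 terms.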

Packages Used

twitteR

twitteR provides an interface to the Twitter API. The function I think I’ll be predominantly using is searchTwitter, which issues a Twitter search based on a supplied search string.

httr

httr is a package that allows for working with HTTP organized by HTTP verbs. This package facilitates working with Twitter.

plyr

plyr is a set of tools used for breaking down a big problem into manageable pieces, operating on each piece, and then putting the pieces back together.

tm

tm is the Text Mining Package. The main functions of the tm package allow for data import, corpus handling, preprocessing, metadata management, and creation of term-document matrices on which further analysis can be performed.

ggplot2, graph, and Rgraphviz

Graphing packages used for data visualization.

wordcloud

As the package name would suggest, this is used to generate word clouds for data visualization.

FactoMineR

Package for multivariate exploratory data analysis and data mining. The capabilities of this package were used to create the correspondence word plot.

cluster

Also used to create the correspondence word plot.

Frequency of Words

Looking at which words are the most frequent is one of the simplest ways to get a mental hold on what’s going on in the data. This section includes a list of the most common words and then plots a bar graph of words against frequency. Different sets of data also require different minimums for frequent words, as they all vary in size. I set these cutoffs to display roughly the top 10-15 most common words for each source.
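
The cutoff logic can be sketched in base R like this. The counts are made up, and frequent_terms is a hypothetical helper; tm's findFreqTerms offers the same kind of filter through its lowfreq argument.

```r
# Hypothetical term-document matrix of counts (terms in rows).
tdm <- matrix(c(3, 0, 2, 1,
                1, 1, 0, 0,
                0, 2, 2, 1,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("dress", "lace", "print", "romper"),
                              paste0("doc", 1:4)))

# Total occurrences of each term across all documents.
term_freq <- sort(rowSums(tdm), decreasing = TRUE)

# Keep only the terms that occur at least `minimum` times.
frequent_terms <- function(freq, minimum) names(freq)[freq >= minimum]

frequent_terms(term_freq, minimum = 2)
frequent_terms(term_freq, minimum = 5)
```

Raising the minimum shortens the list, so a larger data set needs a higher cutoff to land on a similar number of words.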

@voguemagazine

Minimum value was set to 100. I thought it was interesting that Aziz Ansari, pizza, and the word “wanna” all appear approximately as often as YSL in the data. I’ll go into more detail on my speculation in the Associations section.

## [1] "best"       "dressesall" "gala"       "met"        "most"      
## [6] "rta"        "rtthe"      "time"       "your"

Girl with Curves

Minimum set to 10. As I mentioned during the presentation, the frequent words used in this blog are not very interesting. Obviously a fashion blogger will often mention accessories, dresses, lipstick, or shoes. It is interesting to see that Asos, Pandora, and Talbots all show up quite frequently.

I was initially pretty confused about what was going on with stuart, maclickable, and weitzman, but I realized that Stuart Weitzman is a brand of shoes. Maclickable is actually a type of lipstick: MAC Lickable. So that was somewhat interesting at least. I think the problem with looking at a blog is that the most frequent terms are going to just illuminate what that one blogger says the most. On one hand, that might be intriguing to look at, but it doesn’t really align with the purpose of looking into larger global trends.

##  [1] "asos"       "bag"        "dress"      "lipstick"   "old"       
##  [6] "pandora"    "season"     "shoes"      "sunglasses" "top"       
## [11] "under"

Forever 21: Women

Minimum set to 80. Observations from the class are that Forever 21 seems to be much more descriptive in titling their items. There are quite a few common adjectives used, namely stripe, striped, floral, print, knit, embroidered, and abstract.

##  [1] "abstract"    "bikini"      "cami"        "crop"        "denim"      
##  [6] "dress"       "embroidered" "floral"      "knit"        "lace"       
## [11] "longline"    "maxi"        "print"       "romper"      "shirt"      
## [16] "shorts"      "stripe"      "striped"     "tank"        "tee"        
## [21] "top"

Forever 21: Women+

Minimum set to 15. I wanted to include analysis on both the mainstream and plus-size clothing sections to see if there were any noticeable disparities. Where it took a minimum of 80 to cut down the term list for the regular sizing, it only takes 15 for the plus-size section. One interpretation of those results is that there are far more items available in the regular sizing section, which would make logical sense.

##  [1] "bikini"  "classic" "dress"   "floral"  "jeans"   "print"   "shorts" 
##  [8] "striped" "tee"     "top"     "tribal"

Forever 21: Men

Minimum set to 1. Apparently no words are used very commonly to describe men’s clothing. There also aren’t very many items offered on the website (only four made it into this data set), which is consistent with the low counts.

##  [1] "abstract"   "baseball"   "bow"        "drawstring" "geo"       
##  [6] "heathered"  "midcalf"    "pattern"    "print"      "rose"      
## [11] "socks"      "stripe"     "sweatpants" "tee"        "tie"

Asos

Minimum set to 30. As was noted in class, Asos seems to be much more reserved in their naming habits than Forever 21. I think it would be fair to say that Asos has a slightly older audience than Forever 21, and is also based in the UK rather than the US. This may account for the more conservative taste.

##  [1] "and"      "back"     "detail"   "dress"    "floral"   "hem"     
##  [7] "look"     "moda"     "print"    "shoulder" "top"      "vero"

Land’s End

Minimum set to 10. From my anecdotal experience, Land’s End caters more to selling classics and a more mature audience than either Forever 21 or Asos. I thought it was kind of interesting to see that swimsuit, coverup, and beach were all quite frequent words, where none of the other sites seem to mention those items. I also thought it was interesting to see that the single most common word was sleeve.

##  [1] "beach"      "coverup"    "dress"      "fit"        "pattern"   
##  [6] "shirt"      "short"      "sleeve"     "sleeveless" "stripe"    
## [11] "swim"       "swimsuit"   "top"        "tunic"

Associations

I thought the associations were the most interesting part of the data analysis because they actually convey a surprising amount of information. The most common types of prints or dresses can be determined by looking for the other words most associated with “print” or “dress.” It’s also interesting to note that this section, for me, was the most enlightening, yet the association lists include no data visualization. My thought from that is the human mind is still much more capable of digesting text than computers and statistics. In a similar vein, (most) computers still aren’t able to draw their own conclusions, which is definitely necessary for the analysis in this project.
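
As a rough illustration of what an association score measures, here is a base R sketch that correlates the rows of a tiny term-document matrix. The matrix and the associations helper are hypothetical; tm's findAssocs reports correlations of this kind above a chosen limit.

```r
# Hypothetical term-document matrix: 4 terms across 6 tweets.
tdm <- matrix(c(1, 1, 0, 0, 0, 0,   # pizza
                1, 1, 0, 0, 0, 0,   # brandon
                1, 1, 1, 0, 0, 0,   # annual
                0, 0, 0, 1, 1, 0),  # gala
              nrow = 4, byrow = TRUE,
              dimnames = list(c("pizza", "brandon", "annual", "gala"),
                              paste0("tweet", 1:6)))

# Correlate one term's counts with every other term's counts and
# keep the correlations that reach the limit.
associations <- function(tdm, term, corlimit) {
  others <- setdiff(rownames(tdm), term)
  r <- sapply(others, function(other) cor(tdm[term, ], tdm[other, ]))
  round(sort(r[r >= corlimit], decreasing = TRUE), 2)
}

pizza_assoc <- associations(tdm, "pizza", corlimit = 0.5)
pizza_assoc
```

A score of 1.00 means two words appear in exactly the same documents. One plausible reading of a long run of 1.00 scores is a single message repeated many times (for example through retweets), rather than a broad trend.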

@voguemagazine

After seeing the interesting word frequency results, I included the most common words from that part of the analysis here to see if the associations could shed some light. However, looking at the results, I think I’m even more confused. My only theory was that there was some event involving pizza and someone named Brandon. Again, it’s incredibly frustrating to look at Twitter results because I feel like they’re never quite significant or illuminating or interesting. Even with mindful cleaning, I think there is still the issue that people will always just post silly things to Twitter, especially in relation to fashion or clothing. People aren’t going to go to Twitter with serious or pedantic thoughts on clothing, even though that would make my text-mining job much easier.

## [1] "Associated with PIZZA"
##              pizza
## azizansari    1.00
## brandon       1.00
## hosting       1.00
## kidsmetball   1.00
## peep          1.00
## theme         1.00
## wintouryears  1.00
## ysl           1.00
## annual        0.97
## wanna         0.97
## [1] "Associated with WANNA"
## $wanna
## numeric(0)
## [1] "Associated with BRANDON"
##              brandon
## azizansari         1
## hosting            1
## kidsmetball        1
## peep               1
## pizza              1
## theme              1
## wintouryears       1
## ysl                1

Again, I always keep hoping that the next round of data visualization will shed some light onto what exactly is happening on Twitter. From this correlation plot, I’d say that there was some sort of event involving the new Avengers movie and Audi, and another event with Aziz Ansari and YSL. However, this just goes to prove once again that Twitter is not really the best place to go for insight. The associations in this section make me feel that this Twitter account, even though it’s the verified Vogue account, is mostly just full of advertisements.

Girl with Curves

As I mentioned above, looking at blogs shows more about the one person behind the blog than it does about what the consumer in general wants. Maybe a way to work around this would be to scrape information from a large number of blogs and perform text analysis on that entire body. In the meantime, I think these associations show more about the blogger’s writing style than they do about her actual style. The blogger often mentions something about the last or current season and will usually talk about shoes and bags and lipstick all together. I was hoping that maybe the associations would illuminate what type of bag the blogger talks about the most, but no such luck.

## [1] "Associated with SEASON"
##         season
## last      0.98
## current   0.83
## [1] "Associated with BAG"
## bag.shoes 
##      0.93
## [1] "Associated with SHOES"
## shoes.bag 
##      0.93
## [1] "Associated with STUART"
##             stuart
## accessories   1.00
## maclickable   1.00
## talbots       1.00
## weitzman      1.00
## under         0.93

This is by far the weirdest correlation plot. It basically just looks like every keyword is correlated other than “season.” My thought on this phenomenon is that it reflects what the blogger posts the most about, so the keywords appear to be correlated just because they frequently show up together in the same post.

Forever 21: Women

I thought these were some of the most interesting associations, which makes sense considering the abundance of descriptive words in the Forever 21 data. These associations show what kinds of dress, top, tee, tank, or print are the most commonly offered by F21.

## [1] "Associated with DRESS"
##           dress
## maxi       0.38
## flare      0.20
## shift      0.20
## fit        0.17
## sheath     0.17
## bodycon    0.15
## midi       0.15
## babydoll   0.14
## tshirt     0.13
## combo      0.11
## print      0.11
## strapless  0.10
## [1] "Associated with TOP"
##                 top
## crop           0.41
## halter         0.19
## peasant        0.17
## offtheshoulder 0.15
## bikini         0.14
## triangle       0.14
## crochet        0.11
## embroidered    0.11
## flyaway        0.10
## [1] "Associated with TEE"
##           tee
## muscle   0.45
## graphic  0.22
## baseball 0.16
## pocket   0.16
## longline 0.15
## raglan   0.13
## eptm     0.12
## ringer   0.11
## slub     0.10
## [1] "Associated with TANK"
##              tank
## racerback    0.20
## athletic     0.16
## fringe       0.14
## stripepocket 0.11
## vent         0.11
## [1] "Associated with PRINT"
##              print
## tribal        0.32
## abstract      0.30
## southwestern  0.22
## rose          0.19
## floral        0.15
## diamond       0.14
## mandala       0.13
## paisley       0.13
## tile          0.13
## dress         0.11
## jumpsuit      0.10
## [1] "Associated with DRAWSTRING"
##              drawstring
## shorts             0.28
## art                0.12
## funnelpocket       0.12
## kangaroo           0.12
## linenblend         0.12
## pop                0.12
## trench             0.12
## waist              0.11

The clothing retailers have much more logical correlation plots. This plot is with a minimum frequency of 10, displaying 30 total keywords, and a correlation threshold of 0.1.

The plot gets a little more interesting by lowering the correlation threshold to 0.01. However, while this appears to show more correlations, those ties might not be very strong considering that the correlation threshold is so low. Therefore, it’s probably not reasonable to draw any conclusions from this graphic, but I wanted to include it to contrast with the higher threshold.
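
The effect of the threshold can be sketched without any plotting. Below, the 0.38, 0.32, and 0.11 values are taken from the association lists above; the remaining correlations and the edges_above helper are hypothetical and only serve the illustration.

```r
# Term correlations (symmetric, 1 on the diagonal).
# 0.38, 0.32 and 0.11 come from the lists above; the rest are assumed.
terms <- c("dress", "maxi", "print", "tribal")
corr <- matrix(c(1.00, 0.38, 0.11, 0.02,
                 0.38, 1.00, 0.05, 0.00,
                 0.11, 0.05, 1.00, 0.32,
                 0.02, 0.00, 0.32, 1.00),
               nrow = 4, dimnames = list(terms, terms))

# List each pair of terms once (upper triangle) when its
# correlation reaches the threshold; each row is one edge.
edges_above <- function(corr, threshold) {
  keep <- which(upper.tri(corr) & corr >= threshold, arr.ind = TRUE)
  data.frame(from = rownames(corr)[keep[, "row"]],
             to   = colnames(corr)[keep[, "col"]],
             r    = corr[keep])
}

strong_edges <- edges_above(corr, 0.1)
weak_edges   <- edges_above(corr, 0.01)
```

Here strong_edges has three rows and weak_edges has five. The two extra edges have correlations of 0.05 and 0.02, which is why a denser plot at the lower threshold is not stronger evidence of a relationship.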

Forever 21: Women+

There are differences in the associations between the mainstream and plus-size departments, but I’m not sure if there’s anything significant to comment on. Like with Land’s End, quite a few swimsuits appear to be mentioned, but I don’t really have any substantial theories on why that occurs. Apparently heathered tees and tribal prints are popular across both sizing sections.

## [1] "Associated with DRESS"
##          dress
## maxi      0.47
## fit       0.28
## flare     0.28
## babydoll  0.23
## paisley   0.23
## [1] "Associated with TOP"
##         top
## back   0.22
## bikini 0.22
## panel  0.22
## [1] "Associated with TEE"
##            tee
## heathered 0.45
## pocket    0.31
## vneck     0.29
## barbie    0.28
## muscle    0.28
## destroyed 0.22
## [1] "Associated with TANK"
##            tank
## mesh       0.50
## burnout    0.35
## glow       0.35
## waldo      0.35
## wheres     0.35
## coverup    0.34
## graphic    0.25
## cutoutback 0.24
## racerback  0.24
## [1] "Associated with PRINT"
##          print
## tribal    0.52
## bikini    0.35
## tropical  0.31
## bottom    0.25

While I did include different options on the correlation plot for the mainstream and plus-size sections, I thought it was still interesting that they look so different. The plot also seems very logical to me: all of the jeans-related keywords are linked, as are the different prints.

Forever 21: Men

Apparently the baroque pattern is popular right now for men. I think that’s actually kind of striking because baroque is pretty decorative and intricate. I guess I am surprised this is a trend for men.

## [1] "Associated with TEE"
##           tee
## baseball    1
## heathered   1
## stripe      1
## [1] "Associated with PRINT"
##      print
## bow      1
## rose     1
## tie      1

I thought this plot was also surprisingly logical. All of the groupings are completely rational. Apparently baroque and rose print shirts are now a thing.

Asos

I’m really fascinated by how different F21 and Asos are. I tried to keep the same list of words to pull associations with, and the results are entirely disparate. Where F21 mentions maxi and flare dresses, Asos is about floral, midi, and printed dresses. In addition, the most common prints are bird, bloom, and bonded, not tribal or southwestern.

Asos, to me, has a more refined and mature style than F21, so it’s interesting to see that there are real differences between the two.

## [1] "Associated with DRESS"
##        dress
## moda    0.41
## vero    0.41
## floral  0.35
## [1] "Associated with TOP"
##             top
## cami       0.48
## cold       0.48
## crop       0.48
## cropped    0.48
## double     0.48
## jumbo      0.48
## laser      0.48
## layer      0.48
## leather    0.48
## neckline   0.48
## off        0.48
## ribtipped  0.48
## the        0.48
## shoulder   0.37
## cut        0.30
## mackintosh 0.30
## millie     0.30
## [1] "Associated with TANK"
##          tank
## elephant  1.0
## ord       1.0
## print     0.4
## [1] "Associated with PRINT"
##            print
## bird         0.4
## bonded       0.4
## dressblock   0.4
## dresscut     0.4
## elephant     0.4
## halter       0.4
## jersey       0.4
## love         0.4
## ord          0.4
## pineapple    0.4
## premium      0.4
## sides        0.4
## tank         0.4
## white        0.4

Land’s End

Again, there are visible differences between F21, Asos, and Land’s End. I think this is especially apparent when looking at the types of shirts: polo, pique, oxford, camp, collar. These types of shirts definitely give off a different vibe than maxi dresses or floral print.

##             sleeve
## short         0.68
## lightweight   0.35
## elbow         0.31
## supima        0.26
## vneck         0.26
## blouse        0.25
## camp          0.25
## neck          0.25
## oxford        0.25
## shirtdress    0.25
## sweetheart    0.25
##              top
## tankini     0.53
## swimsuit    0.50
## living      0.43
## beach       0.42
## scoop       0.41
## slub        0.36
## adjustable  0.34
## dresskini   0.33
## geo         0.33
## jersey      0.29
## tank        0.29
## embroidered 0.25
## ruffle      0.25
## scoopneck   0.25
##            shirt
## polo        0.61
## pique       0.49
## iron        0.44
## sleeveless  0.44
## camp        0.34
## collar      0.34
## oxford      0.34
## print       0.30
## long        0.26

Again, this is a pretty logical graph. I think it’s interesting that there’s so much focus on summer things like sleeveless shirts, coverups, and beaches. I wonder what Land’s End’s motivation behind their offerings are.

Wordclouds

I don’t have anything to mention specifically about each venue for this section. I am still fascinated by how much of a discernible difference there is across the different sources and stores. I don’t know if I have plausible interpretations of the differences, but it’s still pretty thought-provoking.

@voguemagazine

Girl with Curves

Forever 21: Women

I thought there would be much more discernible differences between the straight and plus-size sizing, but the word clouds prove otherwise. While I did have to change the minimum frequency from 40 for the mainstream sizing to 10 for plus size, the word clouds look almost identical. My conclusion is just that there are fewer plus-size options available.

Forever 21: Women+

Forever 21: Men

Asos

Land’s End

Correspondence Word Plot

Correspondence word plots seem to work best when there’s a relatively large amount of data with distinct clusters/topics. I decided to include these plots for only the Forever 21 data because the others don’t seem to shed much light on the data. This is mostly just another way to visualize the clusters/associations within the data.

Forever 21: Women’s

Conclusions

My intention when I first decided on this project was to see if I could elucidate what the consumer wants to buy using social media cues. The fashion and clothing business is a multibillion dollar industry, and being able to predict what the consumer wants would create a huge edge for any company. However, being ahead of the curve with fashion is still difficult even with powerful data analysis tools.

I had hoped that people would flock to Twitter to talk about things they wanted to buy or outfits they were wearing lately. However, people were more likely to post pictures or links to other sites than to use up their 140 characters on descriptions. Even verified accounts with ties to various Fashion Weeks had Twitter streams full of insignificant chatter, which is probably just the nature of the beast: Twitter is definitely a major venue for those more fleeting, insubstantial thoughts. I was hoping to be the person who could pull the clearer image out of all of the static, but I suppose someone else would have done it before me if it were that simple.

Even looking at a single blog didn’t really give the information I wanted. Obviously a person’s blog is going to be a reflection of self, which is interesting to look at. However, I wanted to gather information on global trends rather than the personal style of one person. Perhaps I should have thought to look at multiple blogs and combine them into one corpus for text analysis, but then the results would be dependent on the blogs that I chose.

The most revealing sources to look at were the retailers themselves. While I did include my conjectures as to what was going on in the above sections, there is no way to draw concrete conclusions on consumer trends. The problem is that looking at the retailers doesn’t necessarily reflect what the consumer actually wants, just what the retailer thinks will sell. The things being offered might also just be a result of the company buying a large quantity of something without much purpose and needing to sell all of it, which would make that item appear frequently. There’s also no simple way for me to access the data on what’s selling the most. The scraped information is interesting to look at, such as the differences between stores, but isn’t quite as marketable in terms of being able to predict what might happen next or knowing what’s on the consumer’s mind.

Next steps that might be able to give more concrete results would be to pull text from a variety of blogs and see if there are any outstanding trending topics across the entire body of work. Another possible interesting source would be fashion magazines, but I don’t think they publish all of their content online. I also think that this sort of analysis from the inside of a retailer would be much more interesting as there would probably also be meta-data on purchases that could be incorporated.

In the end, I think this project was extremely educational and thought-provoking, even though it was based on something superficial and materialistic. While it was frustrating that the main problem was that the perfect body of data doesn’t exist, I assume that is a realistic part of data analysis.