Art Auction Data: Exploratory Data Analysis
A series of posts related to an art auction price model project.
I’m working towards an ML project that models painting prices in the secondary art market based on a variety of artwork features.
I’ve finished some of the heavy (and not-so-heavy) lifting on the front end–writing a script to scrape the data, and pre-processing and cleaning the scraped data, which contains 53,034 auction result records for some 141 artists.
Now it’s time to embark on some good, old-fashioned EDA so I can get a better sense of what we’re working with here and gain a little intuition about how realized auction price may or may not correlate with some of the features that I’ve either scraped or engineered.
The Dataset and its Features
Here’s a sample of what the dataset looks like:
data.sample(10, random_state=123)
| | artist_name | title | date | medium | dims | auction_date | auction_house | auction_sale | auction_lot | price_realized | ... | auction_year | price_realized_USD_constant_2022 | area_cm_sq | volume_cm_cu | living | years_after_death_of_auction | artist_age_at_auction | artist_age_at_artwork_completion | artwork_age_at_auction | years_ago_of_auction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
31937 | Salvador Dali | The persistence of memory ,\nThe persistence o... | NaN | tapestry | 162.56 x 140.97 cm | Aug 12, 2017 | Michaan's Auctions • Alameda | August Estate Auction | Lot420 | US\$750 | ... | 2017 | 8.954441e+02 | 22916.083200 | NaN | 0 | 28.0 | NaN | NaN | NaN | 6 |
7514 | Gerhard Richter | ABSTRAKTES BILD 802-3 | NaN | oil on canvas | 112 by 102 cm | Sep 30, 2018 | Sotheby's | NaN | Lot1079 | HK\$27,720,000 • US\$3,540,939 | ... | 2018 | 4.126820e+06 | 11424.000000 | NaN | 1 | NaN | 86.0 | NaN | NaN | 5 |
18591 | Ed Ruscha | Regal | 2001 | dry pigment and acrylic on museum board | 101.9 x 152.4 cm | May 16, 2019 | Christie's • New York | Post-War and Contemporary Art Morning Session | Lot635 | US\$927,000 | ... | 2019 | 1.061153e+06 | 15529.560000 | NaN | 1 | NaN | 82.0 | 64.0 | 18.0 | 4 |
39383 | Takashi Murakami | Superflat Monogram | 2003 | acrylic on canvas mounted on board | 178.31 x 178.31 in | May 15, 2008 | Phillips de Pury & Company• New York | Contemporary Art Part I | Lot110 | US\$724,200 | ... | 2008 | 9.843836e+05 | 205125.112975 | NaN | 1 | NaN | 46.0 | 41.0 | 5.0 | 15 |
35906 | Bernard Buffet | L'atelier | 1949 | oil on canvas | 66.04 x 91.44 in | May 1, 1996 | Christie's • New York | Impressionist & Modern Paintings, Drawings & S... | Lot246 | US\$38,000 | ... | 1996 | 7.087884e+04 | 38959.261436 | NaN | 1 | NaN | 68.0 | 21.0 | 47.0 | 27 |
21611 | Chu Teh-Chun | Mirage èa l'aube (Mirage at Dawn) | 1920-2014 | oil on canvas | 130 x 195 cm | May 30, 2015 | Christie's • Hong Kong, HKCEC Grand Hall | Asian 20th Century & Contemporary Art (Evening... | Lot60 | HK\$15,640,000 • US\$2,026,991 | ... | 2015 | 2.502812e+06 | 25350.000000 | NaN | 0 | 1.0 | NaN | NaN | NaN | 8 |
25757 | Keith Haring | Untitled | 1983 | Oil on panel | 88.9 x 200.03 in | Dec 8, 1998 | Binoche • Paris | Binoche | Lot42 | NaN | ... | 1998 | NaN | 114726.654417 | NaN | 0 | 8.0 | NaN | 25.0 | 15.0 | 25 |
3388 | Andy Warhol | Key Service (Positive) | 1986 | synthetic polymer and silkscreen ink on canvas | 50.8 x 40.6 cm | Nov 12, 2012 | Christie's | Andy Warhol at Christie's Sold to Benefit the ... | Lot292 | US\$62,500 | ... | 2012 | 7.966644e+04 | 2062.480000 | NaN | 0 | 25.0 | NaN | 58.0 | 26.0 | 11 |
20841 | Pierre-Auguste Renoir | Paysage aux collettes | NaN | oil on canvas | 45.72 x 30.48 in | May 28, 1997 | Bukowskis• Stockholm | International Auction | Lot303 | SEK695,000 • US\$90,541 | ... | 1997 | 1.650921e+05 | 8990.598793 | NaN | 0 | 78.0 | NaN | NaN | NaN | 26 |
48705 | Julian Schnabel | Untitled | 1981 | ink, sand and gesso on torn paper | 96.52 x 124.46 in | Oct 14, 1998 | Christie's • Los Angeles | 20th Century & Contemporary Art | Lot100 | US\$17,000 | ... | 1998 | 3.052230e+04 | 77502.291447 | NaN | 1 | NaN | 47.0 | 30.0 | 17.0 | 25 |
10 rows × 45 columns
And here’s some background on the features it includes:
Scraped Features
Artwork Information
- `artist_name`: The artist’s name, as it appears in the original list of artist names input into the scraping script
- `title`: The artwork’s title
- `medium`: The artwork’s medium
- `date`: The artwork’s attributed date, which in some cases is a span (e.g., 1956-1958) or an estimate (e.g., 1920s, circa 1940s-1950s, 16th century, etc.)
- `dims`: The artwork’s attributed dimensions. Because these are paintings, in most cases there are two measurements for width and height, but in some cases objects include a depth measurement or a radius measurement (for circular works).
Auction Information
- `auction_date`: Date of auction, in `Month DD, YYYY` format
- `auction_house`: Name of auction house (e.g., Sotheby’s)
- `auction_sale`: Name of sale (e.g., Contemporary Evening Sale)
- `auction_lot`: Number of auction lot
- `price_realized`: Realized price in nominal currency. Includes transaction currency and, if not USD, conversion to USD
- `estimate`: Range of auction house estimate for the work
- `bought_in`: Whether or not work was bought in (i.e., artwork goes unsold)
Merged Features
The following features are merged from the Museum of Modern Art’s collection dataset:
- `Nationality`: The artist’s nationality
- `Gender`: The artist’s gender
- `birth_year`: Year of the artist’s birth
- `death_year`: Year of the artist’s death (when applicable)
Parsed Features
Dates
- `auction_date_parsed`: Conversion of `auction_date` field to DateTime object
- `start_date`: Year in which artwork was begun (identical to `end_date` in cases where `date` is a single year)
- `end_date`: Year in which artwork was completed (identical to `start_date` in cases where `date` is a single year)
Dimensions
- `dims_cm`: Extraction from `dims` of measurements denominated in cm
- `dims_mm`: Extraction from `dims` of measurements denominated in mm
- `dims_in`: Extraction from `dims` of measurements denominated in in
- `is_diameter`: Boolean for whether a given measurement is indicated to be a diameter
- `width_cm`: Width measurement extracted from `dims_cm` or computed from `dims_mm` or `dims_in`
- `height_cm`: Height measurement extracted from `dims_cm` or computed from `dims_mm` or `dims_in`
- `depth_cm`: Depth measurement extracted from `dims_cm` or computed from `dims_mm` or `dims_in`
- `width_mm`: Width measurement extracted from `dims_mm`
- `height_mm`: Height measurement extracted from `dims_mm`
- `depth_mm`: Depth measurement extracted from `dims_mm`
- `width_in`: Width measurement extracted from `dims_in`
- `height_in`: Height measurement extracted from `dims_in`
- `depth_in`: Depth measurement extracted from `dims_in`
Auction Information
- `auction_house_loc`: Location of auction (when applicable), as extracted from `auction_house`
- `auction_house_name`: Name of auction house, extracted from `auction_house`
- `price_realized_USD`: Nominal USD realized price, extracted from `price_realized`
- `auction_year`: Year, reformatted from `auction_date`
Engineered Features
Auction Information
- `price_realized_USD_constant_2022`: Conversion of `price_realized_USD` to constant 2022 dollars using the `cpi` library
Artwork
- `area_cm_sq`: Artwork size as surface area, computed from `width_cm` and `height_cm` (or `width_cm` if a diameter measurement)
- `volume_cm_cu`: Artwork size as volume for three-dimensional works, computed from `width_cm`, `height_cm`, and `depth_cm`
Artist
- `living`: Boolean for whether an artist was living at the time of auction
- `years_after_death_of_auction`: Number of years after artist’s death that the auction occurred (in cases when the artist was no longer alive at the time of auction)
- `artist_age_at_auction`: Artist’s age at the time of auction (in cases where artist was living at the time of auction)
- `artist_age_at_artwork_completion`: Artist’s age at the time of artwork’s completion. Proxy for stage of artist’s career.
- `artwork_age_at_auction`: Age of artwork in years at time of auction
- `years_ago_of_auction`: Years elapsed from auction to present
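To make the artist features concrete, here’s a minimal sketch of how the three living/deceased fields could be derived from the auction year and the merged birth and death years. The function name `artist_features` is hypothetical, and the handling of an auction held in the artist’s death year is my own assumption rather than something taken from the actual pipeline. The birth and death years in the usage lines are the artists’ publicly known dates, not values read from the dataset.

```python
def artist_features(auction_year, birth_year, death_year=None):
    """Derive living / years-after-death / age-at-auction for one record.

    Assumption: an auction held in the artist's death year counts as
    'living', since only years (not full dates) are compared here.
    """
    living = death_year is None or auction_year <= death_year
    return {
        "living": int(living),
        "years_after_death_of_auction": None if living else auction_year - death_year,
        "artist_age_at_auction": auction_year - birth_year if living else None,
    }


# Salvador Dali (1904-1989), auctioned in 2017; Gerhard Richter (born 1932), auctioned in 2018.
dali = artist_features(2017, birth_year=1904, death_year=1989)
richter = artist_features(2018, birth_year=1932)
```

Both results line up with the corresponding rows in the sample above: 28 years after death for the Dali lot, and an artist age of 86 for the Richter lot.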
Note that I won’t be working with most of the raw, scraped features.
A Note on Methodology: Constant vs. Nominal Dollars
In most of what follows, I’ve decided to do preliminary data analysis for patterns and trends using constant 2022 dollars rather than nominal dollars from each observation’s given auction year. My reason for doing this is to eliminate the inflation variable as much as possible so that we can attempt to measure realized price according to a single standard. Otherwise, any attempt to look for correlations between a certain variable and price realized would be confounded by auction date. For instance, consider an artwork sold in 1989 for a relatively high price and an artwork sold in 2020 for a relatively low price: due to inflation, these two nominal prices might be the same, and we will have lost the ability to see their difference. We want to eliminate this possibility to the extent that we can.
Not always, though. Ultimately I do want my model to predict prices in nominal amounts–that is, I want the model to predict the price for a work sold in 1990 in nominal 1990 dollars. But again, my sense is that I’ll have an easier time understanding general trends and patterns in the data if I adjust for inflation. As a result, I’ll use the `price_realized_USD_constant_2022` feature that I engineered so that I’m dealing with constant 2022 USD amounts.
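The adjustment itself boils down to a ratio of price indices. Here’s a minimal sketch of that arithmetic, assuming a lookup of annual index values keyed by year. The function name `to_constant_dollars` and the `cpi_by_year` mapping are hypothetical, and the index values in the usage lines are placeholders chosen for easy arithmetic, not real CPI figures.

```python
def to_constant_dollars(nominal, year, cpi_by_year, target_year=2022):
    """Rescale a nominal amount from `year` into `target_year` dollars.

    cpi_by_year is a hypothetical {year: index value} mapping.
    """
    return nominal * cpi_by_year[target_year] / cpi_by_year[year]


# Placeholder index values, not real CPI data: prices double from 1990 to 2022.
toy_cpi = {1990: 100.0, 2022: 200.0}
adjusted = to_constant_dollars(38_000, 1990, toy_cpi)
```

With an index that doubles between the two years, a nominal 38,000 becomes 76,000 in target-year dollars, while an amount already in 2022 dollars is unchanged.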
Takeaways
1. Price Correlates Strongly with Artist Name
Based on some limited domain experience, my first intuition is that artist name will be the single most important factor in determining price. Which makes sense: Warhol will fall into one price bracket, while new MFAs will fall into another price bracket.
Let’s take a look at the artists most represented in this dataset by auction count, and then compare their realized price distributions (using constant 2022 dollars):
The first thing to note here is that evidently auction results are not distributed normally–not by a long shot. The realized price distributions for all the artists here, Warhol and Picasso in particular, have an aggressive positive skew, with a huge number of outliers–including (I was shocked to discover) a Warhol work that sold for close to $200M.
Let’s check again without the fliers.
Once we get rid of the outliers, we can see more clearly just how much variance there is from one artist market to the next.
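For reference, here’s roughly what “fliers” means in a standard box plot: points that sit more than 1.5 times the interquartile range beyond the box. This is a minimal sketch assuming that default convention (matplotlib’s `boxplot` uses it, and `showfliers=False` hides those points). The function name `flier_fences` and the toy values are illustrative only.

```python
from statistics import quantiles


def flier_fences(values, whis=1.5):
    """Return (low, high) fences; points outside them count as fliers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr


prices = [1, 2, 3, 4, 100]  # toy values, not taken from the dataset
low, high = flier_fences(prices)
fliers = [p for p in prices if p < low or p > high]
```

Hiding fliers only changes what is drawn; the quartiles and medians are still computed from all of the data, outliers included.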
2. Individual Artist Markets Vary
Another way we can examine this question of individual artist markets is to look at whether the correlations between realized price and certain features–painting size, for instance, or painting age at the time of auction–behave differently according to artist. In other words, perhaps for some artists size correlates strongly with realized price while for others it may not. Again, in order to resolve the inflation issue (since we’re looking at auction results from a nearly 40-year period), I’ll use constant 2022 dollars as the target variable.
Interestingly, we can see that, for an artist like Damien Hirst, size (width in particular) correlates relatively strongly with realized price, while for an artist like Bernard Buffet or Sam Francis, the correlation is much less pronounced. We can also see that for an artist like Zao Wou-Ki, realized price increases with the artist’s age, whereas for Francis or Buffet, the opposite is true.
While the artist’s name can of course be included in the model as a feature, to keep things simple for starters my approach will be to try to model an individual artist first. Intuitively this feels especially important since some features are correlated positively for certain artists and negatively for others.
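A per-artist comparison like this amounts to splitting the records by artist and computing a correlation coefficient within each group. Below is a minimal sketch using Pearson’s r in plain Python; the function names and the toy rows are hypothetical, and the choice of Pearson is my assumption about the kind of correlation being compared.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_artist(rows):
    """rows: iterable of (artist_name, feature_value, price) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for artist, feature, price in rows:
        grouped[artist][0].append(feature)
        grouped[artist][1].append(price)
    return {artist: pearson(xs, ys) for artist, (xs, ys) in grouped.items()}


# Toy rows, not taken from the dataset: one artist whose prices rise with
# the feature and one whose prices fall with it.
rows = [
    ("Artist A", 1.0, 10.0), ("Artist A", 2.0, 20.0), ("Artist A", 3.0, 30.0),
    ("Artist B", 1.0, 30.0), ("Artist B", 2.0, 20.0), ("Artist B", 3.0, 10.0),
]
by_artist = correlation_by_artist(rows)
```

Pooling the two toy artists would give a correlation of zero even though each one has a perfect (opposite) relationship, which is the intuition behind modeling artists individually. In pandas, the same split could be expressed with `groupby('artist_name')` and `Series.corr`.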
3. Prices are Logarithmic
Because the realized price for artworks has such an aggressively positive skew, it turns out looking at the log of realized price effectively normalizes the distribution.
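One way to put a number on this is to compare skewness before and after the log transform. Here’s a minimal sketch using the population (Fisher-Pearson) skewness on toy values that span several orders of magnitude; the function name and the values are illustrative, not drawn from the dataset.

```python
from math import log10


def skewness(values):
    """Population (Fisher-Pearson) skewness: third central moment / std**3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


prices = [10, 100, 1_000, 10_000, 100_000]  # toy values, one per order of magnitude
raw_skew = skewness(prices)
log_skew = skewness([log10(p) for p in prices])
```

The raw values are strongly right-skewed, while their base-10 logs are evenly spaced and so have no skew at all. On real data, missing or zero prices would need to be dropped before taking the log.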
4. Artwork Size is Logarithmic, Too
Artwork size (width, height, and area) has a similar positive skew which can be remedied with a logarithmic scale.
Compare that with the same distributions on logarithmic scales.
5. Artwork Size and Price Have a Moderate Positive Correlation
Knowing that artwork dimensions and price need to be plotted on logarithmic scales, let’s see if there’s any meaningful correlation between the two.
Generally, yes, it appears there is some positive correlation between size and price realized.
6. Realized Price Varies by Artist Nationality
How does realized price vary with artist nationality?
There do seem to be some differences here, but because I intend to create models for each artist, this feature won’t really matter in the end. But still interesting to see!
7. Realized Price Doesn’t Vary Much by Gender
How does realized price vary by gender? First let’s check to see how many women artists this dataset contains:
# Count the number of distinct artists for each gender
cols = ['Gender', 'artist_name']
data.drop_duplicates(subset=cols)['Gender'].value_counts()
Male 129
Female 12
Name: Gender, dtype: int64
Not a huge sample, unfortunately, but let’s see.
There is some difference here, but the median realized price is quite close for men and women. Prices for male artists, however, have much more variability as the lower chart shows.
Like `Nationality`, this feature won’t really come into play since I’ll be making artist-specific models.
8. Realized Price Varies by Artist’s Generation
How does realized price vary by artist generation? To do this, I’ll divide artists into decades by their birth year. For artists born prior to 1800, of which there are a couple in this dataset, I’ll lump them into a ‘pre-1800’ category.
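The binning described here can be sketched as a small labeling function. The name `birth_decade` and the exact label strings are my own choices for illustration; missing birth years would need to be dropped or handled separately before applying it.

```python
def birth_decade(year):
    """Label a birth year by decade, lumping anything before 1800 together."""
    if year < 1800:
        return "pre-1800"
    # int() tolerates float years such as 1853.0
    return f"{int(year) // 10 * 10}s"


labels = [birth_decade(y) for y in (1606, 1853, 1928.0)]
```

Applied to a column, this would be something like `data['birth_year'].map(birth_decade)`.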
I’m not sure how useful this information is, since the differences we see can easily be attributed to the artists and the specifics of their markets. For instance, it turns out that, in this dataset, there’s only one artist who was born in the 1850s, and that’s Van Gogh, who evidently fetches consistently high prices. But as with `Gender` and `Nationality`, this feature won’t be a concern of mine when building artist-specific models.
9. Realized Price and Artwork Date are Negatively Correlated (Older Works Sell for More)
How does price correlate with an artwork’s completion date? Are certain periods of art production more valuable than others? There are a few works in this dataset from prior to 1800–I’ll do without those so we can focus on work made from ~1850 to present, which is where the bulk of our data is.
Here we see a slight negative correlation between artwork year and price, indicating a value premium put on older works vs. newer ones–makes sense.
10. Realized Price and Auction Year are Positively Correlated (Artist Markets Accrue in Value)
How does price correlate with auction year? Because we’re using constant 2022 dollars, any changes we see should be a function not of inflation but of value increasing over time.
As expected, here we can see a slight positive correlation between auction year and realized price, suggesting, again, that artist values are increasing over time in aggregate.
11. Realized Price Varies by Auction House
Do different auction houses correlate with different price ranges? To look into this, I’ll reduce the cardinality of the `auction_house_name` feature so that we’re looking at the main players and a catch-all category for everyone else.
There appear to be clear differences here, so auction house looks like a valuable predictor of price. But I’ll want to reduce the cardinality, as I have above, for each individual artist market, since not all artists will have this same proportion of auction house representation.
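Reducing cardinality this way means keeping the most frequent categories and folding everything else into one bucket. Here’s a minimal sketch of the idea; the function name `lump_rare`, the `top_n` cutoff, and the `"Other"` label are assumptions for illustration, not the values actually used.

```python
from collections import Counter


def lump_rare(labels, top_n=5, other="Other"):
    """Keep the top_n most frequent labels; replace the rest with `other`."""
    keep = {label for label, _ in Counter(labels).most_common(top_n)}
    return [label if label in keep else other for label in labels]


houses = ["Christie's", "Sotheby's", "Christie's", "Bukowskis", "Binoche", "Sotheby's"]
reduced = lump_rare(houses, top_n=2)
```

Calling this on one artist’s records at a time, rather than on the whole dataset, yields a different set of “main players” per artist market.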
12. Realized Price Varies by Auction Location
We have some data for auction location in this dataset. Let’s see if that has any bearing on price.
Here, too, we can see important trends, since certain locations correlate with higher or lower prices.
13. Dead Artists Fetch Higher Prices than Living Artists (but it’s complicated)
How does whether or not an artist is living at the time of auction affect an artwork’s price? It feels rather obvious to me that prices will go up after an artist is no longer living–not only because there is no more work being created, but also because this implies that the artwork itself is older, which we’ve seen correlates positively with price.
No surprises here.
I am curious, though, if there are trends when we examine prices as a function of how many years before or after an artist’s death the auction took place:
This is interesting, since it helps us see that median price does rise during an artist’s lifetime. For some unexpected reason, there is a precipitous drop in realized price immediately following an artist’s death–my suspicion is that collectors aren’t selling so much and, if they are, not major works. And within about 25 years, median prices have recovered and continue to rise.
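A view like this can be built by measuring each auction as a signed offset from the artist’s death year (negative before, positive after) and taking the median price within bins of that offset. Below is a minimal sketch under those assumptions; the function name, the 5-year bin width, and the toy records are hypothetical. Note that offsets before death are only defined for artists who have since died.

```python
from collections import defaultdict
from statistics import median


def median_price_by_death_offset(records, bin_width=5):
    """records: iterable of (auction_year, death_year, price) tuples.

    Returns {bin_start: median price}, where bin_start is the lower edge of
    the bin containing auction_year - death_year (negative = before death).
    """
    buckets = defaultdict(list)
    for auction_year, death_year, price in records:
        offset = auction_year - death_year
        bin_start = (offset // bin_width) * bin_width  # floors negatives too
        buckets[bin_start].append(price)
    return {start: median(prices) for start, prices in sorted(buckets.items())}


# Toy records, not taken from the dataset, for an artist who died in 1987.
records = [
    (1985, 1987, 100), (1986, 1987, 300),  # before death
    (1988, 1987, 50), (1990, 1987, 70),    # shortly after death
    (2015, 1987, 500),                     # decades later
]
medians = median_price_by_death_offset(records)
```

Medians are used here rather than means so that a handful of record-setting lots don’t dominate any one bin.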
Here’s another way of considering this:
What’s interesting to note here is that prices generally seem to rise more quickly over the course of an artist’s lifetime than they do after his/her death.
14. Artist Age at Auction and Price Realized are Positively Correlated
No surprises here. As an artist ages, auction prices go up, which makes sense since the artist’s legacy is that much more secure in addition to the fact that his/her oeuvre is accruing value over time, independent of inflation, which we’ve already seen.
15. Realized Price and Artist Age at Artwork Completion are Mostly Uncorrelated
What about how an artist’s age at the time a given artwork was completed correlates with realized price?
I don’t see any meaningful correlation here really, but my intuition is that this may be correlated, negatively or positively, for different artists where the market favors, for instance, early career work or late career work, etc.
16. Realized Price and Artwork Age at Auction are Positively Correlated
And this, too, looks like what we’d expect: Older artworks fetch higher prices.