Can Meta’s audio aesthetic model actually rate the quality of music?
Last year, Meta released Audiobox Aesthetics (AES), a research model that scores audio by predicting how people would rate it. The model outputs four scores: Production Quality (PQ), Production Complexity (PC), Content Enjoyment (CE), and Content Usefulness (CU).
The study suggests that audio aesthetics can be broken into these axes, and that a reference-free model can predict these scores directly from audio. If that holds, the scores could start informing decisions and become signals people lean on when judging music at scale.
I took a closer look to understand how the model frames aesthetic judgment and what this means in practice. I ran Audiobox Aesthetics myself and examined how its scores behave with real music.
What Meta’s Audiobox Aesthetics paper claims
Before jumping into my evaluation, let’s take a closer look at what Meta’s Audiobox Aesthetics paper set out to do.
The paper introduces a research model intended to evaluate audio automatically when no reference version exists. The authors describe human listening evaluations as costly and inconsistent, and they present the model as an automated alternative to those judgments.
To address this need, the authors propose breaking audio evaluation into four separate axes and predicting a separate score for each:
- Production Quality (PQ) looks at technical execution, focusing on clarity and fidelity, dynamics, frequency balance, and spatialization.
- Production Complexity (PC) reflects how many sound elements are present in the audio.
- Content Enjoyment (CE) reflects how much listeners enjoy the audio, including their perception of artistic skill and overall listening experience.
- Content Usefulness (CU) considers whether the audio feels usable for creating content.
The model is trained using ratings from human listeners who follow the same guidelines across speech, music, and sound effects. It analyzes audio in short segments of around 10 seconds. For longer tracks, the model scores each segment independently and provides an average.
Beyond the audio itself, the model has no additional context. It does not know how a track is meant to be used or how it relates to other music. According to the paper, the scores tend to align with human ratings and could help sort audio when it’s not possible to listen to it all. In that way, the model is presented as a proxy for listener judgment.
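To make the segment-and-average behavior concrete, here is a minimal sketch of that scheme in Python. It is an illustration under stated assumptions, not the released implementation: the `score_segment` callable is a hypothetical stand-in for whatever per-segment inference call the package exposes, `soundfile` is assumed for loading audio, and the 10-second window simply follows the paper’s description.

```python
import numpy as np
import soundfile as sf  # assumed to be available for loading audio files

SEGMENT_SECONDS = 10  # the paper describes roughly 10-second analysis windows
AXES = ("PQ", "PC", "CE", "CU")

def score_track(path, score_segment):
    """Score a long track by averaging independent per-segment predictions.

    `score_segment` is a hypothetical callable standing in for the model's
    per-segment inference; it is assumed to take (samples, sample_rate) and
    return a dict with the four AES axes.
    """
    samples, sample_rate = sf.read(path)
    hop = SEGMENT_SECONDS * sample_rate
    segment_scores = [
        score_segment(samples[start:start + hop], sample_rate)
        for start in range(0, len(samples), hop)
    ]
    # One number per axis: the mean over all segments of the track.
    return {axis: float(np.mean([s[axis] for s in segment_scores])) for axis in AXES}
```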
Why I decided to evaluate the model
I wasn’t the only one who was curious to look into this model. Jeffrey Anthony’s “Can AI Measure Beauty? A Deep Dive into Meta’s Audio Aesthetics Model,” for instance, offers a philosophical examination of what it means to quantify aesthetic judgment, touching on questions of ontology. I decided to take a more hands-on approach, testing the model on real-world examples to see what patterns emerge in its predictions.
What caught my attention most was how these scores are meant to be used. Once aesthetic judgments are turned into numbers, they start to feel reliable. They look like something you can sort by, filter on, or use to decide what gets heard and what gets ignored.
This matters in music workflows. Scores like these could influence how catalogs are cleaned up, how tracks are ranked for sync, and how large libraries of music are evaluated without listening. With a skeptical but open mindset, I set out to discover how these scores behave with real-world data.
What I found when testing the model
A) Individual-track sanity checks
I began with a qualitative sanity check using individual songs whose perceptual differences are unambiguous to human listeners. The tracks I selected represent distinct production conditions, stylistic intentions, and levels of artistic ambition.
I included four songs:
- “Giorgio by Moroder” by Daft Punk as a high-end studio reference recording
- A deliberately degraded, low-fidelity MP3 of “Funky Town” as the low-quality version of a well-known pop track
- “The Schumacher Song” by DJ Visage as a late-1990s pop-trance track of questionable artistic depth
- “Blue Calx” by Aphex Twin as an experimental electronic composition with sparse textures and intricate sound design
The motivation for this test was straightforward. A model claiming to predict Production Quality should assign a lower PQ to “Funky Town” (low-quality MP3) than to “Giorgio by Moroder.” A model claiming to estimate production or musical complexity should recognize “Blue Calx” by Aphex Twin as more complex than formulaic late-90s pop-trance such as DJ Visage’s “Schumacher Song.” Likewise, enjoyment and usefulness scores should not collapse across experimental electronic music, audiophile-grade disco-funk, old-school pop-trance, and degraded consumer audio.
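These expectations can be written down as explicit checks before looking at any output. Below is a minimal sketch of that framing, assuming a `scores` dictionary produced by whatever scoring pipeline is used (for instance, the segment-averaging sketch above with the model’s inference call plugged in); the track labels are illustrative.

```python
def check_expectations(scores):
    """Sanity checks against listener intuition.

    `scores` is assumed to map an illustrative track label to its four AES
    axes, e.g. scores["giorgio"]["PQ"]. How the scores were produced is up
    to the caller.
    """
    # A Production Quality model should penalize the degraded MP3
    # relative to the audiophile-grade studio reference.
    assert scores["funkytown_lq"]["PQ"] < scores["giorgio"]["PQ"]

    # A Production Complexity model should rate intricate sound design
    # above formulaic late-90s pop-trance.
    assert scores["blue_calx"]["PC"] > scores["schumacher"]["PC"]
```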
The resulting scores, shown in the individual-track comparison plot above, contradict these expectations. “Funky Town” receives a PQ score only slightly lower than “Giorgio by Moroder,” indicating near-insensitivity to codec degradation and mastering fidelity. Even more strikingly, “Blue Calx” is assigned the lowest Production Complexity among the four tracks, while “The Schumacher Song” and “Funky Town” receive higher PC scores. This directly inverts what most listeners would consider to be structural or compositional complexity.
Content Enjoyment is highest for “Funky Town” and lowest for “Blue Calx,” suggesting that the CE dimension aligns more closely with catchiness or familiarity than with artistic merit or aesthetic depth.
Taken together, these results indicate that AES is largely insensitive to audio fidelity. It fails to reflect musical or structural complexity, and instead appears to reward constant spectral activity and conventional pop characteristics. Even at the individual track level, the semantics of Production Quality and Production Complexity don’t match their labels.
B) Artist-level distribution analysis
Next, I tested whether AES produces distinct aesthetic profiles for artists whose musical identities, production aesthetics, and historical contexts are clearly different. I analyzed distributions of Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness for Johann Sebastian Bach, Skrillex, Dream Theater, The Clash, and Hans Zimmer.
If AES captures musically meaningful aesthetics, we would expect to see systematic separation between these artists. For example, Hans Zimmer and Dream Theater might have a higher complexity score than The Clash. Skrillex’s modern electronic productions might have a higher quality score than early punk recordings. Bach’s works might show high complexity but variable enjoyment or usefulness depending on the recording and interpretation.
Instead, the plotted distributions show strong overlap across artists for CE, CU, and PQ, with only minor shifts in means. Most scores cluster tightly within a narrow band between approximately 7 and 8, regardless of artist. PC exhibits slightly more variation, but still fails to form clear stylistic groupings. Bach, Skrillex, Dream Theater, and Hans Zimmer largely occupy overlapping regions, while The Clash is not consistently separate.
This suggests that AES doesn’t meaningfully encode artist-level aesthetic or production differences. Despite extreme stylistic diversity, the model assigns broadly similar aesthetic profiles, reinforcing the interpretation that AES functions as a coarse estimator of acceptability or pleasantness rather than a representation of musical aesthetics.
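For readers who want to run the same kind of comparison on their own material, the aggregation step is straightforward once per-track scores exist. Here is a minimal sketch with pandas, assuming a hypothetical `aes_scores_by_artist.csv` file with one row per track and columns for the artist name and the four axes:

```python
import pandas as pd

# Hypothetical input: one row per track, with the four AES axes already computed.
df = pd.read_csv("aes_scores_by_artist.csv")  # columns: artist, PQ, PC, CE, CU

# Per-artist mean and standard deviation for each axis. Heavy overlap in these
# summaries mirrors what the distribution plots in this section show.
summary = df.groupby("artist")[["PQ", "PC", "CE", "CU"]].agg(["mean", "std"])
print(summary.round(2))
```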
C) Bias analysis using a balanced gender-controlled dataset
Scoring models are designed to rank, filter, and curate songs in large music catalogs. If these models encode demographic-correlated priors, they can silently amplify existing biases at scale. To test this risk, I analyzed whether AES exhibits systematic differences between tracks with female lead vocals and tracks without female lead vocals.
In our 2025 ISMIR paper, we showed that common music embedding models pick up non-musical singer traits, such as gender and language, and exhibit significant bias as a result. Because AES is intended to judge quality, aesthetics, and usefulness, it would be particularly problematic if it had similar biases. They could directly influence which music is considered “better” or more desirable.
I constructed a balanced dataset following the same methodology as our 2025 paper, equalizing genre distribution and singer language across the two groups.
For each group, I computed score distributions for Content Enjoyment, Content Usefulness, Production Complexity, and Production Quality, visualized them, and performed statistical testing using Welch’s t-test alongside Cohen’s d effect sizes. For context, Welch’s t-test is a statistical test that compares whether the average scores between two groups are significantly different. Cohen’s d is a measure of effect size that quantifies how large that difference is in standardized units.
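For transparency, this is roughly how that comparison can be computed. The sketch below assumes two NumPy arrays holding per-track scores for a single axis, one per group; Welch’s t-test comes from `scipy.stats.ttest_ind` with `equal_var=False`, and Cohen’s d is calculated with a pooled standard deviation.

```python
import numpy as np
from scipy import stats

def welch_and_cohens_d(group_a, group_b):
    """Compare one AES axis (e.g. CE) between two groups of tracks."""
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)

    # Welch's t-test: does not assume equal variances between the groups.
    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

    # Cohen's d with a pooled standard deviation as the standardizer.
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = np.sqrt(
        ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1))
        / (n_a + n_b - 2)
    )
    d = (group_a.mean() - group_b.mean()) / pooled_sd
    return t_stat, p_value, d

# Usage with placeholder arrays of per-track CE scores for the two groups:
# t, p, d = welch_and_cohens_d(ce_female_lead, ce_no_female_lead)
```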
The results show consistent upward shifts for female-led tracks in CE, CU, and PQ. All three differences are statistically significant with small-to-moderate effect sizes. In contrast, there is virtually no difference in Production Complexity score between groups.
This pattern indicates that the model systematically assigns higher enjoyment, usefulness, and quality scores to material with female vocals, even under controlled conditions. Because complexity remains unaffected, the effect doesn’t appear to stem from structural musical differences. Instead, it likely reflects correlations in training data and human annotations, or the model treating certain vocal timbres and production styles associated with female vocals as implicit quality indicators.
These findings suggest that AES encodes demographic-correlated aesthetic priors, which is problematic for a model intended to judge musical quality, aesthetics, and usefulness.
Why this matters for the industry
Economist Charles Goodhart gave his name to a law often summarized as “when a measure becomes a target, it ceases to be a good measure.” It describes what happens when a metric starts to drive decisions rather than simply indicating them. Once a number is relied on, it begins to shape how people think and choose.
That idea applies directly to aesthetic scoring. A score, once it exists, carries weight. It gets used as a shortcut in decisions, even when its meaning is incomplete. This matters in music workflows because aesthetic judgment depends on context and purpose.
When a simplified score is treated as reliable, systems can start favoring what scores well rather than what actually sounds better or serves a creative goal. Over time, that can quietly steer decisions away from how audio is perceived and used in practice.
How we approach audio intelligence at Cyanite
At Cyanite, music isn’t judged in a vacuum, and neither are the decisions built on top of it. That’s why we don’t rely on single aesthetic scores. Instead, we focus on making audio describable and searchable in ways that stay transparent and grounded in context.
Aesthetic scoring can give the illusion of precision, but it often lumps together different technical qualities, genres, and styles. In music search and discovery, a single score doesn’t explain why a track is surfaced or excluded. That reasoning matters to us. Not to decide what’s “good,” but to give teams tools they can understand and trust.
We see audio intelligence as a way to expose structure, not replace judgment. Our systems surface identifiable musical attributes and relationships, knowing that the same track can be the right or wrong fit depending on how it’s used. The goal is to support human decision-making, not substitute it with scores.
Experimentation has a place, but in music, automation works best when it is explainable and honest about its own limits.
What responsible progress in music AI should look like
Progress in music and AI is underpinned by transparency. Teams should be able to understand how a model was trained and how its outputs relate to the audio. When results are interpretable, people can see why a track surfaces and judge for themselves whether the signal makes sense in their own context.
That transparency depends on data choices. Music spans styles, cultures, eras, and uses, and models reflect whatever they are fed. Developers need to work with broad, representative data and be clear about where coverage is thin. Being open about what a model sees, and what it does not, makes its behavior more predictable and its limits easier to manage.
Clear communication matters just as much once tools are in use. For scores and labels to be applied responsibly, teams need a shared understanding of what those signals reflect and where their limits are. Otherwise, even well-intentioned metrics can be stretched beyond what they are able to support.
This kind of openness helps the industry build tools people can understand and trust in real workflows.
We explored how these expectations show up in practice in “The state of AI transparency in music 2025,” a report developed with MediaTracks and Marmoset on how music licensing professionals make decisions around AI, creator background, and context. You can read the full report here.
So… does Meta’s model provide meaningful ratings for music?
Based on these tests, the answer is no. The model produces stable scores, but they don’t map cleanly to how musical quality or complexity are assessed in real catalog work. Instead, the model appears to align more with easily detectable production traits than with the distinctions people consistently make when judging music in context.
That doesn’t make Audiobox Aesthetics insignificant. It can support research by defining a clear scoring framework, showing how reference-free predictors can be trained across speech, music, and sound, and making its models and data available for inspection and comparison. It also illustrates where AES scores can be useful, particularly when large volumes of audio need to be filtered or monitored but full listening is impractical.
Problems emerge when scores like these begin shaping decisions. When a score is presented as a measure of quality, people need to know what it’s actually measuring so they can judge whether it applies to their use case. Without that clarity, it becomes easy to trust the number even when it’s not a good fit.
At Cyanite, we see this as a reminder of the importance of responsibility in music and AI. Progress is driven by systems that stay grounded in real listening behavior and make their assumptions visible.