
Getting it right, like a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
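For illustration, a benchmark task record might look something like the minimal sketch below. The field names and sampling helper are assumptions for the sake of the example, not ArtifactsBench’s actual schema.

```python
# Hypothetical task record; field names are illustrative assumptions.
from dataclasses import dataclass
import random

@dataclass
class Task:
    task_id: str
    category: str  # e.g. "visualisation", "web_app", "mini_game"
    prompt: str    # the natural-language request handed to the model

def sample_task(catalogue: list[Task]) -> Task:
    """Draw one of the roughly 1,800 challenges from the catalogue."""
    return random.choice(catalogue)
```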
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
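The article doesn’t describe how the sandbox is implemented. As a rough sketch of one common approach, the snippet below runs the generated artifact inside a throwaway Docker container with networking disabled; the container image name and its entry command are placeholders, not ArtifactsBench’s actual tooling.

```python
# Sketch of sandboxed execution, assuming the generated artifact is a
# self-contained HTML/JS file. The image name and "validate" command are
# hypothetical placeholders.
import pathlib
import subprocess
import tempfile

def run_in_sandbox(code: str, timeout_s: int = 30) -> pathlib.Path:
    workdir = pathlib.Path(tempfile.mkdtemp())
    artifact = workdir / "index.html"
    artifact.write_text(code)
    # One common isolation choice: an ephemeral container, no network,
    # with the artifact mounted read-only.
    subprocess.run(
        ["docker", "run", "--rm", "--network=none",
         "-v", f"{workdir}:/app:ro",
         "headless-browser-image",  # placeholder image
         "validate", "/app/index.html"],  # placeholder command
        timeout=timeout_s, check=False,
    )
    return artifact
```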
To see how the code behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
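One way to implement that timed capture is with a headless browser driver such as Playwright. The sketch below is an assumption about tooling, not a description of ArtifactsBench’s internals.

```python
# Sketch of timed screenshot capture with Playwright (an assumed tool;
# the source doesn't name the browser-automation library used).
from playwright.sync_api import sync_playwright

def capture_states(url: str, n_frames: int = 5, interval_ms: int = 1000) -> list[str]:
    paths = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Optionally drive an interaction first, so post-click state is
        # captured too, e.g. page.click("button") with a known selector.
        for i in range(n_frames):
            path = f"frame_{i}.png"
            page.screenshot(path=path)  # snapshot of the current visual state
            paths.append(path)
            page.wait_for_timeout(interval_ms)  # let animations play out
        browser.close()
    return paths
```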
Finally, it hands all of this evidence – the original request, the AI’s code, and the screenshots – to a Multimodal LLM (MLLM) to act as a judge.
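Assembling that evidence bundle for a multimodal judge could look roughly like the following. This assumes an OpenAI-compatible endpoint with a placeholder model name; the actual judge model behind ArtifactsBench may differ.

```python
# Sketch of passing the request, code, and screenshots to an MLLM judge.
# Assumes an OpenAI-compatible multimodal API; model name is a placeholder.
import base64
from openai import OpenAI

def judge(request: str, code: str, screenshot_paths: list[str]) -> str:
    client = OpenAI()
    content = [{"type": "text",
                "text": f"Original request:\n{request}\n\nGenerated code:\n{code}"}]
    for path in screenshot_paths:
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```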
This MLLM judge isn’t just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring includes functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
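A minimal sketch of how such checklist scores might be aggregated is below, assuming equal weighting across the ten criteria; the actual benchmark may weight criteria differently per task.

```python
# Sketch of per-task checklist aggregation. Equal weighting is an
# assumption; the source only names functionality, user experience,
# and aesthetic quality among the ten dimensions.
def aggregate(per_criterion: dict[str, float]) -> float:
    """Average ten per-criterion scores into one task score."""
    assert len(per_criterion) == 10, "checklist has ten metrics"
    return sum(per_criterion.values()) / len(per_criterion)
```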
The crucial question is: does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a huge leap from older automated benchmarks, which only managed around 69.4% consistency.
On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
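The article doesn’t spell out how that consistency figure is computed. One common choice for comparing two leaderboards is pairwise ranking agreement – the share of model pairs that both rankings order the same way – sketched below under the assumption of no tied ranks.

```python
# Sketch of pairwise ranking agreement between two leaderboards.
# This metric is an assumption, not necessarily what ArtifactsBench reports.
from itertools import combinations

def pairwise_consistency(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    models = rank_a.keys() & rank_b.keys()
    pairs = list(combinations(models, 2))
    # A pair agrees if both rankings order the two models the same way.
    agree = sum(
        (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n]) > 0
        for m, n in pairs
    )
    return agree / len(pairs)
```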
Source: https://www.artificialintelligence-news.com/