Inflection, a well-funded AI startup aiming to create “personal AI” for everyone, has unveiled the large language model powering its Pi conversational agent. It’s hard to evaluate the quality of these models in any way, let alone objectively and systematically, but a little competition is a good thing.
Inflection-1, as the model is called, is roughly comparable to GPT-3.5 (aka ChatGPT) in size and capabilities, as measured by the computing power used to train them. The company claims that it is competitive with or superior to other models in this class, backing this up with a “technical memo” describing some benchmarks it ran on its own model alongside GPT-3.5, LLaMA, Chinchilla, and PaLM-540B.
According to the results it published, Inflection-1 does perform well on a variety of measures, such as middle school and high school level exam tasks (think Biology 101) and “common sense” benchmarks (e.g. “if Jack throws the ball onto the roof, and Jill throws it back down, where is the ball?”). It lags mainly in coding, where GPT-3.5 handily beats it and, for comparison, GPT-4 leads the pack; OpenAI’s largest model is known to have been a huge leap in quality, so that is no surprise.
Inflection notes that it hopes to publish results for a larger model comparable to GPT-4 and PaLM-2(L), but no doubt it is waiting until those results are worth publishing. Either way, Inflection-2 or Inflection-1-XL, or whatever it ends up being called, is in the oven but not yet fully baked.
So far, the community has not formally sorted AI models into the machine learning equivalent of boxing weight classes, but the concepts do map well onto each other. You wouldn’t expect a flyweight to go up against a heavyweight; they’re practically different sports. The same goes for AI models: small models are not as capable as large ones, but they can run efficiently on phones, while large models require data centers. It’s an apples-to-oranges comparison.
It’s too early to attempt a classification like this, since the field is relatively young and there’s no real consensus on what size and shape of AI model should count as a featherweight.
Ultimately, of course, for most of these models the proof of the pudding is in the tasting, and until Inflection opens up its model to widespread use and independent evaluation, all of its vaunted benchmarks must be taken with a grain of salt. If you want to give Pi a try, you can add it on one of your messaging apps, or chat with it live here.