Ming Min, from Aofei Temple
QbitAI | Official account QbitAI
DALL-E 2, world-famous for its superb drawing skills, is now being questioned over its language skills.
Take the polysemous word "bat," for example. I put it to the test:
A bat is flying over a baseball stadium.
The result: in the picture it draws, both animal bats and baseball bats are flying through the sky.
And this is no accidental slip: if you type "a person is hearing a bat," the output still contains both the animal and the baseball bat.
In another case, type "a fish and a gold ingot."
Well, DALL-E 2 simply renders both things in gold, turning the fish into a literal goldfish.
These mistakes should not be underestimated: they imply that, when generating images from text, DALL-E 2 violates a fundamental mapping between symbols and entities in language.
That is, one word corresponds to one entity.
In the case of "bat," drawing either the animal or the baseball bat would count as a correct understanding, but drawing both is a problem.
It's like a multiple-choice question: answering A or answering B is correct, but writing down both breaks the rules.
What's more, it sometimes mixes up the modifiers of different objects, applying "the solution of the previous problem to the next one."
Scholars from Bar-Ilan University and the Allen Institute for Artificial Intelligence spotted this problem and wrote a paper analyzing it.
Interestingly, researcher Yoav Goldberg also noted that this behavior is uncommon in DALL-E mini and Stable Diffusion.
One guess is that this may be due to so-called inverse scaling.
Put simply: the larger the model, the worse the performance on certain tasks.
What does the paper say?
After discovering the problem, the scholars ran repeated experiments and divided the failures into three main cases:
- First, one word is interpreted as two different entities
- Second, one word is interpreted as a modifier of two different things
- Third, one word is interpreted as one entity while also being understood as a modifier of another
The first two cases have already been mentioned at the beginning.
In the third case, for example, if you enter "a zebra and a street," the output always contains a zebra crossing.
Here, DALL-E 2 interprets "zebra" twice at the same time: once as the animal and once as part of "zebra crossing."
After repeated trials across all these cases, the authors measured that DALL-E 2 made errors more than 80 percent of the time in each of the three cases.
The second case had the highest error rate, at 97.2 percent.
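The error counting described above boils down to a simple check: a generation counts as a duplication error when more than one sense of the homograph shows up in the image. A minimal sketch of that check (the paper relied on human judgments of the images; the sense labels below are made up for illustration):

```python
def is_duplication_error(senses, detected_labels):
    """Return True when more than one sense of a homograph appears.

    senses: dict mapping a sense name to the set of object labels
            that would indicate that sense in the image.
    detected_labels: object labels found in the generated image
            (e.g., by a human rater; labels here are illustrative).
    """
    present = [name for name, labels in senses.items()
               if labels & set(detected_labels)]
    return len(present) > 1

# "bat" as in the article's example: animal vs. baseball bat.
bat_senses = {
    "animal": {"bat (animal)"},
    "sports": {"baseball bat"},
}

# An image containing both senses is a duplication error ...
print(is_duplication_error(bat_senses, ["baseball bat", "bat (animal)"]))  # True
# ... while either sense alone counts as an acceptable interpretation.
print(is_duplication_error(bat_senses, ["baseball bat"]))  # False
```

The per-case error rate is then just the fraction of generations for which this check returns True.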
In the third case, the mistake can be avoided by adding a distinguishing modifier to the other noun.
That is, enter "a zebra and a gravel road," and no zebra crossing appears on the road surface.
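The workaround amounts to a simple prompt rewrite: swap the noun the homograph could latch onto ("street") for one it cannot plausibly modify ("gravel road"). A toy sketch of the idea (the rewrite table is illustrative, based on the article's example, not taken from the paper):

```python
# Rewrites that block the unwanted second interpretation of a homograph.
# Illustrative example from the article, not the paper's stimulus set.
SAFE_REWRITES = {
    "a zebra and a street": "a zebra and a gravel road",  # avoids "zebra crossing"
}

def disambiguate(prompt):
    """Return a rewrite that avoids homograph duplication, if one is known."""
    return SAFE_REWRITES.get(prompt, prompt)

print(disambiguate("a zebra and a street"))  # a zebra and a gravel road
```

Prompts with no known rewrite pass through unchanged.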
In DALL-E mini and Stable Diffusion, these duplicated interpretations are uncommon.
The authors suggest that future work could examine the model's text encoder to trace these problems, and study whether they are related to model size and architecture.
One of the authors, Yoav Goldberg, is a professor at Bar-Ilan University and research director of the Israel branch of the Allen Institute for Artificial Intelligence.
Previously, he was a postdoc at Google Research in New York. His research interests are NLP and machine learning, especially syntactic parsing.
DALL-E 2 has also been found to invent its own language
And just a few months ago, a computer science Ph.D. student discovered that feeding DALL-E 2 certain strange made-up words could also generate consistent kinds of images.
These words themselves came from images generated by DALL-E 2.
For example, after typing "Two farmers talking about vegetables, with subtitles," DALL-E 2 gives images with some "garbled words."
And if you throw the new word "Vicootes" from the image back to the model as a description, out comes a batch of images like these:
There are turnips, there are pumpkins, there are persimmons... Does "Vicootes" stand for vegetables?
If you throw a string of "Apoploe vesrreaitais" from the bubble above to DALL-E 2, a bunch of bird drawings appear:
"Could it be that the word stands for 'bird,' so farmers seem to be talking about birds that affect their vegetables?"
After the Ph.D. student posted his findings online, they immediately sparked heated discussion.
Some people tried to analyze how DALL-E 2 "encrypts" language, while others thought it was just noise.
But in general, when it comes to language comprehension, DALL-E 2 always manages to do something unexpected.
What do you think is the reason behind this?
Paper Address:
https://arxiv.org/pdf/2210.10606.pdf
Reference links:
https://twitter.com/yoavgo/status/1583088957226881025
— End —
QbitAI · Signed author on Toutiao
Follow us and be the first to know the latest technology trends