[GUEST ACCESS MODE: Data is scrambled or limited to provide examples. Make requests using your API key to unlock full data. Check https://lunarcrush.ai/auth for authentication information.]

[@LiJunnan0409](/creator/twitter/LiJunnan0409)

"1 vision token = XX text tokens is not something that can be concluded based on their experiments. XXX% text reconstruction does not imply that the vision tokens encode all textual information since the language decoder plays a big role. Need to remove the language prior to get a more accurate compression ratio. E.g. what if the image contains text in non-readable order"
[X Link](https://x.com/LiJunnan0409/status/1980446374144667774) [@LiJunnan0409](/creator/x/LiJunnan0409) 2025-10-21T01:29Z 2777 followers, XXX engagements

"Maybe ChatGPT Atlas should consider using our GTA1 grounder 😉"
[X Link](https://x.com/LiJunnan0409/status/1980785983382765740) [@LiJunnan0409](/creator/x/LiJunnan0409) 2025-10-21T23:59Z 2777 followers, XXX engagements

"There lacks evidence that pixel-based representations are more compact than representing language directly as text tokens. The saying An image is worth a thousand words actually implies that an image can be interpreted in countless ways most of which are irrelevant to language understanding. Even though a vision encoder can abstract away much of this irrelevant information it raises a question: why go through that detour when we can represent meaning directly with text tokens"
[X Link](https://x.com/LiJunnan0409/status/1980788548375835071) [@LiJunnan0409](/creator/x/LiJunnan0409) 2025-10-22T00:09Z 2777 followers, 2380 engagements