20240325
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Proposes a data augmentation method for LLM fine-tuning.
- Start from a seed dataset and fine-tune the student model on it.
- Evaluate on the training set and collect the wrong cases.
- Annotate the wrong cases with the teacher model (i.e., generate new examples targeting them).
- Append the teacher-generated data to the seed dataset and return to step 1 (see the sketch after this list).
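A minimal sketch of the loop, assuming hypothetical `finetune`, `evaluate`, and `teacher_generate` helpers (the paper's actual training and generation details are not reproduced here):

```python
def llm2llm(seed_data, student, teacher, num_iters=3):
    """Iterative data enhancement loop (sketch, not the paper's code)."""
    data = list(seed_data)
    for _ in range(num_iters):
        # 1) fine-tune the student on the current dataset
        student = finetune(student, data)
        # 2) evaluate on the training set and collect the wrong cases
        wrong = [ex for ex in data if not evaluate(student, ex)]
        # 3) the teacher generates new examples targeting those errors
        new_examples = teacher_generate(teacher, wrong)
        # 4) append to the dataset and repeat
        data.extend(new_examples)
    return student, data
```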
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
- Uses Mamba as the LLM backbone of a vision-language model.
- Combines DINOv2 and SigLIP features as the vision feature.
- Uses an MLP as the alignment module (sketch below).
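A minimal sketch of that pipeline, assuming hypothetical `dinov2`, `siglip`, and `mamba_lm` modules that return patch features / accept token embeddings, and that the two encoders yield the same number of patch tokens; the real Cobra fusion and projector details may differ:

```python
import torch
import torch.nn as nn

class CobraStyleVLM(nn.Module):
    def __init__(self, dinov2, siglip, mamba_lm, vis_dim, lm_dim):
        super().__init__()
        self.dinov2, self.siglip, self.mamba_lm = dinov2, siglip, mamba_lm
        # MLP alignment module: fused vision features -> LM embedding space
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )

    def forward(self, image, text_embeds):
        # concatenate DINOv2 and SigLIP patch features along the channel dimension
        vis = torch.cat([self.dinov2(image), self.siglip(image)], dim=-1)
        vis_tokens = self.projector(vis)
        # prepend the projected vision tokens to the text embeddings, feed the Mamba LM
        return self.mamba_lm(torch.cat([vis_tokens, text_embeds], dim=1))
```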
MATHVERSE: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
In math VLM benchmarks such as MathVista and GeoQA, the question text contains redundant information, so they may not accurately reveal a model's visual reasoning ability.
Proposes the MATHVERSE benchmark to address this.
20240326
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Proposes a data-mixing scaling law to find the optimal data mixture.
Essentially regression on small-scale runs (small data, small models, and few steps); see the sketch below.
Experiments are only on 1B-scale models.
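A hedged sketch of the idea: fit a simple parametric law per validation domain on small-run losses, then pick the mixture minimizing the predicted loss. The exponential-of-linear form and all numbers below are illustrative assumptions, not the paper's exact formula or measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def law(r, c, k, t0, t1):
    # assumed form: loss = c + k * exp(t . r), with r the mixture proportions
    return c + k * np.exp(r @ np.array([t0, t1]))

# toy small-run mixtures over two domains and per-domain validation losses
mixtures = np.array([[0.1, 0.9], [0.3, 0.7], [0.5, 0.5], [0.7, 0.3], [0.9, 0.1]])
loss_a = np.array([2.60, 2.40, 2.25, 2.15, 2.10])  # validation loss on domain A
loss_b = np.array([1.80, 1.90, 2.05, 2.25, 2.50])  # validation loss on domain B

params_a, _ = curve_fit(law, mixtures, loss_a, p0=[1.0, 1.0, 0.0, 0.0], maxfev=20000)
params_b, _ = curve_fit(law, mixtures, loss_b, p0=[1.0, 1.0, 0.0, 0.0], maxfev=20000)

# search candidate mixtures and keep the one minimizing the average predicted loss
p = np.linspace(0.0, 1.0, 101)
cands = np.stack([p, 1.0 - p], axis=1)
avg_pred = 0.5 * (law(cands, *params_a) + law(cands, *params_b))
print("predicted-optimal mixture:", cands[np.argmin(avg_pred)])
```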
RakutenAI-7B: Extending Large Language Models for Japanese
A 7B LLM for Japanese.
20240401
Are We on the Right Way for Evaluating Large Vision-Language Models?
LVLMs do well on current benchmarks even without the visual information (images).
Proposes the MMStar benchmark, built by sampling from existing benchmarks and filtering (first by LLM accuracy, then manually).
20240402
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
Based on MMBench, proposes a benchmark in which the problems are unsolvable, testing whether the model detects this rather than answering anyway.
Visual Goal-Step Inference using wikiHow
The wikiHow dataset contains a high-level goal (how to do something) accompanied by interleaved images and text representing the steps.
The task in the paper is to match the goal with the corresponding images.
202405
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Selects text-rich images from LAION.
Uses the OCR result and caption as context to generate QA pairs with GPT-4 (see the sketch below).
Fine-tunes LLaVA on the generated data.
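A hedged sketch of the GPT-4 data-generation step, using the OpenAI chat API; the prompt wording, model name, and the `ocr_text`/`caption` inputs are illustrative, not the paper's exact prompt:

```python
from openai import OpenAI

client = OpenAI()

def generate_qa(ocr_text: str, caption: str) -> str:
    # the generator never sees the image; OCR text and caption serve as its context
    prompt = (
        "You are given the OCR result and the caption of an image.\n"
        f"OCR: {ocr_text}\nCaption: {caption}\n"
        "Write question-answer pairs that require reading the text in the image."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```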
Odyssey-Math
A hard math benchmark (harder than MATH and GSM8K).
Asks the model to output JSON ({"answer": "", "reasoning": ""}), then uses an LLM to judge the answer against the standard (ground-truth) answer.
Cons:
- It would be better to put the "reasoning" field first, so the model reasons before answering.
- Why the JSON format? It would be better to have the model output a free-form answer and judge it with an LLM, as MathVista does. (A sketch of the judging flow, with reasoning placed first, follows.)
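A hedged sketch of that evaluation flow with the suggested reasoning-first ordering; `call_model` and `call_judge_llm` are hypothetical helpers and the prompts are illustrative, not the benchmark's actual templates:

```python
import json

def evaluate_item(question: str, gold_answer: str) -> bool:
    # ask for JSON with "reasoning" before "answer", so the model reasons first
    output = call_model(
        f"{question}\n"
        'Respond in JSON: {"reasoning": "<step-by-step reasoning>", "answer": "<final answer>"}'
    )
    try:
        answer = json.loads(output)["answer"]
    except (json.JSONDecodeError, KeyError):
        return False  # malformed output counts as wrong
    # LLM judge compares the predicted answer against the standard (ground-truth) answer
    verdict = call_judge_llm(
        f"Question: {question}\nGold answer: {gold_answer}\nPredicted answer: {answer}\n"
        "Are they mathematically equivalent? Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```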