Papers About Creating a Dataset or Benchmark

March 31, 2023

Benchmark

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

procedure

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

TimelineQA: A Benchmark for Question Answering over Timelines

Large Language Models of Code Fail at Completing Code with Potential Bugs

A parallel corpus of Python functions and documentation strings for automated code documentation and code generation

Program Synthesis with Large Language Models

Two datasets:
- Mostly Basic Programming Problems(MBPP) dataset: crowd-sourced programming dataset
  - sample 100 questions and assign tags to questions
  - the number of lines of the reference solution
  - further manual inspection
  - make sure programs in train and test dataset don’t overlap
- MathQA-Python dataset: derived from MathQA dataset
Conclusion:
- Synthesis performance correlates poorly with BLEU score
- Programs sometimes overfit to assert statements, but it’s not a widespread problem
- Performance is sensitive to prompt examples: using different random seeds, the performance can be increased over $50\%$
Section 8.3: Overview of Benchmarks for Machine Learning over Source Code

NAPS: Natural Program Synthesis Dataset

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Table Pre-training: A Survey on Model Architectures, Pre-training Objectives, and Downstream Tasks

Tabular dataset

GitTables: A Large-Scale Corpus of Relational Tables

The Stack: 3 TB of permissively licensed source code

Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only