text2sql.data-labeling Tool Project

Start Date	2025. 02. 12
Language	EN (KO attached in the document)

🗃️ Github

Developing a tool to efficiently manage and update data, which helps identify and correct errors in pre-built, domain-specific SQL queries and natural-language question data used for text2sql AI training.
In preparation for future crowdsourcing (multiple users), the tool supports task assignment, monitoring, and versioning for labeling and updating data.

<aside> 🗣

Manually correcting errors in pre-made data for text2sql AI training was slow and inefficient: the work was done by hand in bulky CSV files.
Gena needed an internal tool that provides precise labeling and per-column data modification, overcoming the limitations of open-source tools like “Doccano” and “Label Studio” (which only support labeling).
The tool should also make it possible to distribute labeling and data management tasks through crowdsourcing in the future, making the process faster.
</aside>

<aside> 🔑

Data Management
- Upload pre-made CSV files and download the latest version of data.
- Retrieve data by unique ID or fetch all data along with other versions.
- Modify field values (sql_query, natural_question): update, pass, or delete.
- Perform CRUD operations on labels (create, update, delete) for each data sample.
Logging & Version Control
- Track updates with logs (who, when, what).
- Retrieve previous versions of modified data.
Data Organization & Collaboration
- Admin creates groups of specific "samples" (rows of the dataset).
- Admin assigns dataset groups to users.
- Enable multiple users to label and modify data simultaneously.
Template Retrieval
- Templates (no_sql_template, sql_template values from the CSV file) are automatically saved upon file upload.
- Associate templates with datasets for quick access during data modification.
</aside>
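The upload, modification, logging, and version-retrieval features listed above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the tool's actual implementation: the `SampleStore` and `Version` classes, the `id` column name, and the `editor` argument are all hypothetical, and only the `sql_query`, `natural_question`, `no_sql_template`, and `sql_template` column names come from this page.

```python
import csv
import io
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Version:
    """One snapshot of a sample plus its log entry (who, when, what)."""
    sql_query: str
    natural_question: str
    editor: str      # who made the change
    action: str      # what happened: "upload" or "update"
    at: datetime     # when it happened (UTC)


@dataclass
class SampleStore:
    """Hypothetical in-memory store; a real tool would use a database."""
    history: dict = field(default_factory=dict)   # sample id -> versions, oldest first
    templates: set = field(default_factory=set)   # template values seen at upload

    def _append(self, sample_id, sql_query, natural_question, editor, action):
        version = Version(sql_query, natural_question, editor, action,
                          datetime.now(timezone.utc))
        self.history.setdefault(sample_id, []).append(version)

    def upload_csv(self, text, editor):
        """Store every row of the CSV text and save its templates.

        Assumes an `id` column exists; returns the number of rows stored.
        """
        rows = list(csv.DictReader(io.StringIO(text)))
        for row in rows:
            self._append(row["id"], row["sql_query"],
                         row["natural_question"], editor, "upload")
            for column in ("no_sql_template", "sql_template"):
                if row.get(column):
                    self.templates.add(row[column])
        return len(rows)

    def update(self, sample_id, editor, *, sql_query=None, natural_question=None):
        """Append a new version; fields left as None keep their latest value."""
        latest = self.latest(sample_id)
        self._append(
            sample_id,
            latest.sql_query if sql_query is None else sql_query,
            latest.natural_question if natural_question is None else natural_question,
            editor,
            "update",
        )

    def latest(self, sample_id):
        """Return the most recent version of a sample."""
        return self.history[sample_id][-1]

    def versions(self, sample_id):
        """Return every version of a sample, oldest first."""
        return list(self.history[sample_id])
```

Because each edit appends a new `Version` instead of overwriting the previous one, the version list doubles as the update log, and downloading "the latest version" is simply the last entry per sample.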

<aside> 🤔

⬆️ The user personas, Reviewer and Admin, perform the following stories while using the Gena text2sql labeling tool. Each story is categorized as High, Medium, or Low based on its importance or difficulty during the PoC (Proof of Concept) phase.

⬆️ The user personas, Reviewer and Admin, perform their respective tasks when using the Gena text2sql labeling tool:
- Admin - groups and sorts data, assigns it to users, and manages and monitors the datasets
- Reviewer - labels and modifies the assigned data samples, then reviews the modified data and requests updates
</aside>
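The split between the two personas can be illustrated with a small sketch of group assignment: an Admin groups sample IDs and assigns each group to Reviewers, and a Reviewer can only work on samples from groups assigned to them. The `Assignments` class and its method names are hypothetical and chosen for illustration; the page does not describe how the tool actually stores assignments.

```python
from dataclasses import dataclass, field


@dataclass
class Assignments:
    """Hypothetical in-memory model of Admin-managed dataset groups."""
    groups: dict = field(default_factory=dict)      # group name -> list of sample ids
    assignees: dict = field(default_factory=dict)   # group name -> set of reviewer names

    def create_group(self, name, sample_ids):
        """Admin: define (or redefine) a group of dataset rows."""
        self.groups[name] = list(sample_ids)
        self.assignees.setdefault(name, set())

    def assign(self, name, reviewer):
        """Admin: give a reviewer access to an existing group."""
        if name not in self.groups:
            raise KeyError(f"unknown group: {name}")
        self.assignees[name].add(reviewer)

    def samples_for(self, reviewer):
        """Reviewer: list the sample ids assigned through any group."""
        ids = []
        for name, sample_ids in self.groups.items():
            if reviewer in self.assignees[name]:
                ids.extend(sample_ids)
        return ids

    def can_edit(self, reviewer, sample_id):
        """Check whether a reviewer may label or modify a given sample."""
        return sample_id in self.samples_for(reviewer)
```

Several reviewers can be assigned to the same group, which matches the goal of letting multiple users label and modify data at the same time; conflict handling between simultaneous edits is outside the scope of this sketch.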

by Andrew Kim (Figma)

Wireframe for the frontend (FE) of a simple web app, planned during the Ideation and Planning stage (PoC).

Made technical decisions on the DB, server architecture, and data patterns in cooperation with the AI Backend team.

(by Andrew, Derek)

(by Andrew)

(by Andrew)

(by Andrew)