[BUGFIX] argilla: Fix some from_hub method errors #5523

Merged: 7 commits, Sep 23, 2024

frascuchon
Member

Description

Trying this code:

import argilla as rg

client = rg.Argilla(...)

dataset = rg.Dataset.from_hub("google-research-datasets/circa")

I found several errors (source dataset: https://huggingface.co/datasets/google-research-datasets/circa/viewer/default/train?f[goldstandard2][value]=0):

  • The record mapping does not match because the column names contain uppercase characters
  • Some values in the ClassValue column are unlabelled (-1), which makes the uncast process fail
  • Importing cached datasets without config files fails when trying to read the dataset JSON file

This PR addresses all the problems listed above.
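
The unlabelled-value problem can be illustrated without Argilla or the Hugging Face libraries. This is a minimal sketch, assuming class-label columns store integer ids and use -1 for unlabelled rows. `uncast_label`, `UNLABELLED`, and the label names are hypothetical and are not the code changed in this PR:

```python
from typing import List, Optional, Sequence

UNLABELLED = -1  # assumption: id stored for rows without a label


def uncast_label(value: int, names: Sequence[str]) -> Optional[str]:
    """Map an integer class id back to its name, or None when unlabelled."""
    if value == UNLABELLED:
        # A plain names[value] lookup would silently return the last name here.
        return None
    if not 0 <= value < len(names):
        raise ValueError(f"Invalid class id {value} for {len(names)} classes")
    return names[value]


names = ["Yes", "No", "Other"]  # illustrative label names
labels: List[Optional[str]] = [uncast_label(v, names) for v in [0, -1, 2]]
print(labels)  # ['Yes', None, 'Other']
```

Returning `None` for unlabelled rows lets the import continue and leaves the decision about missing labels to the caller.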

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested

Checklist

  • I added relevant documentation
  • I followed the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • I confirm my changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@@ -192,8 +193,11 @@ def from_hub(
    @staticmethod
    def _log_dataset_records(hf_dataset: "HFDataset", dataset: "Dataset"):
        """This method extracts the responses from a Hugging Face dataset and returns a list of `Record` objects"""
        # THIS IS REQUIRED SINCE NAME IN ARGILLA ARE LOWERCASE. The Settings BaseModel models apply this logic
        # Ideally, the restrictions in Argilla should be removed and the names should be case-insensitive
        hf_dataset = hf_dataset.rename_columns({col: col.lower() for col in hf_dataset.column_names})
Member Author

Here we apply the same logic as for the settings names. I only included the change here, but we could still have problems importing records whose field names are case-sensitive.

But maybe the solution would be to relax this constraint on the server backend (cc @jfcalvo). In any case, I think applying name.lower() to all the settings models is currently a risk, and we should change it.
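
The risk mentioned in this comment can be made concrete. This is a minimal sketch, assuming plain Python lists of column names. `lowercase_mapping` is a hypothetical helper, not part of Argilla: it builds the same kind of `{old: new}` mapping that the diff passes to `rename_columns`, and additionally rejects names that collide once lowercased:

```python
from collections import defaultdict
from typing import Dict, List


def lowercase_mapping(column_names: List[str]) -> Dict[str, str]:
    """Build an {original: lowercased} mapping, failing on collisions."""
    groups = defaultdict(list)
    for col in column_names:
        groups[col.lower()].append(col)
    # Two distinct columns that lowercase to the same name would clash.
    collisions = {low: cols for low, cols in groups.items() if len(cols) > 1}
    if collisions:
        raise ValueError(f"Column names collide after lowercasing: {collisions}")
    return {col: col.lower() for col in column_names}


# Illustrative column names, not taken from the dataset in this PR.
print(lowercase_mapping(["Question", "Answer", "label"]))
# {'Question': 'question', 'Answer': 'answer', 'label': 'label'}
```

A dataset with both `Label` and `label` columns would raise here instead of being renamed into two columns with the same name.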

@frascuchon frascuchon changed the title from "[BUGFIX] argilla: Fix some from_hub method errors:" to "[BUGFIX] argilla: Fix some from_hub method errors" Sep 20, 2024
@frascuchon frascuchon changed the base branch from develop to main September 23, 2024 10:05
@frascuchon frascuchon changed the base branch from main to develop September 23, 2024 10:05
@frascuchon frascuchon merged commit 0df40d2 into develop Sep 23, 2024
6 checks passed
@frascuchon frascuchon deleted the bugfixes/argilla/from_hub-method branch September 23, 2024 10:08
jfcalvo pushed a commit that referenced this pull request Sep 23, 2024