license: gemma
library_name: transformers
pipeline_tag: text-generation
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: >-
  To access Gemma on Hugging Face, you’re required to review and agree to
  Google’s usage license. To do this, please ensure you’re logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
tags:
  - conversational
base_model: google/gemma-2-27b-it

DataGemma RIG model card

Resources and Technical Documentation:

Terms of Use: Terms

Authors: Google

Model Information

Description

DataGemma is a series of fine-tuned Gemma 2 models that help LLMs access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RIG is used in the retrieval interleaved generation (RIG) approach, which is based on tool-use approaches: the model is trained to annotate each statistic in its response with a natural language query to Data Commons’ existing natural language interface. More information can be found in this research paper.

Inputs and outputs

  • Input: Text string, such as a question or a prompt.
  • Output: Generated English-language text in response to the input where statistics in the response are annotated with [__DC__("<natural language query to fetch the statistic from Data Commons>") --> "<LLM generated statistic>"].
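
As a rough illustration of the output format described above, the annotations can be pulled out of a response with a regular expression. This is a minimal sketch, not part of the DataGemma tooling: the pattern and the helper name `extract_dc_annotations` are assumptions made for this example, and the pattern assumes the annotation is emitted exactly as shown, with no nested double quotes.

```python
import re

# Assumed pattern for the documented format:
# [__DC__("<query>") --> "<LLM generated statistic>"]
DC_ANNOTATION = re.compile(r'\[__DC__\("(?P<query>.*?)"\) --> "(?P<stat>.*?)"\]')


def extract_dc_annotations(text: str) -> list[tuple[str, str]]:
    """Return (query, llm_statistic) pairs in order of appearance."""
    return [(m.group("query"), m.group("stat")) for m in DC_ANNOTATION.finditer(text)]


sample = (
    'with a median age of around [__DC__("what was the median age of residents '
    'in Sunnyvale, CA in 2020?") --> "35"].'
)
for query, stat in extract_dc_annotations(sample):
    print(f"{stat!r} <- {query}")
```

Each pair keeps the model's own figure next to the query that would be sent to Data Commons, so the two can be compared once a value has been retrieved.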

Usage

Below we provide a code snippet to run the fine-tuned model, which is just one step in the complete RIG approach explained in the DataGemma paper. You can try out the end-to-end RIG flow in this Colab notebook.

To run this model, first make sure to pip install -U transformers accelerate, then copy the code snippet from the following section.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

input_text = 'What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?'
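# Send the inputs to the model's device instead of hard-coding 'cuda'.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the echoed prompt.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)

Example output
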
Sunnyvale, CA has seen significant progress in gender, age, and educational demographics in recent years. Here's a breakdown:

**Gender:**

* **Overall:** Sunnyvale has a relatively equal gender distribution, with a slight skew towards male residents.
* **Recent Shifts:**

    * **2010-2020:** The female population in Sunnyvale has been steadily increasing, reaching a [__DC__("what was the percentage of females in Sunnyvale, CA in 2020?") --> "51.6%"] majority in 2020. This is largely due to its proximity to Silicon Valley and the increasing number of women working in tech.
    * **Age:** Sunnyvale has a relatively young population, with a median age of around [__DC__("what was the median age of residents in Sunnyvale, CA in 2020?") --> "35"]. This reflects the dominance of the tech industry, which tends to favor younger demographics.
    * **Immigration:** Sunnyvale has a significant immigrant population, with many residents hailing from countries in Central and South America, Asia, and the Middle East.

**Age:**

* **Recent Shifts:**

    * **2010-2020:** Sunnyvale has seen a slight decrease in the median age, driven by a competitive job market and a desire for better work-life balance.
    * **Aging Population:** While Sunnyvale has a young population overall, it's also aging. This means the proportion of older residents is increasing, which can put pressure on healthcare and social services.

**Race & Ethnicity:**

* **Majority-Minority City:** Sunnyvale is a majority-minority city, meaning that no single racial or ethnic group constitutes a majority of the population.
* **Key Demographics:**

    * **White:** [__DC__("what percentage of residents in Sunnyvale, CA are white in 2020?") --> "55.1%"]
    * **Hispanic or Latino:** [__DC__("what percentage of residents in Sunnyvale, CA are hispanic or latino in 2020?") --> "30.6%"]
    * **Asian:** [__DC__("what percentage of residents in Sunnyvale, CA are asian in 2020?") --> "16.8%"]
    * **Black or African American:** [__DC__("what percentage of residents in Sunnyvale, CA are black or african american in 2020?") --> "3.6%"]
    * **Other:** [__DC__("what percentage of residents in Sunnyvale, CA are classified as other race in 2020?") --> "0.5%"]
* **Trends & Issues:**

    * **Tech Dominance:** Sunnyvale's proximity to tech jobs drives a younger, more diverse workforce.
    * **Education & Income:**

A highly educated population in Sunnyvale translates to a highly skilled workforce, commanding higher salaries on average.
* **Racial & Ethnic Diversity:**

The city's commitment to diversity and inclusion is reflected in its demographics.
* **Aging Population:**

The aging population presents challenges for healthcare and social services, while a younger population puts pressure on schools and childcare.

**Economic Conditions:**

* **Low Unemployment Rate:** Sunnyvale consistently boasts a low unemployment rate, indicating a strong tech sector.
* **High Median Household Income:**

The median household income in Sunnyvale is significantly higher than the national average, driven by the high demand for skilled labor in the area.
* **Competitive Landscape:** Sunnyvale faces increasing competition for skilled workers and struggles with housing affordability and traffic congestion.

**Education Levels:**

* **High Percentage of Adults with Bachelor's Degree or Higher:**

A highly educated population is a key driver of the city's workforce and economy.
* **Emphasis on STEM Education & Innovation:**

Sunnyvale schools place a strong emphasis on STEM education and innovation, reflecting the region's tech-driven culture.

**Challenges:**

* **Gender Gap:**

The tech industry has a persistent gender gap, with women holding a smaller percentage of jobs than men.

* **Age Diversity:**

While Sunnyvale has a young population overall, it's important to ensure a balance of age groups in the workforce.
* **Immigration Policy:**

Immigration policy debates and enforcement can create uncertainty and hardship for immigrant communities.

**Resources:**

* **Sunnyvale Chamber of Commerce:** https://www.sunnyvalecoc.org/
* **City of Sunnyvale:** https://www.sunnyvale.ca.gov/
* **Stanford University:** https://www.stanford.edu/

**Note:**

These are just overarching trends. It's important to consult reliable sources like the U.S. Census Bureau and the Bureau of Labor Statistics for more detailed and up-to-date information.
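
The sample output ends here. The annotations in it are meant to be resolved in a later retrieval step that this model card does not include. The sketch below shows one way such a step could be wired up. It rests on several assumptions: `lookup` is a hypothetical callable standing in for a Data Commons query, the dictionary value is a placeholder rather than real data, and the policy of preferring the retrieved value and falling back to the model's figure is a choice made for this example, not the behavior documented in the DataGemma paper.

```python
import re
from typing import Callable, Optional

# Assumed pattern for [__DC__("<query>") --> "<LLM generated statistic>"]
DC_ANNOTATION = re.compile(r'\[__DC__\("(.*?)"\) --> "(.*?)"\]')


def resolve_annotations(text: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Replace each annotation with the looked-up value, else the model's own value."""

    def _replace(match: re.Match) -> str:
        query, llm_stat = match.group(1), match.group(2)
        retrieved = lookup(query)  # hypothetical retrieval call
        return retrieved if retrieved is not None else llm_stat

    return DC_ANNOTATION.sub(_replace, text)


# Stand-in lookup backed by a dict; the value is a placeholder, not real data.
fake_store = {
    "what was the median age of residents in Sunnyvale, CA in 2020?": "<value from Data Commons>",
}
annotated = (
    'a median age of around [__DC__("what was the median age of residents '
    'in Sunnyvale, CA in 2020?") --> "35"].'
)
print(resolve_annotations(annotated, fake_store.get))
```

Passing the lookup in as a callable keeps the text handling separate from whatever client is actually used to query Data Commons.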

Run in 4-bit via bitsandbytes

To run this model, first make sure to pip install -U transformers bitsandbytes accelerate, then copy the code snippet from the following section.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization, with computation carried out in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)

input_text = 'What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?'
# Send the inputs to the model's device instead of hard-coding 'cuda'.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)
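
To give a sense of why the 4-bit path matters, the arithmetic below estimates the memory needed for the weights alone. It is a back-of-the-envelope sketch under stated assumptions: the parameter count is the nominal 27 billion taken from the model name, and the estimate ignores activations, the KV cache, quantization constants, and any layers that stay in higher precision, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 27e9  # assumption: taken from the "27b" in the model name

print(f"bfloat16:  ~{weight_memory_gb(NOMINAL_PARAMS, 16):.1f} GB")
print(f"4-bit NF4: ~{weight_memory_gb(NOMINAL_PARAMS, 4):.1f} GB")
```

Under these assumptions the weights shrink from about 54 GB to about 13.5 GB.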

Citation

@misc{radhakrishnan2024knowing,
      title={Knowing When to Ask - Bridging Large Language Models and Data}, 
      author={Prashanth Radhakrishnan and Jennifer Chen and Bo Xu and Prem Ramaswami and Hannah Pho and Adriana Olmos and James Manyika and R. V. Guha},
      year={2024},
      eprint={2409.13741},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.13741}, 
}

Model Data

The base model was trained on a dataset of text data drawn from a wide variety of sources; see the Gemma 2 documentation for more details. The DataGemma RIG model is fine-tuned on synthetically generated data. More details can be found in the DataGemma paper.

Implementation Information

Like Gemma, DataGemma RIG was trained on TPUv5e, using JAX.

Evaluation

The model was evaluated as part of the full RIG workflow, and the results are documented in the DataGemma paper.

Ethics and Safety

We are releasing an early version of the models. They are meant for academic and research purposes and are not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this LLM interface.

  • We red teamed and checked the Data Commons Natural Language interface pre-launch against a set of potentially dangerous queries that could result in misleading, controversial, or inflammatory results.
  • We ran these same queries against the outputs of the RIG and RAG models, finding a few examples where query responses were controversial, but not dangerous.
  • As this model is meant purely for academic and research purposes, it has not been subjected to our usual safety evaluations.

Usage and Limitations

These models have certain limitations that users should be aware of.

This is a very early version of DataGemma RIG. It is meant for trusted tester use (primarily academic and research use) and is not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.

Your feedback and evaluations are critical to refining DataGemma's performance and will directly contribute to its training process. Known limitations are detailed in the DataGemma paper, and we encourage you to consult it for a comprehensive understanding of DataGemma's current capabilities.