metadata
base_model: Snowflake/snowflake-arctic-embed-m
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
- dot_accuracy@1
- dot_accuracy@3
- dot_accuracy@5
- dot_accuracy@10
- dot_precision@1
- dot_precision@3
- dot_precision@5
- dot_precision@10
- dot_recall@1
- dot_recall@3
- dot_recall@5
- dot_recall@10
- dot_ndcg@10
- dot_mrr@10
- dot_map@100
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:600
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: >-
How can high compute resource utilization in training GAI models affect
ecosystems?
sentences:
- >-
should not be used in education, work, housing, or in other contexts
where the use of such surveillance
technologies is likely to limit rights, opportunities, or access.
Whenever possible, you should have access to
reporting that confirms your data decisions have been respected and
provides an assessment of the
potential impact of surveillance technologies on your rights,
opportunities, or access.
NOTICE AND EXPLANATION
- >-
Legal Disclaimer
The Blueprint for an AI Bill of Rights: Making Automated Systems Work
for the American People is a white paper
published by the White House Office of Science and Technology Policy. It
is intended to support the
development of policies and practices that protect civil rights and
promote democratic values in the building,
deployment, and governance of automated systems.
The Blueprint for an AI Bill of Rights is non-binding and does not
constitute U.S. government policy. It
does not supersede, modify, or direct an interpretation of any existing
statute, regulation, policy, or
international instrument. It does not constitute binding guidance for
the public or Federal agencies and
- >-
or stereotyping content.
4. Data Privacy: Impacts due to leakage and unauthorized use, disclosure, or
de-anonymization of biometric, health, location, or other personally
identifiable information or sensitive data.
5. Environmental Impacts: Impacts due to high compute resource utilization in
training or operating GAI models, and related outcomes that may adversely
impact ecosystems.
6. Harmful Bias or Homogenization: Amplification and exacerbation of
historical, societal, and systemic biases; performance disparities between
sub-groups or languages, possibly due to non-representative training data,
that result in discrimination, amplification of biases, or
- source_sentence: >-
What are the potential risks associated with human-AI configuration in GAI
systems?
sentences:
- >-
establish approved GAI technology and service provider lists. Value Chain
and Component Integration
GV-6.1-008 Maintain records of changes to content made by third parties to
promote content provenance, including sources, timestamps, metadata.
Information Integrity; Value Chain and Component Integration; Intellectual
Property
GV-6.1-009 Update and integrate due diligence processes for GAI acquisition
and procurement vendor assessments to include intellectual property, data
privacy, security, and other risks. For example, update processes to:
Address solutions that may rely on embedded GAI technologies; Address
ongoing monitoring, assessments, and alerting, dynamic risk assessments,
and real-time reporting
- >-
could lead to homogenized outputs, including by amplifying any
homogenization from the model used to generate the synthetic training data.
Trustworthy AI Characteristics: Fair with Harmful Bias Managed, Valid and
Reliable
2.7. Human-AI Configuration
GAI system use can involve varying risks of misconfigurations and poor
interactions between a system and a human who is interacting with it.
Humans bring their unique perspectives, experiences, or domain-specific
expertise to interactions with AI systems but may not have detailed
knowledge of AI systems and how they work. As a result, human experts may
be unnecessarily “averse” to GAI systems, and thus deprive themselves or
others of GAI’s beneficial uses.
- >-
requests image features that are inconsistent with the stereotypes. Harmful
bias in GAI models, which may stem from their training data, can also cause
representational harms or perpetuate or exacerbate bias based on race,
gender, disability, or other protected classes.
Harmful bias in GAI systems can also lead to harms via disparities between
how a model performs for different subgroups or languages (e.g., an LLM may
perform less well for non-English languages or certain dialects). Such
disparities can contribute to discriminatory decision-making or
amplification of existing societal biases. In addition, GAI systems may be
inappropriately trusted to perform similarly
- source_sentence: >-
What types of content are considered harmful biases in the context of
information security?
sentences:
- >-
MS-2.5-005 Verify GAI system training data and TEVV data provenance, and
that fine-tuning or retrieval-augmented generation data is grounded.
Information Integrity
MS-2.5-006 Regularly review security and safety guardrails, especially if
the GAI system is being operated in novel circumstances. This includes
reviewing reasons why the GAI system was initially assessed as being safe
to deploy. Information Security; Dangerous, Violent, or Hateful Content
AI Actor Tasks: Domain Experts, TEVV
- >-
to diminished transparency or accountability for downstream users. While
this is a risk for traditional AI systems and some other digital
technologies, the risk is exacerbated for GAI due to the scale of the
training data, which may be too large for humans to vet; the difficulty of
training foundation models, which leads to extensive reuse of limited
numbers of models; and the extent to which GAI may be integrated into other
devices and services. As GAI systems often involve many distinct
third-party components and data sources, it may be difficult to attribute
issues in a system’s behavior to any one of these sources.
Errors in third-party GAI components can also have downstream impacts on
accuracy and robustness.
- >-
biases in the generated content. Information Security; Harmful Bias and
Homogenization
MG-2.2-005 Engage in due diligence to analyze GAI output for harmful
content, potential misinformation, and CBRN-related or NCII content. CBRN
Information or Capabilities; Obscene, Degrading, and/or Abusive Content;
Harmful Bias and Homogenization; Dangerous, Violent, or Hateful Content
- source_sentence: >-
What is the focus of the paper by Padmakumar et al. (2024) regarding
language models and content diversity?
sentences:
- >-
Content
MS-2.12-002 Document anticipated environmental impacts of model
development, maintenance, and deployment in product design decisions.
Environmental
MS-2.12-003 Measure or estimate environmental impacts (e.g., energy and
water consumption) for training, fine-tuning, and deploying models: Verify
tradeoffs between resources used at inference time versus additional
resources required at training time. Environmental
MS-2.12-004 Verify effectiveness of carbon capture or offset programs for
GAI training and applications, and address green-washing concerns.
Environmental
AI Actor Tasks: AI Deployment, AI Impact Assessment, Domain Experts,
Operation and Monitoring, TEVV
- >-
opportunities, undermine their privacy, or pervasively track their
activity—often without their knowledge or consent.
These outcomes are deeply harmful—but they are not inevitable. Automated
systems have brought about extraordinary benefits, from technology that
helps farmers grow food more efficiently and computers that predict storm
paths, to algorithms that can identify diseases in patients. These tools
now drive important decisions across sectors, while data is helping to
revolutionize global industries. Fueled by the power of American
innovation, these tools hold the potential to redefine every part of our
society and make life better for everyone.
- >-
Publishing, Paris. https://doi.org/10.1787/d1a8d965-en
OpenAI (2023) GPT-4 System Card.
https://cdn.openai.com/papers/gpt-4-system-card.pdf
OpenAI (2024) GPT-4 Technical Report.
https://arxiv.org/pdf/2303.08774
Padmakumar, V. et al. (2024) Does writing with language models reduce
content diversity? ICLR. https://arxiv.org/pdf/2309.05196
Park, P. et al. (2024) AI deception: A survey of examples, risks, and
potential solutions. Patterns, 5(5). arXiv.
https://arxiv.org/pdf/2308.14752
Partnership on AI (2023) Building a Glossary for Synthetic Media
Transparency Methods, Part 1: Indirect Disclosure.
https://partnershiponai.org/glossary-for-synthetic-media-transparency-methods-part-1-indirect-disclosure/
- source_sentence: >-
What are the key components involved in ensuring data quality and ethical
considerations in AI systems?
sentences:
- >-
(such as where significant negative impacts are imminent, severe harms are
actually occurring, or large-scale risks could occur); and broad GAI
negative risks, including: Immature safety or risk cultures related to AI
and GAI design, development and deployment, public information integrity
risks, including impacts on democratic processes, unknown long-term
performance characteristics of GAI. Information Integrity; Dangerous,
Violent, or Hateful Content; CBRN Information or Capabilities
GV-1.3-007 Devise a plan to halt development or deployment of a GAI system
that poses unacceptable negative risk. CBRN Information and Capability;
Information Security; Information Integrity
AI Actor Tasks: Governance and Oversight
- >-
MEASURE 2.2: Evaluations involving human subjects meet applicable
requirements (including human subject protection) and are representative
of the relevant population.
Action ID Suggested Action GAI Risks
MS-2.2-001 Assess and manage statistical biases related to GAI content
provenance through techniques such as re-sampling, re-weighting, or
adversarial training. Information Integrity; Information Security; Harmful
Bias and Homogenization
MS-2.2-002 Document how content provenance data is tracked and how that
data interacts with privacy and security. Consider: Anonymizing data to
protect the privacy of human subjects; Leveraging privacy output filters;
Removing any personally
- >-
Data quality; Model architecture (e.g., convolutional neural network,
transformers, etc.); Optimization objectives; Training algorithms; RLHF
approaches; Fine-tuning or retrieval-augmented generation approaches;
Evaluation data; Ethical considerations; Legal and regulatory
requirements. Information Integrity; Harmful Bias and Homogenization
AI Actor Tasks: AI Deployment, AI Impact Assessment, Domain Experts,
End-Users, Operation and Monitoring, TEVV
MEASURE 2.10: Privacy risk of the AI system – as identified in the MAP
function – is examined and documented.
Action ID Suggested Action GAI Risks
MS-2.10-001 Conduct AI red-teaming to assess issues such as: Outputting of
training data
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.8
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.99
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 0.99
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.8
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.33000000000000007
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.19799999999999998
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.09999999999999998
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.8
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.99
name: Cosine Recall@3
- type: cosine_recall@5
value: 0.99
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9195108324425135
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.8916666666666667
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.8916666666666666
name: Cosine Map@100
- type: dot_accuracy@1
value: 0.8
name: Dot Accuracy@1
- type: dot_accuracy@3
value: 0.99
name: Dot Accuracy@3
- type: dot_accuracy@5
value: 0.99
name: Dot Accuracy@5
- type: dot_accuracy@10
value: 1
name: Dot Accuracy@10
- type: dot_precision@1
value: 0.8
name: Dot Precision@1
- type: dot_precision@3
value: 0.33000000000000007
name: Dot Precision@3
- type: dot_precision@5
value: 0.19799999999999998
name: Dot Precision@5
- type: dot_precision@10
value: 0.09999999999999998
name: Dot Precision@10
- type: dot_recall@1
value: 0.8
name: Dot Recall@1
- type: dot_recall@3
value: 0.99
name: Dot Recall@3
- type: dot_recall@5
value: 0.99
name: Dot Recall@5
- type: dot_recall@10
value: 1
name: Dot Recall@10
- type: dot_ndcg@10
value: 0.9195108324425135
name: Dot Ndcg@10
- type: dot_mrr@10
value: 0.8916666666666667
name: Dot Mrr@10
- type: dot_map@100
value: 0.8916666666666666
name: Dot Map@100
SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-m
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
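For reference, the same Transformer -> CLS pooling -> Normalize stack can be assembled by hand with sentence_transformers.models. This is a minimal sketch that loads the base checkpoint rather than this fine-tuned one:
from sentence_transformers import SentenceTransformer, models
# Token encoder: the BERT backbone, truncating inputs at 512 tokens
word_embedding = models.Transformer("Snowflake/snowflake-arctic-embed-m", max_seq_length=512)
# CLS-token pooling, matching pooling_mode_cls_token=True in the printout above
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 768
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)
# L2-normalize embeddings so cosine similarity reduces to a dot product
model = SentenceTransformer(modules=[word_embedding, pooling, models.Normalize()])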
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("XicoC/midterm-finetuned-arctic")
# Run inference
sentences = [
'What are the key components involved in ensuring data quality and ethical considerations in AI systems?',
'Data quality; Model architecture (e.g., convolutional neural network, transformers, etc.); Optimization objectives; Training algorithms; RLHF approaches; Fine-tuning or retrieval-augmented generation approaches; Evaluation data; Ethical considerations; Legal and regulatory requirements. Information Integrity; Harmful Bias and Homogenization\nAI Actor Tasks: AI Deployment, AI Impact Assessment, Domain Experts, End-Users, Operation and Monitoring, TEVV\nMEASURE 2.10: Privacy risk of the AI system – as identified in the MAP function – is examined and documented.\nAction ID Suggested Action GAI Risks\nMS-2.10-001 Conduct AI red-teaming to assess issues such as: Outputting of training data',
'MEASURE 2.2: Evaluations involving human subjects meet applicable requirements (including human subject protection) and are representative of the relevant population.\nAction ID Suggested Action GAI Risks\nMS-2.2-001 Assess and manage statistical biases related to GAI content provenance through techniques such as re-sampling, re-weighting, or adversarial training. Information Integrity; Information Security; Harmful Bias and Homogenization\nMS-2.2-002 Document how content provenance data is tracked and how that data interacts with privacy and security. Consider: Anonymizing data to protect the privacy of human subjects; Leveraging privacy output filters; Removing any personally',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.8 |
cosine_accuracy@3 | 0.99 |
cosine_accuracy@5 | 0.99 |
cosine_accuracy@10 | 1.0 |
cosine_precision@1 | 0.8 |
cosine_precision@3 | 0.33 |
cosine_precision@5 | 0.198 |
cosine_precision@10 | 0.1 |
cosine_recall@1 | 0.8 |
cosine_recall@3 | 0.99 |
cosine_recall@5 | 0.99 |
cosine_recall@10 | 1.0 |
cosine_ndcg@10 | 0.9195 |
cosine_mrr@10 | 0.8917 |
cosine_map@100 | 0.8917 |
dot_accuracy@1 | 0.8 |
dot_accuracy@3 | 0.99 |
dot_accuracy@5 | 0.99 |
dot_accuracy@10 | 1.0 |
dot_precision@1 | 0.8 |
dot_precision@3 | 0.33 |
dot_precision@5 | 0.198 |
dot_precision@10 | 0.1 |
dot_recall@1 | 0.8 |
dot_recall@3 | 0.99 |
dot_recall@5 | 0.99 |
dot_recall@10 | 1.0 |
dot_ndcg@10 | 0.9195 |
dot_mrr@10 | 0.8917 |
dot_map@100 | 0.8917 |
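These figures were produced with the sentence_transformers InformationRetrievalEvaluator. A hedged sketch of how such an evaluation is typically run; the queries, corpus, and relevance judgments below are illustrative stand-ins, not the actual held-out split used for this card:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("XicoC/midterm-finetuned-arctic")

# Illustrative held-out data: query ids and corpus ids mapped to text,
# plus the set of relevant corpus ids for each query.
queries = {"q1": "Where can the NIST AI 600-1 publication be accessed for free?"}
corpus = {"d1": "This publication is available free of charge from: https://doi.org/10.6028/NIST.AI.600-1"}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
metrics = evaluator(model)  # returns a dict of metrics such as {"cosine_ndcg@10": ...}
print(metrics["cosine_ndcg@10"])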
Training Details
Training Dataset
Unnamed Dataset
- Size: 600 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 600 samples:
 | sentence_0 | sentence_1 |
---|---|---|
type | string | string |
details | min: 13 tokens, mean: 21.67 tokens, max: 34 tokens | min: 3 tokens, mean: 132.86 tokens, max: 512 tokens |
- Samples:
sentence_0 | sentence_1 |
---|---|
What is the title of the NIST publication related to Artificial Intelligence Risk Management? | NIST Trustworthy and Responsible AI, NIST AI 600-1, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. This publication is available free of charge from: https://doi.org/10.6028/NIST.AI.600-1 |
Where can the NIST AI 600-1 publication be accessed for free? | NIST Trustworthy and Responsible AI, NIST AI 600-1, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. This publication is available free of charge from: https://doi.org/10.6028/NIST.AI.600-1 |
What is the title of the publication released by NIST in July 2024 regarding artificial intelligence? | NIST Trustworthy and Responsible AI, NIST AI 600-1, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. This publication is available free of charge from: https://doi.org/10.6028/NIST.AI.600-1. July 2024. U.S. Department of Commerce, Gina M. Raimondo, Secretary; National Institute of Standards and Technology, Laurie E. Locascio, NIST Director and Under Secretary of Commerce for Standards and Technology |
- Loss: MatryoshkaLoss with these parameters:
  { "loss": "MultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
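A minimal sketch of how this loss configuration is constructed in code, with model standing in for the SentenceTransformer being fine-tuned:
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")

# In-batch-negatives ranking loss, applied at every Matryoshka dimension
# with equal weight, matching the parameters listed above.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
    n_dims_per_step=-1,  # train on all dimensions at every step
)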
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 20
- per_device_eval_batch_size: 20
- num_train_epochs: 5
- multi_dataset_batch_sampler: round_robin
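In Sentence Transformers 3.x, these non-default values map onto SentenceTransformerTrainingArguments. A hedged sketch of the corresponding trainer setup; the output directory and the tiny two-column dataset are illustrative, and model and loss are the objects from the loss sketch above:
from datasets import Dataset
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

# Illustrative stand-in for the 600 (sentence_0, sentence_1) training pairs
train_dataset = Dataset.from_dict({
    "sentence_0": ["Where can the NIST AI 600-1 publication be accessed for free?"],
    "sentence_1": ["This publication is available free of charge from: https://doi.org/10.6028/NIST.AI.600-1"],
})

args = SentenceTransformerTrainingArguments(
    output_dir="midterm-finetuned-arctic",  # illustrative path
    eval_strategy="steps",
    per_device_train_batch_size=20,
    per_device_eval_batch_size=20,
    num_train_epochs=5,
    multi_dataset_batch_sampler="round_robin",
)

trainer = SentenceTransformerTrainer(
    model=model,                 # the SentenceTransformer being fine-tuned
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # illustrative; use a held-out split in practice
    loss=loss,                   # the MatryoshkaLoss defined earlier
)
trainer.train()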
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 20
- per_device_eval_batch_size: 20
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 5
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | cosine_map@100 |
---|---|---|
1.0 | 30 | 0.8722 |
1.6667 | 50 | 0.8817 |
2.0 | 60 | 0.8867 |
3.0 | 90 | 0.8867 |
3.3333 | 100 | 0.8917 |
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.1.0
- Transformers: 4.44.2
- PyTorch: 2.4.0+cu121
- Accelerate: 0.34.2
- Datasets: 2.19.2
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}