Scaling Transformers to 1B Tokens, Practical Rowhammer Fingerprinting, Conservation Laws for Gradient Flows, Mixture-of-Experts with Instruction Tuning Win

Joe H.
July 07, 2023

Welcome back to another deep dive into the cutting-edge world of research papers. Today, we’re tackling everything from LongNet, a Transformer variant that can handle a whopping 1 billion tokens, to Rowhammer fingerprinting with Centauri and the geometry of gradient descent in machine learning. We’re also looking at how instruction tuning benefits Mixture-of-Experts large language models. As always, we’ll be spicing things up with a dash of discussion from the Hacker News community. So, buckle up and prepare for an intellectual adventure through the latest in tech research.

Top Papers

1) Scaling Transformers to 1,000,000,000 Tokens

Summary:

LongNet is a Transformer variant that uses dilated attention to process sequences of up to 1 billion tokens while still performing well on shorter sequences.
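
To make "dilated attention" a bit more concrete, here is a rough sketch of the index pattern as we understand it: the sequence is cut into fixed-length segments, and within each segment only every r-th position is kept before attention is computed. The function names and the toy sizes are our own assumptions for illustration, and the sketch covers a single (segment length, dilation) pair only; it is not the authors' implementation.

```python
def dilated_segment_indices(seq_len, segment_len, dilation):
    """Return, for each segment, the token positions kept after dilation."""
    segments = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        # Keep every `dilation`-th position inside this segment.
        segments.append(list(range(start, end, dilation)))
    return segments


def attention_pairs(seq_len, segment_len, dilation):
    """Count query-key pairs when attention stays inside each sparsified segment."""
    kept = dilated_segment_indices(seq_len, segment_len, dilation)
    return sum(len(segment) ** 2 for segment in kept)


# Hypothetical toy sizes: 16 tokens, segments of 8, dilation of 2.
segments = dilated_segment_indices(seq_len=16, segment_len=8, dilation=2)
print(segments)  # [[0, 2, 4, 6], [8, 10, 12, 14]]
print(attention_pairs(16, 8, 2), "pairs vs", 16 ** 2, "for dense attention")
```

In this toy case the sparsified pattern needs 32 query-key pairs instead of 256, which hints at why the cost can grow far more slowly than the quadratic cost of dense attention.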

Scaling Transformers to 1,000,000,000 Tokens

Source: arxiv.org - PDF - 6,326 words - view

Hacker News:

Commenters see scaling Transformers to 1 billion tokens as important for capturing long-range dependencies in text, and some tie it to progress toward AGI, although whether current models have adequate computational scale remains a topic of debate.

  • The thread discusses scaling Transformers to 1 billion tokens.
  • Some commenters question how effectively attention mechanisms capture long-range dependencies in text sequences.
  • One comparison offered: the human brain has 150 trillion synapses (loosely analogous to parameters), while GPT-3 has 175 billion parameters.
  • Whether models like GPT-3 have enough computational scale, and how much further scaling is needed, is still debated.
  • The token count here refers to the context window: the number of tokens the model can process in a single sequence.

2) Centauri: Practical Rowhammer Fingerprinting

Summary:

Centauri is a Rowhammer-based fingerprinting technique that exploits manufacturing process variations in memory to produce fingerprints that are distinct across devices and stable for the same device.

Centauri Practical Rowhammer Fingerprinting: Building Unique and Stable Fingerprints

Source: arxiv.org - PDF - 14,522 words - view

Hacker News:

Centauri uses Rowhammer attacks to derive a fingerprint that uniquely identifies a computer.

  • Centauri: Practical Rowhammer Fingerprinting is a method to obtain a fingerprint of a computer using a Rowhammer attack.
  • This fingerprint can uniquely identify a computer, even among those with identical hardware and software.
  • The technique can be implemented in native code and possibly in JavaScript, though less reliably and more slowly.
  • There is currently no widespread and effective mitigation for Rowhammer techniques, making devices more vulnerable over time.
  • The design defect that allows Rowhammer to work has not been corrected, despite being known for almost a decade.
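
The matching side of fingerprinting can be sketched without touching any hardware. The example below assumes a fingerprint is simply the set of bit positions that flipped on a device, and compares sets with Jaccard similarity. Both choices are our own stand-ins for illustration, as are the device names, bit positions, and threshold; the paper's actual comparison method may differ.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of flipped-bit positions."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(observed, enrolled, threshold=0.5):
    """Return the enrolled device ID most similar to `observed`, or None."""
    best_id, best_score = None, 0.0
    for device_id, flips in enrolled.items():
        score = jaccard(observed, flips)
        if score > best_score:
            best_id, best_score = device_id, score
    # Reject the match if even the best candidate is not similar enough.
    return best_id if best_score >= threshold else None


# Hypothetical enrolled fingerprints (synthetic bit positions).
enrolled = {
    "device-a": {3, 17, 42, 99, 256},
    "device-b": {5, 17, 64, 128, 300},
}
# A new, slightly noisy observation that mostly overlaps with device-a.
observed = {3, 17, 42, 99, 301}
print(best_match(observed, enrolled))  # device-a
```

The threshold is what separates "same device with some noise" from "a different device", so in practice it would need to be tuned against real measurements.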

4) Conservation Laws for Gradient Flows

Summary:

The paper examines the geometric aspects of gradient-based training in machine learning, focusing on conservation laws, i.e., functions of the parameters that stay constant throughout optimization.
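
A small worked example may help show what a conserved quantity looks like. The toy model below, a product of two scalar weights u and v fitted to a target y, is our own illustrative choice and is not taken from the paper. Under gradient flow (the continuous-time limit of gradient descent), the difference u² − v² never changes:

```latex
\begin{aligned}
\ell(u, v) &= \tfrac{1}{2}\,(uv - y)^2, \\
\dot{u} &= -\frac{\partial \ell}{\partial u} = -(uv - y)\,v, \qquad
\dot{v} = -\frac{\partial \ell}{\partial v} = -(uv - y)\,u, \\
\frac{d}{dt}\bigl(u^2 - v^2\bigr) &= 2u\dot{u} - 2v\dot{v}
  = -2(uv - y)\,uv + 2(uv - y)\,uv = 0.
\end{aligned}
```

So whatever imbalance between u and v exists at initialization is preserved for the whole trajectory, regardless of the target y.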

The Geometric Aspects of Gradient Descent in Machine Learning

Source: arxiv.org - PDF - 19,160 words - view
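
Conservation laws are exact only in continuous time, so it is worth checking how well they survive a finite step size. The sketch below runs plain gradient descent on a toy loss of our own choosing, 0.5 * (u*v - y)**2, for which gradient flow keeps u² − v² constant. With step size lr, each discrete step multiplies u² − v² by (1 − lr² (uv − y)²), so the quantity drifts only slightly when lr is small. The starting values and hyperparameters are arbitrary assumptions.

```python
def gradient_descent(u, v, y, lr, steps):
    """Run gradient descent on 0.5 * (u*v - y)**2 and return the final (u, v)."""
    for _ in range(steps):
        residual = u * v - y
        # Simultaneous update: both gradients use the old (u, v).
        u, v = u - lr * residual * v, v - lr * residual * u
    return u, v


# Hypothetical starting point and target.
u0, v0, y = 2.0, 0.5, 3.0
u, v = gradient_descent(u0, v0, y, lr=1e-3, steps=5000)

print("u*v after training:", u * v)
print("u^2 - v^2 before:  ", u0 ** 2 - v0 ** 2)
print("u^2 - v^2 after:   ", u ** 2 - v ** 2)
```

The product u*v should land close to the target while u² − v² stays close to its initial value of 3.75; raising lr makes the drift more visible.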

5) Mixture-of-Experts Meets Instruction Tuning

Summary:

The paper discusses how instruction tuning benefits Mixture-of-Experts large language models compared with dense models.
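
For readers new to Mixture-of-Experts, the core idea is that a router picks a few experts per input instead of running the whole network, which is what sets these models apart from dense ones. Here is a minimal sketch of generic top-k routing on a single scalar input. The toy experts, router logits, and function names are all made up for illustration; this is not the specific architecture studied in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, top_k=2):
    """Route `x` to the top_k experts and mix their outputs by router weight."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize so the weights of the selected experts sum to 1.
    norm = sum(probs[i] for i in chosen)
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen


# Four hypothetical "experts", each just a simple function of x.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
y, chosen = moe_forward(3.0, [2.0, 2.0, -1.0, 0.0], experts, top_k=2)
print(chosen, y)  # experts 0 and 1 are selected with equal weight
```

Only the selected experts are evaluated, so the parameter count can grow with the number of experts while the compute per input stays roughly fixed.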

Mixture-of-Experts Meets Instruction Tuning

Source: arxiv.org - PDF - 15,911 words - view
