DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also ships with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.

The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that is still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
The Essentials

The DeepSeek-R1 paper introduced several models, chief among them R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't cover here.

DeepSeek-R1 combines two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of reasoning inside a <think> tag, followed by a final summary as the answer.
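As a minimal sketch, assuming the <think>...</think> tag format the R1 models use, this is how you might separate the reasoning trace from the final answer in a completion:

```python
import re

def split_r1_output(text: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning trace, final answer).

    Assumes the chain-of-thought is wrapped in <think>...</think> and the
    final summary follows the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after </think>
    return reasoning, answer

reasoning, answer = split_r1_output(
    "<think>2 + 2 = 4, and 4 - 1 = 3.</think>The answer is 3."
)
print(answer)  # -> The answer is 3.
```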
R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.

R1-Zero achieves impressive accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is fascinating that some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
Training Pipeline

The training pipeline published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage, including the problems the resulting model from each stage has and how they addressed them in the next stage.

It's interesting that their training pipeline differs from the usual one:

The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multi-stage training pipeline with multiple SFT and RL stages
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into <think> tags). When they were close to convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.

Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.

Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.

Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also performed model distillation on the reasoning traces for several Qwen and Llama models, producing the distilled R1 models.
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student.

The teacher is usually a larger model than the student.
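As a rough sketch of the data-generation half of distillation (the endpoint, model name, and prompts below are placeholders, not DeepSeek's actual setup), assuming an OpenAI-compatible API serving the teacher:

```python
# Minimal sketch of distillation data generation, assuming an OpenAI-compatible
# endpoint serving the teacher model. Names and prompts are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

prompts = ["Prove that the sum of two even numbers is even."]  # toy example

with open("distill_sft.jsonl", "w") as f:
    for prompt in prompts:
        # The teacher (an R1-style reasoner) produces a reasoning trace plus answer.
        completion = client.chat.completions.create(
            model="teacher-r1",  # hypothetical model identifier
            messages=[{"role": "user", "content": prompt}],
        )
        text = completion.choices[0].message.content
        # The student is later fine-tuned with ordinary SFT on these pairs.
        f.write(json.dumps({"prompt": prompt, "completion": text}) + "\n")
```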
Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful responses. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected format (for example, reasoning inside <think> tags), and if the language of the answer matches that of the prompt. Not depending on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
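To make that concrete, here is a toy rule-based reward function; the tags, weights, and language check are my own assumptions for illustration, not DeepSeek's actual implementation:

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy reward combining correctness, format, and language consistency.

    The tags, weights, and checks are illustrative assumptions only.
    """
    reward = 0.0

    # Format: the reasoning must be wrapped in <think>...</think>.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # Correctness: compare whatever follows </think> against a reference answer.
    final_answer = completion.split("</think>")[-1].strip()
    if reference_answer.strip() and reference_answer.strip() in final_answer:
        reward += 1.0

    # Language consistency: crude check that an English (ASCII) prompt
    # gets an English (ASCII) answer instead of mixed languages.
    if prompt.isascii() and final_answer.isascii():
        reward += 0.25

    return reward
```

Because these checks are cheap, deterministic rules, the reward can be computed for every sampled response without running a second model.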
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is than the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
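A minimal sketch of the group-relative normalization at the heart of step 3 (the clipped surrogate objective and KL penalty of the full update are omitted here):

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages for one prompt's sampled responses.

    This normalization is what GRPO uses in place of a learned critic:
    A_i = (r_i - mean(r)) / std(r) over the group of responses.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled responses, scored by a rule-based reward function.
print(grpo_advantages([1.75, 0.5, 0.0, 1.75]))
# Responses above the group mean get positive advantages, the rest negative;
# the policy is then nudged toward the higher-advantage responses.
```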
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a reward when the model correctly uses the expected syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. Finally, Yannic Kilcher has a great video explaining GRPO by walking through the DeepSeekMath paper.
Is RL on LLMs the path to AGI?

As a final note on DeepSeek-R1 and the methods presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video:

"These findings suggest that RL boosts the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities."

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses than about endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
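A toy example (with made-up numbers) of what that means in practice: if RL mostly shifts probability mass onto answers the base model could already produce, single-sample accuracy improves sharply while the ceiling reachable by sampling many times barely moves.

```python
# Toy illustration (made-up numbers) of sharpening vs. new capability:
# RL raises the probability of the correct answer, which helps pass@1 a lot,
# but pass@k for large k was already near the ceiling set by whether the
# pretrained model can produce the correct answer at all.
def pass_at_k(p_correct: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k

for label, p in [("pretrained", 0.15), ("after RL", 0.60)]:
    print(f"{label:>10}: pass@1 = {pass_at_k(p, 1):.2f}, pass@64 = {pass_at_k(p, 64):.2f}")
# pretrained: pass@1 = 0.15, pass@64 = 1.00
#   after RL: pass@1 = 0.60, pass@64 = 1.00
```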
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
671B via llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.
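For reference, a rough sketch of such a run using the llama-cpp-python bindings (this is not the exact invocation used here; the GGUF path and prompt are placeholders):

```python
# Rough sketch via llama-cpp-python; path and prompt are placeholders, and the
# KV-cache quantization is left at llama.cpp's defaults for simplicity.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # Unsloth 1.58-bit quant (placeholder path)
    n_gpu_layers=29,  # partial offload: 29 layers on the H100, the rest on CPU
    n_ctx=8192,       # context window; adjust to available memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many days are between 2024-01-01 and 2024-03-01?"}],
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```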
Performance:

An r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2,000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also often higher. We need to both maximize usefulness and minimize time-to-usefulness.
70B via Ollama

70.6B params, 4-bit KM quantized DeepSeek-R1, running via Ollama:

GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of the 671B model showcased above.
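A sketch of the same run through the Ollama Python client (the model tag below is an assumption; check the Ollama library for the exact name of the 70B 4-bit quant):

```python
# Sketch using the ollama Python client; the model tag is an assumption.
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "Explain GRPO in two sentences."}],
)
print(response["message"]["content"])
```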
Resources

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandma - YouTube
DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1, a fully open-source reproduction of DeepSeek-R1 (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.