DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but its weights are also fully MIT-licensed. That makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly exciting is its openness. Unlike the more closed approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.

The model is also remarkably cost-effective, with input tokens priced at $0.14-0.55 per million (vs. o1's $15) and output tokens at $2.19 per million (vs. o1's $60).
Until roughly GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
The Essentials
The DeepSeek-R1 paper presented several models, but chief among them were R1 and R1-Zero. There is also a series of distilled models that, while interesting, I won't cover here.

DeepSeek-R1 builds on two key ideas:

1. A multi-stage pipeline where a small amount of cold-start data kickstarts the model, followed by large-scale RL.

2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that compares multiple model outputs per prompt, avoiding the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they perform Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a tag before answering with a final summary.
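To make this concrete, here is a minimal Python sketch of splitting an R1-style completion into its reasoning and final answer, assuming the `<think>...</think>` tag convention used by the R1 chat template (the example completion is mine):

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split an R1-style completion into (chain_of_thought, final_answer).

    Assumes the reasoning is wrapped in <think>...</think> tags.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        # No reasoning block found; treat the whole output as the answer.
        return "", output.strip()
    return match.group(1).strip(), output[match.end():].strip()

completion = "<think>2 + 2 is a basic sum; the result is 4.</think>The answer is 4."
cot, answer = split_reasoning(completion)
print(cot)     # 2 + 2 is a basic sum; the result is 4.
print(answer)  # The answer is 4.
```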
R1-Zero vs R1
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.

R1-Zero attains excellent accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.

It is fascinating that some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
Training Pipeline
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they built such strong reasoning models, and what you can expect from each stage, including the problems the resulting model from each stage has and how they addressed them in the next stage.

It's interesting that their training pipeline differs from the usual one:
The typical training approach: pretraining on a large dataset (training the model to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF

R1-Zero: Pretrained → RL

R1: Pretrained → Multi-stage training pipeline with multiple SFT and RL stages
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to give the RL process a good starting point.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were close to convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but one with weak general capabilities, e.g., poor formatting and language mixing.

Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.

Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.

Second RL Stage: Add more reward signals (helpfulness, harmlessness) on top of the reasoning rewards to refine the final model. The result is DeepSeek-R1.
They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain the distilled R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.

The teacher is typically a larger model than the student.
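As a rough sketch of the data-generation half of distillation (this is not DeepSeek's recipe; the checkpoint path and prompts below are placeholders), you let the teacher produce reasoning traces and save them as ordinary SFT data for the student:

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "path/to/large-reasoning-teacher"  # placeholder checkpoint
prompts = [
    "What is 17 * 23? Think step by step.",
    "Prove that the sum of two even numbers is even.",
]

tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

records = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(teacher.device)
    output_ids = teacher.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    records.append({"prompt": prompt, "completion": completion})

# The resulting file is then used for plain supervised fine-tuning of the
# smaller student model (e.g. a Qwen or Llama checkpoint).
with open("distillation_sft.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```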
Group Relative Policy Optimization (GRPO)
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful responses.

They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they incentivize the R1 model to generate chain-of-thought reasoning through RL training with GRPO.

Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.

Instead of depending on expensive external models or human-graded examples as in conventional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt.

Not depending on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
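As an illustration, a rule-based reward along those lines could look like the toy function below; this is my simplified take on the checks described in the paper, not DeepSeek's actual reward code:

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy reward: correctness + formatting + language consistency."""
    reward = 0.0

    # Formatting: the reasoning must be wrapped in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # Correctness: the text after the think block must match the reference.
    final_answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    if final_answer == reference_answer.strip():
        reward += 1.0

    # Language consistency (very crude): answer uses the same script as the prompt.
    def has_cjk(text: str) -> bool:
        return any("\u4e00" <= ch <= "\u9fff" for ch in text)

    if has_cjk(prompt) == has_cjk(final_answer):
        reward += 0.25

    return reward

print(rule_based_reward("What is 2 + 2?", "<think>2 and 2 make 4.</think>4", "4"))  # 1.75
```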
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
1. For each input prompt, the model generates several different responses.

2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.

3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.

4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes minor adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior (a minimal numerical sketch of these steps follows below).
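Here is a minimal NumPy sketch of steps 2-4 for a single prompt (a didactic toy with made-up numbers, not DeepSeek's implementation): rewards are normalized within the group to get advantages, and the update uses a clipped, PPO-style ratio.

```python
import numpy as np

# Step 2: scalar rewards for a group of sampled completions for one prompt.
rewards = np.array([1.0, 0.2, 0.0, 1.0, 0.5])

# Step 3: group-relative advantages - how much better each completion is than
# the group average, scaled by the group's standard deviation.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Step 4: clipped policy-gradient objective (per-completion here; per-token in
# practice), so each update only nudges the policy a little.
logprob_new = np.array([-1.0, -2.0, -2.5, -1.2, -1.8])  # under the current policy
logprob_old = np.array([-1.1, -1.9, -2.4, -1.3, -1.7])  # under the sampling policy
ratio = np.exp(logprob_new - logprob_old)
eps = 0.2
objective = np.minimum(ratio * advantages,
                       np.clip(ratio, 1 - eps, 1 + eps) * advantages).mean()

# GRPO additionally subtracts a KL penalty towards a reference policy
# (beta * KL(pi_theta || pi_ref)), omitted here for brevity.
print(advantages.round(2), round(float(objective), 4))
```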
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a reward when the model correctly uses the expected syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
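For reference, training with TRL's GRPOTrainer looks roughly like the snippet below; this mirrors the example in the TRL documentation as I remember it, so treat the dataset, model name, and toy reward function as placeholders and check the current docs for the exact API.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: prefer completions close to 200 characters.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="qwen2-0.5b-grpo", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```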
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
Is RL on LLMs the path to AGI?
As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

Put differently, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
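A toy numerical illustration of that "sharpening" effect (mine, not from the paper): RL reweights answers the pretrained model can already produce, so the greedy answer becomes correct without any new answer entering the distribution.

```python
import numpy as np

# Four candidate answers the pretrained model already produces for one prompt;
# the correct one is in the distribution but is not the most likely.
answers = ["41", "42", "43", "44"]
correct = "42"
p_pretrained = np.array([0.40, 0.25, 0.20, 0.15])

# Model RL fine-tuning as exponentially reweighting that same support by
# reward (1 for correct, 0 otherwise); no new answers are added.
reward = np.array([1.0 if a == correct else 0.0 for a in answers])
p_rl = p_pretrained * np.exp(3.0 * reward)
p_rl /= p_rl.sum()

print("greedy answer before RL:", answers[int(p_pretrained.argmax())])  # 41 (wrong)
print("greedy answer after RL: ", answers[int(p_rl.argmax())])          # 42 (correct)
print("support is unchanged:   ", answers)
```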
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.

Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
Running DeepSeek-R1
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.

The main goal was to see how the model would perform when deployed on a single H100 GPU, not to thoroughly evaluate the model's capabilities.
671B via llama.cpp
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers appeared to be the sweet spot given this configuration.
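For anyone who prefers driving llama.cpp from Python, a roughly equivalent setup through the llama-cpp-python bindings might look like this (the GGUF path is a placeholder for wherever the Unsloth quant was downloaded; the KV-cache quantization options are left at their defaults here):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path to the Unsloth 1.58-bit quant
    n_gpu_layers=29,   # partial offload: 29 layers on the GPU, the rest on CPU
    n_ctx=4096,        # context window; larger values need more memory
)

output = llm(
    "How many R's are there in the word strawberry? Think step by step.",
    max_tokens=1024,
)
print(output["choices"][0]["text"])
```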
Performance:
A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming rig.

Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't really bearable for any serious work, but it's fun to run these large models on accessible hardware.
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher.

We need to both maximize usefulness and minimize time-to-usefulness.
70B via Ollama
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of the 671B model that I showcased above.
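For completeness, the same model can be queried programmatically through the Ollama Python client; a minimal sketch, assuming the distilled 70B model has been pulled under the deepseek-r1:70b tag:

```python
import ollama  # pip install ollama; assumes `ollama pull deepseek-r1:70b` was run

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "How many R's are there in the word strawberry?"}],
)
print(response["message"]["content"])
```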
Resources
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).

DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.

The Illustrated DeepSeek-R1 - by Jay Alammar.

Explainer: What's R1 & Everything Else? - Tim Kellogg.

DeepSeek R1 Explained to your grandmother - YouTube

DeepSeek

- Try R1 at chat.deepseek.com.

GitHub - deepseek-ai/DeepSeek-R1.

deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.

DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that matches the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.

DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.

DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
Interesting events
- Hong Kong University replicates R1 results (Jan 25, '25).

- Huggingface announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1, completely open source (Jan 25, '25).

- An OpenAI researcher confirms the DeepSeek team independently found and used some of the core ideas the OpenAI team used on the way to o1.
Liked this post? Join the newsletter.