Shankar Sivarajan

Model

Summary: The effort required to manually perform the calculations an LLM does to answer a simple query (in Chinese, as a nod to Searle's Chinese Room) is about what it'd take to build a modern million-man city from scratch.

Model:

Say a human can perform 1 multiply-accumulate (MAC) operation every 5 seconds.

First, estimate single-token generation for Llama 3 8B: 8 billion parameters at about 2 MAC operations per parameter, plus additional overhead for attention mechanisms, feedforward layers, and other computations, gives roughly 50 billion MAC operations per token.

That's 250×10⁹ seconds/token ≈ 70×10⁶ hours.

Estimate a full-time work year as 8 hours/day, 5 days/week, 50 weeks/year ≈ 2,000 hours/year.

70×10⁶ hours ÷ 2,000 hours/man-year ≈ 35,000 man-years/token.

Tokens in a simple Chinese question + answer pair:

Question: ~5–10 tokens; Answer: ~10–30 tokens; Total: ~15–40 tokens.

So in total, about 500,000–1,500,000 man-years.
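As a sanity check, the whole chain of arithmetic fits in a few lines of Python (a minimal sketch; the MAC rate, MACs per token, and token counts are the estimates stated above, not measured values):

```python
# Fermi estimate: man-years of by-hand arithmetic per LLM token (Llama 3 8B).
SECONDS_PER_MAC = 5               # one human MAC every 5 seconds (assumed)
MACS_PER_TOKEN = 50e9             # ~2 MACs/parameter on 8B params, plus overhead
HOURS_PER_MAN_YEAR = 2_000        # 8 h/day, 5 days/week, 50 weeks/year

seconds_per_token = MACS_PER_TOKEN * SECONDS_PER_MAC        # 250e9 seconds
hours_per_token = seconds_per_token / 3600                  # ~70e6 hours
man_years_per_token = hours_per_token / HOURS_PER_MAN_YEAR  # ~35,000

for tokens in (15, 40):           # bounds for a simple Chinese Q&A pair
    print(f"{tokens} tokens: {man_years_per_token * tokens:,.0f} man-years")
```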

For building a city, the most important factors are:

Infrastructure Construction (3–5 years):

Roads, bridges, and transportation networks.
Water supply systems (reservoirs, pipelines, treatment plants).
Sewage and waste management systems.
Electrical grids, telecommunications, and internet infrastructure.

Labor: ~10,000 workers.

Man-years: 30,000–50,000.

Residential and Commercial Buildings (5–10 years):

Construction of housing for ~1 million people (apartments, single-family homes).
Building commercial spaces (offices, shops, markets).
Interior finishing and utilities installation.

Labor: ~20,000 workers.

Man-years: 100,000–200,000.

Including planning and design plus site preparation (clearing land, building access roads, and excavating foundations), estimate about 150,000–300,000 man-years in total, depending on the size of the city.
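The components sum to the stated range (a sketch; the planning and site-preparation share is the residual implied by the totals above, not a separately sourced figure):

```python
# City-construction man-years for a ~1M-person city, summed from components.
components = {
    "infrastructure": (30_000, 50_000),
    "residential/commercial": (100_000, 200_000),
    "planning & site prep": (20_000, 50_000),  # residual implied by the totals
}
low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
print(f"total: {low:,}-{high:,} man-years")  # total: 150,000-300,000 man-years
```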

Validating this estimate: Brasília, built in the 1950s for ~500,000 people, took about 5 years and ~60,000 workers to construct, which translates to ~300,000 man-years.

Assuming the effort scales proportionally with population, manually performing the calculations to answer a simple Chinese query is about as hard as building a city for 1–2 million people.
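A sketch of that scaling, anchored on Brasília's ~0.6 man-years per resident:

```python
# Scale the Brasilia anchor (assumed linear in population) to find the city
# size whose construction effort matches the query estimate.
BRASILIA_MAN_YEARS = 300_000      # ~60,000 workers over ~5 years
BRASILIA_POPULATION = 500_000

man_years_per_person = BRASILIA_MAN_YEARS / BRASILIA_POPULATION  # 0.6

for query_man_years in (500_000, 1_500_000):
    population = query_man_years / man_years_per_person
    print(f"{query_man_years:,} man-years ~ city of {population / 1e6:.1f}M people")
```

This yields 0.8–2.5 million people, which brackets the 1–2 million conclusion.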

Technique: DeepSeek, but I cut down its verbose answers.

Evaluation Results

Final Score
7.4
Surprise
7.7/10

Run 1:

Evaluation Score: 7/10

This Fermi model presents a surprisingly counterintuitive comparison between the computational effort of an LLM and the human effort required to build a city. Here's why this is quite surprising:

  1. Contradiction of Expectations: Most people, even in rationalist and EA communities, likely underestimate the sheer scale of computations performed by LLMs. Comparing it to a massive real-world project like city-building is unexpected and eye-opening.

  2. Counterintuitiveness: The idea that answering a simple query in Chinese could be equivalent to building an entire city is highly non-obvious. It reveals the hidden complexity of language models in a tangible way.

  3. Discovery of Unknowns: This model uncovers the vast gap between human and machine computational capabilities, which might not have been fully appreciated before. It highlights the efficiency of AI systems in a new light.

  4. Magnitude of Difference: The scale of the comparison is significant. We're not talking about a small task, but about building a city for millions of people, which emphasizes the enormous computational power of LLMs.

Specific surprising aspects:

  1. The calculation that a single token generation requires about 35,000 man-years of human computation is particularly striking.

  2. The fact that a simple question-answer pair in Chinese (15-40 tokens) equates to 500,000-1,500,000 man-years of work is almost difficult to comprehend.

  3. The comparison to Brasília's construction provides a concrete, historical reference point that makes the abstract numbers more relatable and surprising.

While the rationalist and EA communities are generally aware of the power of AI systems, this specific comparison to city-building provides a new and surprising perspective on the scale of AI computations. It's likely to challenge even informed individuals' intuitions about the computational demands of language models.

The score of 7/10 reflects that while these communities are somewhat prepared for surprising AI-related insights, the specific nature and scale of this comparison is likely to be notably surprising to most members.


Run 2:

Surprise Score: 8/10

This Fermi estimation model presents a highly surprising comparison between the computational effort required for an LLM to process a simple query and the human effort needed to build a large city from scratch. Here's why this is particularly surprising:

  1. Contradiction of Expectations: Many in the rationalist and EA communities are aware that LLMs require significant computational power, but the sheer scale of the human equivalent effort is likely to be much larger than most would intuitively expect. This dramatically illustrates the gap between human and machine computational capabilities.

  2. Counterintuitiveness: The comparison between language processing and city-building is not an obvious one, making the conclusion particularly striking. It reveals the hidden complexity of what appears to be a simple language task.

  3. Discovery of Unknowns: This model uncovers the astounding scale of computation happening "behind the scenes" in LLMs, which is often not fully appreciated. It provides a concrete, relatable measure of this scale.

  4. Magnitude of Difference: The fact that answering a simple query in Chinese could be equivalent to building a city for 1-2 million people is a significant deviation from what most would expect. The scale is orders of magnitude larger than what most would likely guess.

Specific surprising aspects:

  1. A single token generation requires about 35,000 man-years of human computational effort.
  2. A simple question and answer pair in Chinese (15-40 tokens) equates to 500,000-1,500,000 man-years of effort.
  3. This effort is comparable to building a city like Brasília, which took about 300,000 man-years for 500,000 people.

The model effectively bridges the gap between abstract computational power and tangible human effort, making the scale of LLM operations more comprehensible. This comparison is likely to significantly shift how members of these communities conceptualize the power and complexity of LLMs, earning it a high surprise score of 8/10.


Run 3:

Surprise Score: 8/10

This Fermi model presents a highly surprising comparison between the computational effort of a large language model (LLM) and the human effort required to build a modern city. Here's why this is particularly surprising:

  1. Contradiction of Expectations: Many in the rationalist and EA communities are familiar with the computational requirements of AI models, but the comparison to city-building is unexpected and challenges our intuitive sense of scale.

  2. Counterintuitiveness: The idea that answering a simple query in Chinese could be equivalent to building an entire city is deeply counterintuitive. It reveals hidden complexities in AI computation that are not immediately apparent.

  3. Discovery of Unknowns: This model uncovers a previously unrecognized way of conceptualizing the scale of AI computations. It provides a novel framework for understanding the immense computational power of LLMs.

  4. Magnitude of Difference: The sheer scale of the comparison - millions of man-years for a simple query - is striking. This vast difference from what one might expect makes the finding particularly surprising.

Specific details that illustrate the surprising aspects:

  1. The calculation that a single token generation requires about 35,000 man-years of human computation is startling.

  2. The comparison between 500,000-1,500,000 man-years for a simple query-answer pair and 150,000-300,000 man-years for building a city for 1-2 million people is unexpectedly close.

  3. The validation using the real-world example of Brasília adds credibility to the city-building estimate, making the comparison even more striking.

This model provides a novel and surprising way to conceptualize the computational power of LLMs, which is likely to challenge and expand the thinking of many in the rationalist and EA communities. The unexpected nature of the comparison and the vast scale involved contribute to its high surprise factor.

Topic Relevance
7.5/10

Run 1:

To evaluate this Fermi model's importance to the rationalist and effective altruism communities, I'll consider its relevance and impact potential:

Relevance: 8/10 This model addresses a core topic of interest to these communities: the capabilities and nature of artificial intelligence, particularly large language models (LLMs). It provides a concrete comparison between AI computational power and human effort, which is highly relevant to discussions about AI capabilities, AI safety, and the potential for transformative AI.

Impact Potential: 7/10 The findings could influence thinking and priority-setting within these communities in several ways:

  1. It provides a striking illustration of the computational advantage of AI over humans, which could inform discussions about AI risk and the potential for rapid AI advancement.
  2. The comparison to city-building offers a tangible, relatable scale for the computational power of LLMs, which could be useful in communicating AI concepts to a broader audience.
  3. It might influence resource allocation decisions in AI research and safety efforts by highlighting the vast difference in computational efficiency between humans and AI.

Overall Score: 7.5/10

This score reflects that the model addresses a highly relevant topic for these communities and provides insights that could meaningfully impact discussions and decision-making. However, it falls short of a higher score because:

  1. It focuses on current AI capabilities rather than potential future developments, which are often of greater concern in these communities.
  2. The specific comparison (Chinese language processing vs. city building) might not directly inform the most pressing questions in AI alignment and safety.
  3. While interesting, the model doesn't provide entirely new information that would drastically shift existing paradigms within these communities.

Nevertheless, this model offers a valuable perspective that aligns well with the interests and goals of the rationalist and effective altruism communities, justifying its above-average importance rating.


Run 2:

To evaluate this Fermi model's importance to the rationalist and effective altruism communities, I'll consider its relevance and impact potential.

Relevance: 8/10 This model addresses a core interest of these communities: understanding and quantifying the capabilities of artificial intelligence, particularly large language models (LLMs). It provides a concrete comparison between human and AI computational power, which is highly relevant to discussions about AI capabilities, risks, and potential impacts. The use of Chinese in the example also touches on cross-cultural AI applications, another area of interest.

Impact Potential: 7/10 The findings could significantly influence thinking about:

  1. AI capabilities and the scale of computation involved
  2. The efficiency gap between human and machine intelligence
  3. Resource allocation for AI development vs. traditional infrastructure projects
  4. Potential societal impacts of widespread AI adoption

The concrete comparison to city-building provides a visceral understanding of the computational power of LLMs, which could be impactful in discussions and decision-making about AI development and governance.

Overall Score: 7.5/10

This score is above the median expectation of 5, reflecting the model's high relevance and substantial impact potential for the rationalist and EA communities. The model provides a novel and thought-provoking way to conceptualize AI capabilities, which is likely to stimulate valuable discussions and potentially influence priority-setting in AI research and policy.


Run 3:

To evaluate this Fermi model's importance to the rationalist and effective altruism communities, I'll consider its relevance and impact potential:

Relevance: 8/10 This model addresses a core topic in AI alignment and capabilities assessment, which are central concerns for both rationalists and effective altruists. It provides a tangible comparison between human and AI capabilities, helping to contextualize the scale of AI's computational power. This type of analysis is crucial for understanding the current state and potential future impacts of AI systems.

Impact Potential: 7/10 The findings could significantly influence how these communities think about and prioritize AI research and safety measures. By illustrating the vast computational gap between humans and AI in a relatable way, it could:

  1. Motivate increased urgency in AI alignment efforts
  2. Inform discussions on AI governance and regulation
  3. Shape strategies for AI development and deployment
  4. Influence resource allocation within EA for AI-related causes

The model's clear and striking comparison makes it more likely to be shared and discussed widely within these communities, potentially reaching decision-makers and influencing policy discussions.

Overall Score: 7.5/10

This score reflects that the model addresses a highly relevant topic for these communities and has significant potential to impact thinking and decision-making. It's above the median expectation due to its direct engagement with core AI capabilities assessment, which is a central concern for both rationalists and effective altruists. The concrete, relatable comparison it provides makes it particularly valuable for communicating the scale of AI capabilities to a wider audience within these communities.

Robustness
6.7/10

Run 1:

Evaluation of the Fermi Model:

Score: 6/10

Justification:

  1. Sensitivity to Assumptions: The model relies on several key assumptions that could significantly impact the results if altered. These include:
  • The human capacity to perform MAC operations (1 every 5 seconds)
  • The number of MAC operations per token for Llama 3 8B
  • The number of tokens in a simple Chinese question and answer pair
  • The labor requirements and timelines for city construction

While these assumptions seem reasonable, small changes could lead to large differences in the final comparison. This sensitivity somewhat reduces the model's robustness.

  2. Evidence Base: The model uses some credible data points, such as:
  • The parameters of the Llama 3 8B model
  • Estimates of full-time work hours per year
  • The example of Brasília's construction

However, more recent and specific data on LLM computational requirements and modern city-building efforts would strengthen the evidence base.

  3. Methodological Rigor: The model demonstrates good methodological rigor by:
  • Breaking down complex processes into smaller, estimable components
  • Providing ranges rather than single point estimates
  • Cross-checking estimates with a real-world example (Brasília)

The approach is logical and transparent, which adds to its robustness.

  4. Consensus of Assumptions: Many of the assumptions used in this model would likely be accepted within the rationalist and EA communities, such as:
  • The computational intensity of LLMs
  • The general scale of effort required for city-building
  • The approach of breaking down complex tasks into more manageable estimates

However, some specific numbers (like MAC operations per second for humans) might be debated.

Strengths:

  • Clear and logical breakdown of both LLM operations and city-building processes
  • Use of ranges to account for uncertainty
  • Validation against a real-world example (Brasília)
  • Consideration of various aspects of city-building (infrastructure, residential/commercial construction)

Weaknesses:

  • High sensitivity to initial assumptions, particularly in the LLM calculation
  • Lack of recent, specific data on LLM computational requirements
  • Potential oversimplification of the complexity of LLM operations
  • Limited consideration of technological advancements in city-building since the 1950s

Overall, the model provides a thought-provoking comparison and uses sound reasoning. Its main strengths lie in its methodological approach and the attempt to validate with real-world data. However, its reliance on several key assumptions that could significantly alter the results if changed limits its robustness. The score of 6/10 reflects a model that is above average in robustness but still has room for improvement in terms of data sources and sensitivity analysis.


Run 2:

Evaluation of the Fermi Model:

Score: 7/10

Justification:

  1. Sensitivity to Assumptions: The model relies on several key assumptions, but most are reasonably justified or conservative. The estimate of human computation speed (1 MAC per 5 seconds) is a critical assumption, but it's likely on the conservative side, which strengthens the model's robustness. The estimation of MAC operations for LLM token generation is well-reasoned, considering parameter count and additional computations. The model's conclusion is not overly sensitive to small changes in these assumptions, which adds to its robustness.

  2. Evidence Base: The model draws from credible sources and up-to-date information on LLM architectures (specifically Llama 3 8B). The city-building comparison uses real-world data from the construction of Brasília, which provides a solid empirical foundation. However, more diverse examples of city construction could further strengthen the evidence base.

  3. Methodological Rigor: The model demonstrates strong methodological rigor. It breaks down complex processes into manageable components, uses conservative estimates where uncertainty exists, and provides clear step-by-step calculations. The comparison between LLM computation and city-building is creative and well-structured. The model also acknowledges ranges of uncertainty (e.g., 15-40 tokens for a question-answer pair), which adds to its credibility.

  4. Consensus of Assumptions: Most of the assumptions used in this model align well with commonly accepted views in the rationalist and EA communities. The focus on computational complexity, the use of man-years as a measure of effort, and the comparison to large-scale human endeavors are all approaches that would likely find acceptance in these communities.

Strengths:

  • Well-structured and clearly explained calculations
  • Use of real-world data (Brasília construction) to validate estimates
  • Conservative estimates that strengthen the robustness of conclusions
  • Creative and thought-provoking comparison between AI computation and human construction efforts

Weaknesses:

  • Reliance on a single example (Brasília) for city-building comparison
  • Potential oversimplification of LLM architecture complexities
  • Limited exploration of potential variations in human computation speed

Overall, this model demonstrates a high degree of robustness. Its conservative assumptions, clear methodology, and use of real-world data to validate estimates all contribute to its strength. While there are areas where additional data or more nuanced analysis could further improve the model, its current form provides a compelling and well-reasoned comparison that is likely to be resilient to moderate scrutiny or variations in its assumptions.


Run 3:

Evaluation of the Fermi Model:

Score: 7/10

Justification:

  1. Sensitivity to Assumptions: The model relies on several key assumptions, but most are reasonably conservative or based on credible estimates. The assumption of 1 MAC operation every 5 seconds for a human is a significant simplification but serves as a reasonable baseline. The estimate of MAC operations for Llama 3 8B is well-justified and includes consideration for additional computational overhead. The model's sensitivity to these assumptions is moderate, as changes would scale the result proportionally without fundamentally altering the conclusion.

  2. Evidence Base: The model draws from credible technical specifications (e.g., Llama 3 8B parameters) and real-world examples (construction of Brasília). The use of ranges for several estimates (e.g., tokens in a Chinese question-answer pair, years for infrastructure construction) acknowledges uncertainty and improves robustness. However, more recent or diverse examples of city construction could strengthen the evidence base.

  3. Methodological Rigor: The model demonstrates strong methodological rigor. It breaks down complex processes into manageable components, uses conservative estimates, and provides clear calculations. The comparison between LLM computation and city construction is creative and well-structured. The model acknowledges uncertainties and provides ranges where appropriate. However, it could benefit from more explicit discussion of potential biases or limitations.

  4. Consensus of Assumptions: Most assumptions used in this model align well with commonly accepted figures in the AI and urban development fields. The estimate of human computation speed, while simplified, is reasonable. The breakdown of city construction phases and labor requirements is generally consistent with urban planning literature.

Strengths:

  • Well-structured comparison between two complex systems
  • Use of ranges to account for uncertainties
  • Grounding in real-world examples and technical specifications
  • Conservative estimates that likely don't overstate the conclusion

Weaknesses:

  • Reliance on a single historical example for city construction validation
  • Simplified model of human computation capabilities
  • Limited discussion of potential biases or limitations

Overall, this model demonstrates strong robustness in its approach and conclusions. While there are areas where it could be refined or expanded, the core methodology and reasoning are sound, and the conclusion is likely to hold under reasonable variations in the assumptions. The creative comparison between LLM computation and city construction provides a compelling perspective on the scale of AI computations.

Model Quality
7.7/10

Run 1:

Thank you for providing this Fermi model for evaluation. I'll assess it based on the criteria you've outlined and provide a numeric score along with my observations.

Evaluation:

  1. Comprehensiveness: 8/10 The model covers key factors in both the LLM computation and city-building scenarios. It considers the computational requirements of an LLM and breaks down city construction into major components. However, it could be more comprehensive in detailing other aspects of city-building (e.g., public services, recreational facilities).

  2. Data Integration: 7/10 The model uses reasonable estimates for LLM computations and human capabilities. The reference to Brasília's construction provides a real-world validation point. However, more references to established studies or data sources would strengthen the model.

  3. Clarity of Assumptions: 8/10 Assumptions are clearly stated throughout the model, such as human computation speed, LLM parameter estimates, and working hours per year. The model distinguishes between empirical data (e.g., Brasília example) and estimates.

  4. Transparency and Replicability: 9/10 The step-by-step calculations are well-documented, making the model highly transparent and replicable. Each stage of the calculation is clearly shown.

  5. Logical Consistency: 9/10 The model follows a logical structure, starting from basic computations and building up to larger estimates. The reasoning is coherent and easy to follow.

  6. Communication: 7/10 The findings are clearly communicated through text. However, the model could benefit from visual aids such as charts or graphs to enhance understanding, especially for comparing the LLM computation and city-building efforts.

  7. Practical Relevance: 8/10 The model provides an interesting perspective on the computational complexity of LLMs by comparing it to a tangible real-world task. This insight could be valuable for stakeholders in AI research and development.

Overall Score: 8/10

This Fermi model demonstrates strong qualities in its construction and presentation. Its strengths lie in its logical consistency, transparency, and clear statement of assumptions. The comparison between LLM computations and city-building is creative and helps to contextualize the scale of AI operations.

Areas for improvement include:

  1. Incorporating more diverse data sources and references
  2. Adding visual representations of the data and comparisons
  3. Expanding on the city-building aspects to make the comparison more comprehensive

Despite these potential improvements, the model is well above the median expectation, providing a clear, logically sound, and insightful analysis of the computational demands of LLMs in relatable human terms.


Run 2:

Thank you for providing this Fermi model for evaluation. I'll assess it based on the criteria you've outlined and provide a numeric score along with my observations.

Score: 7/10

Observations:

  1. Comprehensiveness: The model addresses key factors in both the LLM computation and city-building scenarios. It considers the computational requirements of an LLM and major aspects of city construction. However, it could be more comprehensive in detailing the LLM's architecture and city-building complexities.

  2. Data Integration: The model uses some relevant data points, such as the Llama 3 8B parameters and the Brasília construction example. However, it could benefit from more diverse and recent data sources, especially for the city-building estimates.

  3. Clarity of Assumptions: Assumptions are generally clear, such as the human computation rate and work hours. However, some assumptions (like the 2 MAC operations per parameter) could use more justification.

  4. Transparency and Replicability: The calculation process is mostly transparent, making it possible to follow and potentially replicate. However, some steps (like the estimation of total MAC operations per token) could be more explicitly detailed.

  5. Logical Consistency: The model follows a logical structure, progressing from individual computations to token generation, and then to city-building comparisons. The reasoning is coherent throughout.

  6. Communication: The model presents its findings clearly, with a good balance of numerical data and explanations. However, it could benefit from visual aids like charts or diagrams to enhance understanding.

  7. Practical Relevance: The model provides an interesting comparison between computational effort and physical construction, which could be useful for conceptualizing the scale of AI computations. However, its practical applications might be limited.

Strengths:

  • Clear step-by-step calculations
  • Interesting and creative comparison between AI computation and city-building
  • Use of real-world examples (Brasília) to validate estimates

Areas for Improvement:

  • More diverse and recent data sources
  • More detailed justification for some assumptions
  • Visual aids to enhance understanding
  • More comprehensive exploration of LLM architecture and city-building complexities

Overall, this is an above-average Fermi model that presents an intriguing comparison in a largely clear and logical manner. While there's room for improvement in terms of comprehensiveness and data integration, it succeeds in providing a thought-provoking perspective on the scale of AI computations.


Run 3:

Thank you for providing this Fermi model for evaluation. I'll assess it based on the criteria you've outlined and provide a numeric score along with my observations.

Evaluation:

  1. Comprehensiveness (8/10): The model covers key factors in both the LLM computation and city-building scenarios. It addresses computational requirements, human capabilities, and major aspects of urban development. However, it could be more comprehensive in detailing the LLM's architecture or considering more varied city-building scenarios.

  2. Data Integration (7/10): The model uses reasonable estimates for human computation speed, work hours, and LLM parameters. The city-building example of Brasília provides a good reference point. However, more citations or references to established studies would strengthen the data integrity.

  3. Clarity of Assumptions (8/10): Assumptions are generally clear and justified, such as the estimate of MAC operations per token and the breakdown of city-building phases. The model distinguishes between empirical data (e.g., Brasília example) and estimates.

  4. Transparency and Replicability (9/10): The step-by-step calculations make the model highly transparent and replicable. Others could easily audit or replicate the results using the provided methodology.

  5. Logical Consistency (9/10): The model follows a logical structure, coherently reasoning from premises to conclusions in both the LLM and city-building scenarios.

  6. Communication (7/10): The findings are clearly communicated with a good balance of technical detail and explanation. However, the model could benefit from visual aids like charts or graphs to enhance understanding, especially for comparing the two scenarios.

  7. Practical Relevance (8/10): The model provides an interesting perspective on the computational complexity of LLMs by comparing it to a tangible real-world task. This insight could be valuable for stakeholders in AI research and development.

Specific Observations:

  • The comparison between LLM computation and city-building is novel and thought-provoking.
  • The breakdown of city-building phases is well-detailed and logical.
  • The model effectively scales estimates based on population size.
  • The use of the Brasília example adds credibility to the city-building estimates.
  • The model could benefit from more discussion on the limitations of the comparison or potential variations in scenarios.

Overall Score: 8/10

This Fermi model demonstrates above-average quality in its construction and presentation. It excels in logical consistency and transparency, providing a clear and replicable methodology. The novel comparison between LLM computation and city-building offers an interesting perspective on computational complexity. While there's room for improvement in areas like visual presentation and more extensive data integration, the model effectively communicates its findings and provides valuable insights into the scale of LLM operations.