🔧 Build Log: GPT-Powered Research Assistant

Step-by-step walkthrough of building an AI research assistant that automatically curates, analyzes, and organizes academic papers using GPT-4, Airtable, and Apify.
Project: Automated AI research paper curation and analysis system
Tech Stack: GPT-4, Airtable, Apify, Google Cloud Functions, Python
Status: MVP complete, iterating on analysis quality
Build Time: ~20 hours over 2 weeks
The Problem
I was drowning in AI research papers. arXiv alone publishes 50+ AI papers daily, and manually tracking what’s important was becoming impossible. I needed a system that could:
- Monitor key research sources automatically
- Filter papers by relevance to my interests
- Analyze papers for key insights and practical implications
- Organize findings in a searchable, actionable format
Solution Architecture
arXiv/Papers → Apify Scraper → GPT-4 Analysis → Airtable Database → Alerts/Summaries
Why This Stack:
- Airtable: Perfect for structured research data with rich fields
- Apify: Reliable web scraping with built-in scheduling
- GPT-4: Best available model for research analysis and summarization
- Google Cloud Functions: Serverless orchestration and cost control
Implementation Walkthrough
Step 1: Setting Up Airtable Base
Created a research database with these key fields:
Papers Table:
- Title (Single line text)
- Authors (Multiple select)
- Abstract (Long text)
- PDF URL (URL)
- arXiv ID (Single line text)
- Publication Date (Date)
- Research Area (Multiple select: LLM, Agents, Scaling, etc.)
- Relevance Score (Number 1-10)
- Key Insights (Long text)
- Practical Implications (Long text)
- Status (Select: New, Analyzed, Archived)
- Tags (Multiple select)
Key Design Decision: Separate analysis fields from paper metadata to track GPT-4’s interpretations alongside raw data.
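For reference, here's how that schema maps onto the record payload the rest of the pipeline passes around. A minimal sketch; the TypedDict itself isn't part of the build, just a way to document the field types in code:

from typing import List, TypedDict

# Field names match the Airtable columns exactly (functional TypedDict syntax,
# since several names contain spaces)
PaperRecord = TypedDict('PaperRecord', {
    'Title': str,                  # Single line text
    'Authors': List[str],          # Multiple select
    'Abstract': str,               # Long text
    'PDF URL': str,                # URL
    'arXiv ID': str,               # Single line text
    'Publication Date': str,       # Date (ISO YYYY-MM-DD)
    'Research Area': List[str],    # Multiple select
    'Relevance Score': int,        # Number 1-10
    'Key Insights': str,           # Long text
    'Practical Implications': str, # Long text
    'Status': str,                 # Select: New / Analyzed / Archived
    'Tags': List[str]              # Multiple select
})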
Step 2: Apify Web Scraper Configuration
Built a custom actor to monitor research sources:
# Simplified scraper logic
import requests
import xml.etree.ElementTree as ET

ATOM = {'atom': 'http://www.w3.org/2005/Atom'}

def scrape_arxiv_cs_ai():
    """Scrape recent cs.AI / cs.LG / cs.CL papers from the arXiv API"""
    query_params = {
        'search_query': 'cat:cs.AI OR cat:cs.LG OR cat:cs.CL',
        'start': 0,
        'max_results': 100,
        'sortBy': 'submittedDate',
        'sortOrder': 'descending'
    }
    response = requests.get('http://export.arxiv.org/api/query', params=query_params)
    response.raise_for_status()
    # Parse the Atom XML response and extract paper metadata
    root = ET.fromstring(response.text)
    parsed_papers = []
    for entry in root.findall('atom:entry', ATOM):
        arxiv_id = entry.find('atom:id', ATOM).text.rsplit('/', 1)[-1]
        parsed_papers.append({
            'title': entry.find('atom:title', ATOM).text.strip(),
            'abstract': entry.find('atom:summary', ATOM).text.strip(),
            'authors': [a.find('atom:name', ATOM).text for a in entry.findall('atom:author', ATOM)],
            'arxiv_id': arxiv_id,
            'pdf_url': f'https://arxiv.org/pdf/{arxiv_id}',
            'pub_date': entry.find('atom:published', ATOM).text[:10]
        })
    return parsed_papers
Scheduling: Runs daily at 8 AM to catch new submissions
Step 3: GPT-4 Analysis Pipeline
The most critical component is getting consistent, useful analysis from GPT-4:
import json
import openai

def analyze_paper(title, abstract, authors):
    """Analyze paper relevance and extract insights"""
    prompt = f"""
Analyze this AI research paper for practical implications:

Title: {title}
Authors: {authors}
Abstract: {abstract}

Provide:
1. Relevance Score (1-10) for AI practitioners focused on agents/automation
2. Key Technical Insights (2-3 bullet points)
3. Practical Implications (how could this be applied?)
4. Research Area Tags (choose from: LLM, Agents, Scaling, Training, etc.)

Format as a JSON object with the keys relevance_score, insights, implications,
and tags for easy parsing.
"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1  # Low temperature for consistent analysis
    )
    return json.loads(response.choices[0].message.content)
Prompt Engineering Notes:
- Specific scoring criteria prevent score inflation
- JSON format enables automated data entry (see the parsing sketch below)
- Low temperature improves consistency
- Examples in prompt improve analysis quality
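GPT-4 doesn't always return perfectly clean JSON (it occasionally wraps the object in code fences or adds surrounding prose), so a thin validation layer between analyze_paper and store_paper_analysis helps. A minimal sketch; the clamping, defaults, and helper names are my assumptions rather than part of the original build:

import json

def _as_text(value):
    # Long-text Airtable fields need strings; the model sometimes returns lists of bullets
    return '\n'.join(value) if isinstance(value, list) else value

def parse_analysis(raw_content):
    """Parse and sanity-check the model's JSON analysis (hypothetical helper)."""
    text = raw_content.strip()
    if not text.startswith('{'):
        # Tolerate code fences or prose around the JSON object
        text = text[text.find('{'):text.rfind('}') + 1]
    data = json.loads(text)
    return {
        'relevance_score': max(1, min(10, int(data.get('relevance_score', 1)))),
        'insights': _as_text(data.get('insights', '')),
        'implications': _as_text(data.get('implications', '')),
        'tags': data.get('tags', [])
    }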
Step 4: Airtable Integration
Used Airtable’s API to store analyzed papers:
def store_paper_analysis(paper_data, analysis):
    """Store paper and analysis in Airtable"""
    record = {
        'Title': paper_data['title'],
        'Authors': paper_data['authors'],
        'Abstract': paper_data['abstract'],
        'PDF URL': paper_data['pdf_url'],
        'arXiv ID': paper_data['arxiv_id'],
        'Publication Date': paper_data['pub_date'],
        'Relevance Score': analysis['relevance_score'],
        'Key Insights': analysis['insights'],
        'Practical Implications': analysis['implications'],
        'Tags': analysis['tags'],
        'Status': 'New'
    }
    airtable_client.create(record)
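The function above assumes an airtable_client that's already configured, which the log doesn't show. One minimal option is to skip a client library and call Airtable's REST API directly with requests; the class name, environment variables, and typecast flag below are illustrative choices on my part:

import os
import requests

AIRTABLE_API_KEY = os.environ['AIRTABLE_API_KEY']
AIRTABLE_BASE_ID = os.environ['AIRTABLE_BASE_ID']

class SimpleAirtableClient:
    """Tiny stand-in for airtable_client built on Airtable's REST API."""

    def __init__(self, base_id, table_name='Papers'):
        self.url = f'https://api.airtable.com/v0/{base_id}/{table_name}'
        self.headers = {'Authorization': f'Bearer {AIRTABLE_API_KEY}'}

    def create(self, fields):
        # typecast=True lets Airtable create new select options (e.g. Tags) on the fly
        payload = {'records': [{'fields': fields}], 'typecast': True}
        response = requests.post(self.url, headers=self.headers, json=payload)
        response.raise_for_status()
        return response.json()

airtable_client = SimpleAirtableClient(AIRTABLE_BASE_ID)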
Step 5: Cloud Function Orchestration
A Google Cloud Function orchestrates the entire pipeline:
def research_assistant_pipeline(request):
    """Main pipeline function"""
    # 1. Trigger Apify scraper
    new_papers = apify_client.call_actor('my-arxiv-scraper')

    # 2. Filter for unprocessed papers
    unprocessed = filter_new_papers(new_papers)

    # 3. Analyze each paper with GPT-4
    for paper in unprocessed:
        analysis = analyze_paper(paper['title'], paper['abstract'], paper['authors'])

        # 4. Store in Airtable if relevance score >= 6
        if analysis['relevance_score'] >= 6:
            store_paper_analysis(paper, analysis)

    # 5. Generate weekly summary
    if is_friday():
        generate_weekly_summary()

    return {'status': 'success', 'papers_processed': len(unprocessed)}
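filter_new_papers, is_friday, and generate_weekly_summary are helpers the log doesn't show. Here's roughly what the first two might look like, reusing the SimpleAirtableClient sketch from Step 4 and deduplicating on arXiv ID (the single-page fetch is a simplification; a real version would follow Airtable's pagination offsets):

import requests
from datetime import datetime

def fetch_existing_arxiv_ids():
    """Return arXiv IDs already stored in the Papers table (first page only)."""
    response = requests.get(
        airtable_client.url,
        headers=airtable_client.headers,
        params={'fields[]': 'arXiv ID'}
    )
    response.raise_for_status()
    return {r['fields'].get('arXiv ID') for r in response.json()['records']}

def filter_new_papers(papers):
    """Drop papers whose arXiv ID has already been analyzed."""
    existing = fetch_existing_arxiv_ids()
    return [p for p in papers if p['arxiv_id'] not in existing]

def is_friday():
    return datetime.now().weekday() == 4  # Monday is 0, so Friday is 4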
Results & Metrics
After 2 weeks of operation:
Performance:
- Papers Processed: 347 total papers analyzed
- High-Relevance Papers: 23 papers scored 8+ (6.6% pass rate)
- Processing Time: ~30 seconds per paper
- Cost: $12.50 in OpenAI API calls (see the quick math below)
- Accuracy: 85% of high-scored papers were genuinely relevant (manual verification)
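For a rough sense of how that spend scales, a back-of-the-envelope projection using the numbers above (not a measured result):

papers_analyzed = 347
openai_cost_usd = 12.50
cost_per_paper = openai_cost_usd / papers_analyzed    # ~ $0.036 per paper

# Hypothetical projection at higher volume (e.g. 100 papers analyzed per day)
projected_monthly = cost_per_paper * 100 * 30         # ~ $108 per month
print(f"${cost_per_paper:.3f}/paper, ~${projected_monthly:.0f}/month at 100 papers/day")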
Unexpected Insights:
- GPT-4 is consistently conservative with scoring (good thing)
- Abstract quality varies dramatically across papers
- Certain research groups produce consistently high-quality work
- Weekend submissions tend to be lower quality
Lessons Learned
What Worked Well:
- JSON-formatted prompts made data integration seamless
- Conservative relevance scoring reduced noise significantly
- Airtable’s rich field types were a perfect fit for research data
- Daily processing kept the system manageable
What I’d Do Differently:
- Add semantic search across stored papers for better discovery
- Include citation analysis to identify important papers early
- Build feedback loop to improve GPT-4 analysis over time
- Add automated social sharing for high-relevance papers
Scaling Challenges:
- API rate limits become an issue with more papers
- GPT-4 costs could be significant at larger scale
- Storage costs in Airtable add up with thousands of papers
- Analysis consistency degrades with batch processing
Next Iterations
Phase 2 Features:
- Citation Network Analysis - Track paper influence and connections
- Researcher Following - Monitor specific authors automatically
- Conference Paper Integration - Expand beyond arXiv to venue papers
- Smart Alerts - Notify when papers match specific interests
Technical Improvements:
- Vector embeddings for better semantic matching (sketch below)
- Fine-tuned model for research analysis (cheaper than GPT-4)
- Incremental processing to handle larger volumes
- Quality feedback loops based on manual relevance ratings
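None of this is built yet, but to make the vector-embedding idea concrete, here's a minimal sketch using the same openai library as the analysis step; the embedding model and the idea of querying stored abstracts directly are assumptions, not part of the current system:

import numpy as np
import openai

def embed_texts(texts):
    """Embed a list of abstracts with OpenAI's embedding endpoint."""
    response = openai.Embedding.create(model='text-embedding-ada-002', input=texts)
    return np.array([item['embedding'] for item in response['data']])

def most_similar(query, abstracts, top_k=5):
    """Return indices of stored abstracts most similar to a free-text query."""
    vectors = embed_texts(abstracts + [query])
    corpus, q = vectors[:-1], vectors[-1]
    scores = corpus @ q / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(q))
    return np.argsort(scores)[::-1][:top_k].tolist()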
Code Repository
Full implementation available at: GitHub - AI Research Assistant
Includes:
- Apify scraper configuration
- Cloud Function deployment scripts
- Airtable schema definitions
- GPT-4 prompt templates
- Cost optimization strategies
Next Build Log: Setting up automated CRM → BigQuery pipelines with Cloud Functions and error handling.