Add project files:

- Add database initialization scripts
- Add configuration files
- Add documentation
- Add public assets
- Add source code structure
- Update README
eligrinfeld 2025-01-04 17:22:46 -07:00
parent 372943801d
commit fde5b5e318
39 changed files with 10099 additions and 187 deletions

.env.example Normal file

@@ -0,0 +1,11 @@
PORT=3000
NODE_ENV=development
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_key
OLLAMA_URL=http://localhost:11434
OLLAMA_MODEL=llama2
SEARXNG_URL=http://localhost:4000
SEARXNG_INSTANCES=["http://localhost:4000"]
MAX_RESULTS_PER_QUERY=50
CACHE_DURATION_HOURS=24
CACHE_DURATION_DAYS=7

README.md

@@ -1,178 +1,120 @@
# BizSearch

A tool for finding and analyzing local businesses using AI-powered data extraction.

## Prerequisites

- Node.js 16+
- Ollama (for local LLM)
- SearxNG instance

## Installation

1. Install Ollama:

```bash
# On macOS
brew install ollama
```

2. Start Ollama:

```bash
# Start and enable on login
brew services start ollama

# Or run without auto-start
/usr/local/opt/ollama/bin/ollama serve
```

3. Pull the required model:

```bash
ollama pull mistral
```

4. Clone and set up the project:

```bash
git clone https://github.com/yourusername/bizsearch.git
cd bizsearch
npm install
```

5. Configure environment:

```bash
cp .env.example .env
# Edit .env with your settings
```

6. Start the application:

```bash
npm run dev
```

7. Open http://localhost:3000 in your browser
## Troubleshooting

If Ollama fails to start:

```bash
# Stop any existing instance
brew services stop ollama
# Wait a few seconds
sleep 5
# Start again
brew services start ollama
```

To verify Ollama is running:

```bash
curl http://localhost:11434/api/version
```

## Features

- Business search with location filtering
- Contact information extraction
- AI-powered data validation
- Clean, user-friendly interface
- Service health monitoring

## Configuration

Key environment variables:

- `SEARXNG_URL`: Your SearxNG instance URL
- `OLLAMA_URL`: Ollama API endpoint (default: http://localhost:11434)
- `SUPABASE_URL`: Your Supabase project URL
- `SUPABASE_ANON_KEY`: Your Supabase anonymous key
- `CACHE_DURATION_DAYS`: How long to cache results (default: 7)

## Supabase Setup

1. Create a new Supabase project
2. Run the SQL commands in `db/init.sql` to create the cache table
3. Copy your project URL and anon key to `.env`
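
If you prefer the command line to the Supabase dashboard's SQL editor, the same script can be applied with `psql` (a sketch; substitute your own project's connection string):

```bash
psql "postgresql://postgres:<password>@db.<project-ref>.supabase.co:5432/postgres" -f db/init.sql
```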
## License

MIT

## Cache Management

The application uses Supabase for caching search results. Cache entries expire after 7 days.

### Manual Cache Cleanup

If automatic cleanup is not available, you can manually clean up expired entries:

1. Using the API:

```bash
curl -X POST http://localhost:3000/api/cleanup
```

2. Using SQL:

```sql
select manual_cleanup();
```

### Cache Statistics

View cache statistics using:

```sql
select * from cache_stats;
```

db/init.sql Normal file

@@ -0,0 +1,171 @@
-- Enable required extensions
create extension if not exists "uuid-ossp"; -- For UUID generation
create extension if not exists pg_cron; -- For scheduled jobs
-- Create the search_cache table
create table public.search_cache (
id uuid default uuid_generate_v4() primary key,
query text not null,
results jsonb not null,
location text not null,
category text not null,
created_at timestamp with time zone default timezone('utc'::text, now()) not null,
updated_at timestamp with time zone default timezone('utc'::text, now()) not null,
expires_at timestamp with time zone default timezone('utc'::text, now() + interval '7 days') not null
);
-- Create indexes
create index search_cache_query_idx on public.search_cache (query);
create index search_cache_location_category_idx on public.search_cache (location, category);
create index search_cache_expires_at_idx on public.search_cache (expires_at);
-- Enable RLS
alter table public.search_cache enable row level security;
-- Create policies
create policy "Allow public read access"
on public.search_cache for select
using (true);
create policy "Allow service write access"
on public.search_cache for insert
with check (true);
create policy "Allow service update access"
on public.search_cache for update
using (true)
with check (true);
create policy "Allow delete expired records"
on public.search_cache for delete
using (expires_at < now());
-- Create function to clean up expired records
create or replace function cleanup_expired_cache()
returns void
language plpgsql
security definer
as $$
begin
delete from public.search_cache
where expires_at < now();
end;
$$;
-- Create a manual cleanup function since pg_cron might not be available
create or replace function manual_cleanup()
returns void
language plpgsql
security definer
as $$
begin
delete from public.search_cache
where expires_at < now();
end;
$$;
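-- Illustrative only: where the pg_cron extension enabled above is actually
-- available, the cleanup could be scheduled instead of run manually, e.g. hourly:
-- select cron.schedule('0 * * * *', $$select cleanup_expired_cache()$$);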
-- Create a view for cache statistics
create or replace view cache_stats as
select
count(*) as total_entries,
count(*) filter (where expires_at < now()) as expired_entries,
count(*) filter (where expires_at >= now()) as valid_entries,
min(created_at) as oldest_entry,
max(created_at) as newest_entry,
count(distinct category) as unique_categories,
count(distinct location) as unique_locations
from public.search_cache;
-- Grant permissions to access the view
grant select on cache_stats to postgres;
-- Create table if not exists businesses
create table if not exists businesses (
id text primary key,
name text not null,
phone text,
email text,
address text,
rating numeric,
website text,
logo text,
source text,
description text,
latitude numeric,
longitude numeric,
last_updated timestamp with time zone default timezone('utc'::text, now()),
search_count integer default 1,
created_at timestamp with time zone default timezone('utc'::text, now())
);
-- Create indexes for common queries
create index if not exists businesses_name_idx on businesses (name);
create index if not exists businesses_rating_idx on businesses (rating desc);
create index if not exists businesses_search_count_idx on businesses (search_count desc);
create index if not exists businesses_last_updated_idx on businesses (last_updated desc);
-- Create tables if they don't exist
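-- NOTE: `businesses` already exists from the definition above, so on a fresh
-- database the CREATE TABLE IF NOT EXISTS below is a no-op; of its extra
-- columns, only place_id is added later (see the ALTER TABLE at the end).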
CREATE TABLE IF NOT EXISTS businesses (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
phone TEXT,
email TEXT,
address TEXT,
rating INTEGER,
website TEXT,
logo TEXT,
source TEXT,
description TEXT,
location JSONB,
place_id TEXT,
photos TEXT[],
opening_hours TEXT[],
distance JSONB,
last_updated TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
search_count INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS searches (
id SERIAL PRIMARY KEY,
query TEXT NOT NULL,
location TEXT NOT NULL,
timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
results_count INTEGER
);
CREATE TABLE IF NOT EXISTS cache (
key TEXT PRIMARY KEY,
value JSONB NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
expires_at TIMESTAMP WITH TIME ZONE NOT NULL
);
-- Create indexes
CREATE INDEX IF NOT EXISTS idx_businesses_location ON businesses USING GIN (location);
CREATE INDEX IF NOT EXISTS idx_businesses_search ON businesses USING GIN (to_tsvector('english', name || ' ' || COALESCE(description, '')));
CREATE INDEX IF NOT EXISTS idx_cache_expires ON cache (expires_at);
-- Set up RLS (Row Level Security)
ALTER TABLE businesses ENABLE ROW LEVEL SECURITY;
ALTER TABLE searches ENABLE ROW LEVEL SECURITY;
ALTER TABLE cache ENABLE ROW LEVEL SECURITY;
-- Create policies
CREATE POLICY "Allow anonymous select" ON businesses FOR SELECT USING (true);
CREATE POLICY "Allow service role insert" ON businesses FOR INSERT WITH CHECK (true);
CREATE POLICY "Allow service role update" ON businesses FOR UPDATE USING (true);
CREATE POLICY "Allow anonymous select" ON searches FOR SELECT USING (true);
CREATE POLICY "Allow service role insert" ON searches FOR INSERT WITH CHECK (true);
CREATE POLICY "Allow anonymous select" ON cache FOR SELECT USING (true);
CREATE POLICY "Allow service role all" ON cache USING (true);
-- Add place_id column to businesses table if it doesn't exist
ALTER TABLE businesses ADD COLUMN IF NOT EXISTS place_id TEXT;
CREATE INDEX IF NOT EXISTS idx_businesses_place_id ON businesses(place_id);
-- Create a unique constraint on place_id (excluding nulls)
CREATE UNIQUE INDEX IF NOT EXISTS idx_businesses_place_id_unique
ON businesses(place_id)
WHERE place_id IS NOT NULL;

db/schema.sql Normal file

@@ -0,0 +1,44 @@
-- Enable trigram matching, required by the gin_trgm_ops indexes below
create extension if not exists pg_trgm;
-- Create the businesses table
create table businesses (
id uuid primary key,
name text not null,
phone text,
address text,
city text,
state text,
zip text,
category text[],
rating numeric,
review_count integer,
license text,
services text[],
hours jsonb,
website text,
email text,
verified boolean default false,
last_updated timestamp with time zone,
search_query text,
search_location text,
search_timestamp timestamp with time zone,
reliability_score integer,
-- Create a composite index for deduplication
constraint unique_business unique (phone, address)
);
-- Create indexes for common queries
create index idx_business_location on businesses (city, state);
create index idx_business_category on businesses using gin (category);
create index idx_search_query on businesses using gin (search_query gin_trgm_ops);
create index idx_search_location on businesses using gin (search_location gin_trgm_ops);
create index idx_reliability on businesses (reliability_score);
-- Enable full text search
alter table businesses add column search_vector tsvector
generated always as (
setweight(to_tsvector('english', coalesce(name, '')), 'A') ||
setweight(to_tsvector('english', coalesce(search_query, '')), 'B') ||
setweight(to_tsvector('english', coalesce(search_location, '')), 'C')
) stored;
create index idx_business_search on businesses using gin(search_vector);
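
For reference, a lookup against the generated `search_vector` column could look like this (the query terms are only an example):

```sql
select name, ts_rank(search_vector, q) as rank
from businesses, websearch_to_tsquery('english', 'roofing contractor denver') q
where search_vector @@ q
order by rank desc
limit 10;
```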

db/verify.sql Normal file

@@ -0,0 +1,15 @@
-- Check if table exists
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'public'
AND table_name = 'businesses'
);
-- Check table structure
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name = 'businesses';
-- Check row count
SELECT count(*) FROM businesses;

docker-compose.yml Normal file

@@ -0,0 +1,26 @@
version: '3'
services:
searxng:
image: searxng/searxng
ports:
- "4000:8080"
volumes:
- ./searxng:/etc/searxng
environment:
- INSTANCE_NAME=perplexica-searxng
- BASE_URL=http://localhost:4000/
- SEARXNG_SECRET=your_secret_key_here
restart: unless-stopped
app:
build:
context: .
dockerfile: backend.dockerfile
ports:
- "3000:3000"
environment:
- SEARXNG_URL=http://searxng:8080
volumes:
- ./config.toml:/home/perplexica/config.toml
depends_on:
- searxng

docs/ETHICAL_SCRAPING.md Normal file

@@ -0,0 +1,108 @@
# Ethical Web Scraping Guidelines
## Core Principles
1. **Respect Robots.txt**
- Always check and honor robots.txt directives
- Cache robots.txt to reduce server load
- Default to conservative behavior when uncertain
2. **Proper Identification**
- Use clear, identifiable User-Agent strings
- Provide contact information
- Be transparent about your purpose
3. **Rate Limiting**
- Implement conservative rate limits
- Use exponential backoff for errors
- Distribute requests over time
4. **Data Usage**
- Only collect publicly available business information
- Respect privacy and data protection laws
- Provide clear opt-out mechanisms
- Keep data accurate and up-to-date
5. **Technical Considerations**
- Cache results to minimize requests
- Handle errors gracefully
- Monitor and log access patterns
- Use structured data when available
## Implementation
1. **Request Headers**
```typescript
const headers = {
'User-Agent': 'BizSearch/1.0 (+https://bizsearch.com/about)',
'Accept': 'text/html,application/xhtml+xml',
'From': 'contact@bizsearch.com'
};
```
2. **Rate Limiting**
```typescript
const rateLimits = {
requestsPerMinute: 10,
requestsPerHour: 100,
requestsPerDomain: 20
};
```
3. **Caching**
```typescript
const cacheSettings = {
ttl: 24 * 60 * 60, // 24 hours
maxSize: 1000 // entries
};
```
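4. **Robots.txt Check** (a sketch using the `robots-parser` package added in this commit; the per-host caching shown is illustrative)
```typescript
import axios from 'axios';
import robotsParser from 'robots-parser';

const robotsCache = new Map<string, ReturnType<typeof robotsParser>>();

async function isAllowed(url: string, userAgent: string): Promise<boolean> {
  const { protocol, host } = new URL(url);
  const robotsUrl = `${protocol}//${host}/robots.txt`;
  // Cache robots.txt per host to reduce server load (principle 1)
  let parser = robotsCache.get(host);
  if (!parser) {
    const res = await axios.get(robotsUrl);
    parser = robotsParser(robotsUrl, res.data);
    robotsCache.set(host, parser);
  }
  // Default to conservative behavior when the answer is undefined
  return parser.isAllowed(url, userAgent) ?? false;
}
```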
## Opt-Out Process
1. Business owners can opt-out by:
- Submitting a form on our website
- Emailing opt-out@bizsearch.com
- Adding a meta tag: `<meta name="bizsearch" content="noindex">`
2. We honor opt-outs within:
- 24 hours for direct requests
- 72 hours for cached data
## Legal Compliance
1. **Data Protection**
- GDPR compliance for EU businesses
- CCPA compliance for California businesses
- Regular data audits and cleanup
2. **Attribution**
- Clear source attribution
- Last-updated timestamps
- Data accuracy disclaimers
## Best Practices
1. **Before Scraping**
- Check robots.txt
- Verify site status
- Review terms of service
- Look for API alternatives
2. **During Scraping**
- Monitor response codes
- Respect server hints
- Implement backoff strategies
- Log access patterns
3. **After Scraping**
- Verify data accuracy
- Update cache entries
- Clean up old data
- Monitor opt-out requests
## Contact
For questions or concerns about our scraping practices:
- Email: ethics@bizsearch.com
- Phone: (555) 123-4567
- Web: https://bizsearch.com/ethics

package-lock.json generated Normal file

File diff suppressed because it is too large

package.json

@@ -9,7 +9,9 @@
     "dev": "nodemon --ignore uploads/ src/app.ts ",
     "db:push": "drizzle-kit push sqlite",
     "format": "prettier . --check",
-    "format:write": "prettier . --write"
+    "format:write": "prettier . --write",
+    "test:search": "ts-node src/tests/testSearch.ts",
+    "test:supabase": "ts-node src/tests/supabaseTest.ts"
   },
   "devDependencies": {
     "@types/better-sqlite3": "^7.6.10",
@@ -30,15 +32,17 @@
     "@iarna/toml": "^2.2.5",
     "@langchain/anthropic": "^0.2.3",
     "@langchain/community": "^0.2.16",
-    "@langchain/openai": "^0.0.25",
+    "@langchain/google-genai": "^0.0.23",
+    "@langchain/openai": "^0.0.25",
+    "@supabase/supabase-js": "^2.47.10",
     "@xenova/transformers": "^2.17.1",
     "axios": "^1.6.8",
-    "better-sqlite3": "^11.0.0",
+    "better-sqlite3": "^11.7.0",
     "cheerio": "^1.0.0",
     "compute-cosine-similarity": "^1.1.0",
     "compute-dot": "^1.1.0",
     "cors": "^2.8.5",
-    "dotenv": "^16.4.5",
+    "dotenv": "^16.4.7",
     "drizzle-orm": "^0.31.2",
     "express": "^4.19.2",
     "html-to-text": "^9.0.5",
@@ -46,6 +50,8 @@
     "mammoth": "^1.8.0",
     "multer": "^1.4.5-lts.1",
     "pdf-parse": "^1.1.1",
+    "robots-parser": "^3.0.1",
+    "tesseract.js": "^4.1.4",
     "winston": "^3.13.0",
     "ws": "^8.17.1",
     "zod": "^3.22.4"

public/index.html Normal file

@@ -0,0 +1,558 @@
@ -0,0 +1,558 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>OffMarket Pro - Find Off-Market Property Services</title>
<style>
:root {
--primary-color: #2c3e50;
--secondary-color: #3498db;
--accent-color: #e74c3c;
--background-color: #f8f9fa;
--text-color: #2c3e50;
--border-radius: 8px;
--card-shadow: 0 2px 4px rgba(0,0,0,0.1);
}
body {
font-family: 'Segoe UI', system-ui, -apple-system, sans-serif;
margin: 0;
padding: 0;
background: var(--background-color);
color: var(--text-color);
}
.header {
background: white;
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
padding: 1rem;
}
.logo {
font-size: 1.8rem;
font-weight: bold;
color: var(--primary-color);
text-decoration: none;
}
.search-container {
max-width: 1200px;
margin: 3rem auto;
padding: 2rem;
text-align: center;
}
.search-box {
display: flex;
gap: 1rem;
max-width: 800px;
margin: 2rem auto;
}
.search-input {
flex: 1;
padding: 1rem;
border: 2px solid #ddd;
border-radius: var(--border-radius);
font-size: 1rem;
}
.search-button {
padding: 1rem 2rem;
background: var(--secondary-color);
color: white;
border: none;
border-radius: var(--border-radius);
cursor: pointer;
font-size: 1rem;
transition: background 0.2s;
}
.search-button:hover {
background: #2980b9;
}
.categories-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
gap: 1.5rem;
margin: 2rem auto;
max-width: 1200px;
padding: 0 1rem;
}
.category-card {
background: white;
border-radius: var(--border-radius);
padding: 1.5rem;
box-shadow: var(--card-shadow);
transition: transform 0.2s;
cursor: pointer;
}
.category-card:hover {
transform: translateY(-2px);
}
.category-icon {
font-size: 2rem;
margin-bottom: 1rem;
}
.subcategories {
margin-top: 1rem;
font-size: 0.9rem;
}
.subcategory {
color: var(--secondary-color);
cursor: pointer;
margin: 0.25rem 0;
}
.subcategory:hover {
text-decoration: underline;
}
.results-container {
max-width: 1200px;
margin: 2rem auto;
padding: 0 1rem;
}
.business-card {
background: white;
padding: 1.5rem;
margin-bottom: 1rem;
border-radius: var(--border-radius);
box-shadow: 0 2px 4px rgba(0,0,0,0.05);
display: grid;
grid-template-columns: auto 1fr auto;
gap: 1.5rem;
align-items: start;
}
.business-logo {
width: 80px;
height: 80px;
object-fit: cover;
border-radius: var(--border-radius);
}
.business-info h3 {
margin: 0 0 0.5rem 0;
color: var(--primary-color);
}
.business-contact {
text-align: right;
}
.rating-stars {
color: #f1c40f;
margin-bottom: 0.5rem;
}
@media (max-width: 768px) {
.search-box {
flex-direction: column;
}
.business-card {
grid-template-columns: 1fr;
text-align: center;
}
.business-contact {
text-align: center;
}
.business-logo {
margin: 0 auto;
}
}
.loading {
text-align: center;
padding: 2rem;
color: var(--text-color);
}
.error-message {
background: #fee;
border: 1px solid #fcc;
padding: 1rem;
border-radius: var(--border-radius);
text-align: center;
}
.no-results {
background: #f8f9fa;
padding: 2rem;
border-radius: var(--border-radius);
text-align: center;
}
.no-results ul {
text-align: left;
display: inline-block;
margin: 1rem auto;
}
.results-table {
width: 100%;
border-collapse: collapse;
margin-top: 2rem;
background: white;
box-shadow: var(--card-shadow);
border-radius: var(--border-radius);
overflow: hidden;
}
.results-table th {
background: #f8f9fa;
padding: 1rem;
text-align: left;
font-weight: 600;
color: var(--primary-color);
border-bottom: 2px solid #eee;
}
.results-table td {
padding: 1rem;
border-bottom: 1px solid #eee;
vertical-align: top;
}
.business-icon {
width: 50px;
height: 50px;
display: flex;
align-items: center;
justify-content: center;
background: #f0f0f0;
border-radius: var(--border-radius);
font-size: 1.5rem;
color: var(--primary-color);
}
.business-info {
display: flex;
gap: 1rem;
align-items: start;
}
.business-details h3 {
margin: 0 0 0.5rem 0;
color: var(--primary-color);
}
.business-meta {
font-size: 0.9rem;
color: #666;
}
.rating {
display: flex;
align-items: center;
gap: 0.25rem;
color: #f39c12;
}
.contact-info {
text-align: right;
white-space: nowrap;
}
.phone {
font-weight: 600;
color: var(--primary-color);
margin-bottom: 0.25rem;
}
.address {
color: #666;
font-size: 0.9rem;
}
.action-buttons {
display: flex;
gap: 0.5rem;
justify-content: flex-end;
}
.action-button {
padding: 0.5rem 1rem;
border: none;
border-radius: var(--border-radius);
cursor: pointer;
font-size: 0.9rem;
}
.primary-button {
background: var(--secondary-color);
color: white;
}
.secondary-button {
background: #eee;
color: var(--text-color);
}
</style>
</head>
<body>
<header class="header">
<a href="/" class="logo">OffMarket Pro</a>
</header>
<div class="search-container">
<h1>Find Off-Market Property Services</h1>
<div class="search-box">
<input type="text" id="searchQuery" class="search-input" placeholder="What service are you looking for?">
<input type="text" id="searchLocation" class="search-input" placeholder="Location">
<button class="search-button" onclick="performSearch()">Search</button>
</div>
</div>
<div class="categories-grid">
<!-- Categories will be dynamically inserted here -->
</div>
<div class="container">
<table class="results-table">
<thead>
<tr>
<th style="width: 50%">Business</th>
<th style="width: 30%">Contact</th>
<th style="width: 20%">Actions</th>
</tr>
</thead>
<tbody id="resultsBody">
<!-- Results will be populated here -->
</tbody>
</table>
<div id="searchProgress" class="search-progress"></div>
</div>
<script>
// Load categories
fetch('/api/categories')
.then(response => response.json())
.then(categories => {
const grid = document.querySelector('.categories-grid');
grid.innerHTML = categories.map(category => `
<div class="category-card" onclick="searchCategory('${category.name}')">
<div class="category-icon">${category.icon}</div>
<h3>${category.name}</h3>
<div class="subcategories">
${category.subcategories.map(sub =>
`<div class="subcategory" onclick="event.stopPropagation(); searchSubcategory('${sub.name}')">${sub.name}</div>`
).join('')}
</div>
</div>
`).join('');
});
async function performSearch() {
const query = document.getElementById('searchQuery').value;
const location = document.getElementById('searchLocation').value;
if (!query || !location) {
alert('Please enter both search query and location');
return;
}
await doSearch(query, location);
}
function searchCategory(category) {
const location = document.getElementById('searchLocation').value;
if (!location) {
alert('Please enter a location first');
return;
}
document.getElementById('searchQuery').value = category;
performSearch();
}
// Add searchSubcategory function
function searchSubcategory(subcategory) {
const location = document.getElementById('searchLocation').value;
if (!location) {
alert('Please enter a location first');
return;
}
document.getElementById('searchQuery').value = subcategory;
performSearch();
}
// Update doSearch function
async function doSearch(query, location) {
const searchTerm = `${query} in ${location}`;
const resultsBody = document.getElementById('resultsBody');
const progressDiv = document.getElementById('searchProgress');
try {
resultsBody.innerHTML = `
<tr>
<td colspan="3" class="loading">
<p>Searching for ${query} in ${location}...</p>
</td>
</tr>
`;
const response = await fetch(`/api/search?q=${encodeURIComponent(searchTerm)}`);
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let allResults = new Set(); // Use Set to avoid duplicates
while (true) {
const { value, done } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// Process complete chunks
const chunks = buffer.split('\n');
buffer = chunks.pop() || ''; // Keep the incomplete chunk
for (const chunk of chunks) {
if (!chunk.trim()) continue;
try {
const data = JSON.parse(chunk);
console.log('Received chunk:', data);
if (data.source === 'database' || (data.source === 'search' && data.results)) {
// Add new results to our set
data.results.forEach(result => {
allResults.add(JSON.stringify(result)); // Convert to string for Set storage
});
// Display all current results
displayResults(Array.from(allResults).map(str => JSON.parse(str)));
} else if (data.status && data.progress) {
// Update progress
progressDiv.innerHTML = `
<p>${data.status} (${data.progress}% complete)</p>
`;
}
} catch (e) {
console.error('Error parsing chunk:', e);
}
}
}
// Clear progress when done
progressDiv.innerHTML = '';
} catch (error) {
console.error('Search error:', error);
resultsBody.innerHTML = `
<tr>
<td colspan="3" class="error-message">
<h3>Search Error</h3>
<p>Sorry, we encountered an error while searching. Please try again.</p>
<p>Error details: ${error.message}</p>
<button onclick="performSearch()" class="search-button">Try Again</button>
</td>
</tr>
`;
}
}
function displayResults(businesses) {
const resultsBody = document.getElementById('resultsBody');
if (!businesses || businesses.length === 0) {
resultsBody.innerHTML = `
<tr>
<td colspan="3" style="text-align: center; padding: 2rem;">
<h3>No Results Found</h3>
<p>We couldn't find any businesses matching your search.</p>
</td>
</tr>
`;
return;
}
resultsBody.innerHTML = businesses.map(business => {
const icon = getBusinessIcon(business.name);
const rating = business.rating ? (business.rating / 20).toFixed(1) : 0; // Convert to 5-star scale
return `
<tr>
<td>
<div class="business-info">
<div class="business-icon">${icon}</div>
<div class="business-details">
<h3>${business.name}</h3>
<div class="business-meta">
<div class="rating">
${getRatingStars(rating)}
<span>(${rating})</span>
</div>
<div class="description">${business.description || ''}</div>
</div>
</div>
</div>
</td>
<td>
<div class="contact-info">
<div class="phone">${business.phone || 'No phone available'}</div>
<div class="address">${business.address || 'Address not available'}</div>
</div>
</td>
<td>
<div class="action-buttons">
${business.website ?
`<a href="${business.website}" target="_blank" class="action-button primary-button">Visit Website</a>` :
'<button class="action-button secondary-button" disabled>No Website</button>'
}
<button onclick="contactBusiness('${business.id}')" class="action-button secondary-button">Contact</button>
</div>
</td>
</tr>
`;
}).join('');
}
// Helper function to get business icon
function getBusinessIcon(businessName) {
// Map of business types to icons
const icons = {
'real estate': '🏢',
'legal': '⚖️',
'financial': '💰',
'contractor': '🔨',
'property': '🏠',
'marketing': '📢',
'tech': '💻',
'default': '🏢'
};
// Determine business type from name
const businessType = Object.keys(icons).find(type =>
businessName.toLowerCase().includes(type)
) || 'default';
return icons[businessType];
}
// Helper function to generate rating stars
function getRatingStars(rating) {
const fullStars = Math.floor(rating);
const hasHalfStar = rating % 1 >= 0.5;
const emptyStars = 5 - fullStars - (hasHalfStar ? 1 : 0);
return `
${'★'.repeat(fullStars)}
${hasHalfStar ? '½' : ''}
${'☆'.repeat(emptyStars)}
`;
}
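// The Contact buttons above call contactBusiness(), which is not defined
// anywhere on this page; a minimal placeholder keeps the click from throwing.
function contactBusiness(businessId) {
alert('Contact details coming soon for business ' + businessId);
}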
</script>
</body>
</html>

src/app.ts

@@ -1,38 +1,48 @@
import express from 'express';
import cors from 'cors';
import path from 'path';
import './config/env'; // Load environment variables first
import apiRoutes from './routes/api';
import { HealthCheckService } from './lib/services/healthCheck';

const app = express();
const port = process.env.PORT || 3000;

// Middleware
app.use(cors());
app.use(express.json());

// API routes first
app.use('/api', apiRoutes);

// Then static files
app.use(express.static(path.join(__dirname, '../public')));

// Finally, catch-all route for SPA
app.get('*', (req, res) => {
res.sendFile(path.join(__dirname, '../public/index.html'));
});

// Start server with health checks
async function startServer() {
console.log('\n🔍 Checking required services...');
const ollamaStatus = await HealthCheckService.checkOllama();
const searxngStatus = await HealthCheckService.checkSearxNG();
const supabaseStatus = await HealthCheckService.checkSupabase();

console.log('\n📊 Service Status:');
console.log('- Ollama:', ollamaStatus ? '✅ Running' : '❌ Not Running');
console.log('- SearxNG:', searxngStatus ? '✅ Running' : '❌ Not Running');
console.log('- Supabase:', supabaseStatus ? '✅ Connected' : '❌ Not Connected');

app.listen(port, () => {
console.log(`\n🚀 Server running at http://localhost:${port}`);
console.log('-------------------------------------------');
});
}

startServer().catch(error => {
console.error('Failed to start server:', error);
process.exit(1);
});

src/config.ts

@@ -77,3 +77,16 @@ export const updateConfig = (config: RecursivePartial<Config>) => {
     toml.stringify(config),
   );
 };
+
+export const config = {
+  ollama: {
+    url: process.env.OLLAMA_URL || 'http://localhost:11434',
+    model: process.env.OLLAMA_MODEL || 'mistral',
+    options: {
+      temperature: 0.1,
+      top_p: 0.9,
+      timeout: 30000 // 30 seconds timeout
+    }
+  },
+  // ... other config
+};

src/config/env.ts Normal file

@@ -0,0 +1,68 @@
import { config } from 'dotenv';
import { z } from 'zod';
config();
// Define the environment schema
const envSchema = z.object({
PORT: z.string().default('3000'),
NODE_ENV: z.string().default('development'),
SUPABASE_URL: z.string(),
SUPABASE_KEY: z.string(),
OLLAMA_URL: z.string().default('http://localhost:11434'),
OLLAMA_MODEL: z.string().default('llama2'),
SEARXNG_URL: z.string().default('http://localhost:4000'),
SEARXNG_INSTANCES: z.string().default('["http://localhost:4000"]'),
MAX_RESULTS_PER_QUERY: z.string().default('50'),
CACHE_DURATION_HOURS: z.string().default('24'),
CACHE_DURATION_DAYS: z.string().default('7')
});
// Define the final environment type
export interface EnvConfig {
PORT: string;
NODE_ENV: string;
searxng: {
currentUrl: string;
instances: string[];
};
ollama: {
url: string;
model: string;
};
supabase: {
url: string;
anonKey: string;
};
cache: {
maxResultsPerQuery: number;
durationHours: number;
durationDays: number;
};
}
// Parse and transform the environment variables
const rawEnv = envSchema.parse(process.env);
// Create the final environment object with parsed configurations
export const env: EnvConfig = {
PORT: rawEnv.PORT,
NODE_ENV: rawEnv.NODE_ENV,
searxng: {
currentUrl: rawEnv.SEARXNG_URL,
instances: JSON.parse(rawEnv.SEARXNG_INSTANCES)
},
ollama: {
url: rawEnv.OLLAMA_URL,
model: rawEnv.OLLAMA_MODEL
},
supabase: {
url: rawEnv.SUPABASE_URL,
anonKey: rawEnv.SUPABASE_KEY
},
cache: {
maxResultsPerQuery: parseInt(rawEnv.MAX_RESULTS_PER_QUERY),
durationHours: parseInt(rawEnv.CACHE_DURATION_HOURS),
durationDays: parseInt(rawEnv.CACHE_DURATION_DAYS)
}
};
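
Because the schema is parsed at import time, a missing required variable fails fast at startup; a sketch of what consumers of this module see:

```typescript
import { env } from './config/env';

// Throws a ZodError on import if SUPABASE_URL or SUPABASE_KEY is unset;
// otherwise the parsed, typed configuration is available everywhere.
console.log(`SearxNG instances: ${env.searxng.instances.join(', ')}`);
console.log(`Cache window: ${env.cache.durationDays} days`);
```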

src/config/index.ts Normal file

@@ -0,0 +1,77 @@
import dotenv from 'dotenv';
import path from 'path';
// Load .env file
dotenv.config({ path: path.resolve(__dirname, '../../.env') });
export interface Config {
supabase: {
url: string;
anonKey: string;
};
server: {
port: number;
nodeEnv: string;
};
search: {
maxResultsPerQuery: number;
cacheDurationHours: number;
searxngUrl?: string;
};
rateLimit: {
windowMs: number;
maxRequests: number;
};
security: {
corsOrigin: string;
jwtSecret: string;
};
proxy?: {
http?: string;
https?: string;
};
logging: {
level: string;
};
}
const config: Config = {
supabase: {
url: process.env.SUPABASE_URL || '',
anonKey: process.env.SUPABASE_ANON_KEY || '',
},
server: {
port: parseInt(process.env.PORT || '3000', 10),
nodeEnv: process.env.NODE_ENV || 'development',
},
search: {
maxResultsPerQuery: parseInt(process.env.MAX_RESULTS_PER_QUERY || '20', 10),
cacheDurationHours: parseInt(process.env.CACHE_DURATION_HOURS || '24', 10),
searxngUrl: process.env.SEARXNG_URL
},
rateLimit: {
windowMs: parseInt(process.env.RATE_LIMIT_WINDOW_MS || '900000', 10),
maxRequests: parseInt(process.env.RATE_LIMIT_MAX_REQUESTS || '100', 10),
},
security: {
corsOrigin: process.env.CORS_ORIGIN || 'http://localhost:3000',
jwtSecret: process.env.JWT_SECRET || 'your_jwt_secret_key',
},
logging: {
level: process.env.LOG_LEVEL || 'info',
},
};
// Validate required configuration
const validateConfig = () => {
if (!config.supabase.url) {
throw new Error('SUPABASE_URL is required');
}
if (!config.supabase.anonKey) {
throw new Error('SUPABASE_ANON_KEY is required');
}
};
validateConfig();
export { config };

src/lib/categories.ts Normal file

@@ -0,0 +1,116 @@
export interface Category {
id: string;
name: string;
icon: string;
subcategories: SubCategory[];
}
export interface SubCategory {
id: string;
name: string;
}
export const categories: Category[] = [
{
id: 'real-estate-pros',
name: 'Real Estate Professionals',
icon: '🏢',
subcategories: [
{ id: 'wholesalers', name: 'Real Estate Wholesalers' },
{ id: 'agents', name: 'Real Estate Agents' },
{ id: 'attorneys', name: 'Real Estate Attorneys' },
{ id: 'scouts', name: 'Property Scouts' },
{ id: 'brokers', name: 'Real Estate Brokers' },
{ id: 'consultants', name: 'Real Estate Consultants' }
]
},
{
id: 'legal-title',
name: 'Legal & Title Services',
icon: '⚖️',
subcategories: [
{ id: 'title-companies', name: 'Title Companies' },
{ id: 'closing-attorneys', name: 'Closing Attorneys' },
{ id: 'zoning-consultants', name: 'Zoning Consultants' },
{ id: 'probate-specialists', name: 'Probate Specialists' },
{ id: 'eviction-specialists', name: 'Eviction Specialists' }
]
},
{
id: 'financial',
name: 'Financial Services',
icon: '💰',
subcategories: [
{ id: 'hard-money', name: 'Hard Money Lenders' },
{ id: 'private-equity', name: 'Private Equity Investors' },
{ id: 'mortgage-brokers', name: 'Mortgage Brokers' },
{ id: 'tax-advisors', name: 'Tax Advisors' },
{ id: 'appraisers', name: 'Appraisers' }
]
},
{
id: 'contractors',
name: 'Specialist Contractors',
icon: '🔨',
subcategories: [
{ id: 'general', name: 'General Contractors' },
{ id: 'plumbers', name: 'Plumbers' },
{ id: 'electricians', name: 'Electricians' },
{ id: 'hvac', name: 'HVAC Technicians' },
{ id: 'roofers', name: 'Roofers' },
{ id: 'foundation', name: 'Foundation Specialists' },
{ id: 'asbestos', name: 'Asbestos Removal' },
{ id: 'mold', name: 'Mold Remediation' }
]
},
{
id: 'property-services',
name: 'Property Services',
icon: '🏠',
subcategories: [
{ id: 'surveyors', name: 'Surveyors' },
{ id: 'inspectors', name: 'Inspectors' },
{ id: 'property-managers', name: 'Property Managers' },
{ id: 'environmental', name: 'Environmental Consultants' },
{ id: 'junk-removal', name: 'Junk Removal Services' },
{ id: 'cleaning', name: 'Property Cleaning' }
]
},
{
id: 'marketing',
name: 'Marketing & Lead Gen',
icon: '📢',
subcategories: [
{ id: 'direct-mail', name: 'Direct Mail Services' },
{ id: 'social-media', name: 'Social Media Marketing' },
{ id: 'seo', name: 'SEO Specialists' },
{ id: 'ppc', name: 'PPC Advertising' },
{ id: 'lead-gen', name: 'Lead Generation' },
{ id: 'skip-tracing', name: 'Skip Tracing Services' }
]
},
{
id: 'data-tech',
name: 'Data & Technology',
icon: '💻',
subcategories: [
{ id: 'data-providers', name: 'Property Data Providers' },
{ id: 'crm', name: 'CRM Systems' },
{ id: 'valuation', name: 'Valuation Tools' },
{ id: 'virtual-tours', name: 'Virtual Tour Services' },
{ id: 'automation', name: 'Automation Tools' }
]
},
{
id: 'specialty',
name: 'Specialty Services',
icon: '🎯',
subcategories: [
{ id: 'auction', name: 'Auction Companies' },
{ id: 'relocation', name: 'Relocation Services' },
{ id: 'staging', name: 'Home Staging' },
{ id: 'photography', name: 'Real Estate Photography' },
{ id: 'virtual-assistant', name: 'Virtual Assistants' }
]
}
];
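
The front end requests `/api/categories`; a route along these lines could serve the list (the router file itself is not part of this diff, so treat the path and wiring as assumptions):

```typescript
import { Router } from 'express';
import { categories } from '../lib/categories';

const router = Router();

// Serves the static category tree rendered by public/index.html
router.get('/categories', (_req, res) => {
  res.json(categories);
});

export default router;
```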

src/lib/db/optOutDb.ts Normal file

@@ -0,0 +1,51 @@
import Database from 'better-sqlite3';
import path from 'path';
interface OptOutEntry {
domain: string;
email: string;
reason?: string;
timestamp: Date;
}
export class OptOutDatabase {
private db: Database.Database;
constructor() {
this.db = new Database(path.join(__dirname, '../../../data/optout.db'));
this.initializeDatabase();
}
private initializeDatabase() {
this.db.exec(`
CREATE TABLE IF NOT EXISTS opt_outs (
domain TEXT PRIMARY KEY,
email TEXT NOT NULL,
reason TEXT,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_domain ON opt_outs(domain);
`);
}
async addOptOut(entry: OptOutEntry): Promise<void> {
const stmt = this.db.prepare(
'INSERT OR REPLACE INTO opt_outs (domain, email, reason, timestamp) VALUES (?, ?, ?, ?)'
);
stmt.run(entry.domain, entry.email, entry.reason, entry.timestamp.toISOString());
}
isOptedOut(domain: string): boolean {
const stmt = this.db.prepare('SELECT 1 FROM opt_outs WHERE domain = ?');
return stmt.get(domain) !== undefined;
}
removeOptOut(domain: string): void {
const stmt = this.db.prepare('DELETE FROM opt_outs WHERE domain = ?');
stmt.run(domain);
}
getOptOutList(): OptOutEntry[] {
return this.db.prepare('SELECT * FROM opt_outs').all() as OptOutEntry[];
}
}

src/lib/db/supabase.ts Normal file

@@ -0,0 +1,74 @@
import { createClient } from '@supabase/supabase-js';
import { BusinessData } from '../searxng';
import { env } from '../../config/env';
// Create the Supabase client with validated environment variables
export const supabase = createClient(
env.supabase.url,
env.supabase.anonKey,
{
auth: {
persistSession: false // Since this is a server environment
}
}
);
// Define the cache record type
export interface CacheRecord {
id: string;
query: string;
results: BusinessData[];
location: string;
category: string;
created_at: string;
updated_at: string;
expires_at: string;
}
// Export database helper functions
export async function getCacheEntry(
category: string,
location: string
): Promise<CacheRecord | null> {
const { data, error } = await supabase
.from('search_cache')
.select('*')
.eq('category', category.toLowerCase())
.eq('location', location.toLowerCase())
.gt('expires_at', new Date().toISOString())
.order('created_at', { ascending: false })
.limit(1)
.single();
if (error) {
console.error('Cache lookup failed:', error);
return null;
}
return data;
}
export async function saveCacheEntry(
category: string,
location: string,
results: BusinessData[],
expiresInDays: number = 7
): Promise<void> {
const expiresAt = new Date();
expiresAt.setDate(expiresAt.getDate() + expiresInDays);
const { error } = await supabase
.from('search_cache')
.insert({
query: `${category} in ${location}`,
category: category.toLowerCase(),
location: location.toLowerCase(),
results,
expires_at: expiresAt.toISOString()
});
if (error) {
console.error('Failed to save cache entry:', error);
throw error;
}
}
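
A typical read-through-cache flow using these helpers might look like the following (the upstream `fetchBusinesses` is hypothetical, declared only so the sketch type-checks; import paths assume a caller in `src/`):

```typescript
import { getCacheEntry, saveCacheEntry } from './lib/db/supabase';
import { BusinessData } from './lib/searxng';

// Hypothetical upstream fetcher (e.g. the SearxNG-backed search pipeline)
declare function fetchBusinesses(category: string, location: string): Promise<BusinessData[]>;

async function cachedSearch(category: string, location: string): Promise<BusinessData[]> {
  const hit = await getCacheEntry(category, location);
  if (hit) return hit.results; // served from the Supabase cache

  const results = await fetchBusinesses(category, location);
  await saveCacheEntry(category, location, results); // default 7-day expiry
  return results;
}
```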

src/lib/emailScraper.ts Normal file

@@ -0,0 +1,195 @@
import axios from 'axios';
import * as cheerio from 'cheerio';
import { Cache } from './utils/cache';
import { RateLimiter } from './utils/rateLimiter';
import robotsParser from 'robots-parser';
interface ScrapingResult {
emails: string[];
phones: string[];
addresses: string[];
socialLinks: string[];
source: string;
timestamp: Date;
attribution: string;
}
export class EmailScraper {
private cache: Cache<ScrapingResult>;
private rateLimiter: RateLimiter;
private robotsCache = new Map<string, any>();
constructor(private options = {
timeout: 5000,
cacheTTL: 60,
rateLimit: { windowMs: 60000, maxRequests: 10 }, // More conservative rate limiting
userAgent: 'BizSearch/1.0 (+https://your-domain.com/about) - Business Directory Service'
}) {
this.cache = new Cache<ScrapingResult>(options.cacheTTL);
this.rateLimiter = new RateLimiter(options.rateLimit.windowMs, options.rateLimit.maxRequests);
}
private async checkRobotsPermission(url: string): Promise<boolean> {
try {
const { protocol, host } = new URL(url);
const robotsUrl = `${protocol}//${host}/robots.txt`;
let parser = this.robotsCache.get(host);
if (!parser) {
const response = await axios.get(robotsUrl);
parser = robotsParser(robotsUrl, response.data);
this.robotsCache.set(host, parser);
}
return parser.isAllowed(url, this.options.userAgent);
} catch (error) {
console.warn(`Could not check robots.txt for ${url}:`, error);
return true; // Assume allowed if robots.txt is unavailable
}
}
async scrapeEmails(url: string): Promise<ScrapingResult> {
// Check cache first
const cached = this.cache.get(url);
if (cached) return cached;
// Check robots.txt
const allowed = await this.checkRobotsPermission(url);
if (!allowed) {
console.log(`Respecting robots.txt disallow for ${url}`);
return {
emails: [],
phones: [],
addresses: [],
socialLinks: [],
source: url,
timestamp: new Date(),
attribution: 'Restricted by robots.txt'
};
}
// Wait for rate limiting slot
await this.rateLimiter.waitForSlot();
try {
const response = await axios.get(url, {
timeout: this.options.timeout,
headers: {
'User-Agent': this.options.userAgent,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
});
// Check for noindex meta tag
const $ = cheerio.load(response.data);
if ($('meta[name="robots"][content*="noindex"]').length > 0) {
return {
emails: [],
phones: [],
addresses: [],
socialLinks: [],
source: url,
timestamp: new Date(),
attribution: 'Respecting noindex directive'
};
}
// Only extract contact information from public contact pages or structured data
const isContactPage = /contact|about/i.test(url) ||
$('h1, h2').text().toLowerCase().includes('contact');
const result = {
emails: new Set<string>(),
phones: new Set<string>(),
addresses: new Set<string>(),
socialLinks: new Set<string>(),
source: url,
timestamp: new Date(),
attribution: `Data from public business listing at ${new URL(url).hostname}`
};
// Extract from structured data (Schema.org)
$('script[type="application/ld+json"]').each((_, element) => {
try {
const data = JSON.parse($(element).html() || '{}');
if (data['@type'] === 'LocalBusiness' || data['@type'] === 'Organization') {
if (data.email) result.emails.add(data.email.toLowerCase());
if (data.telephone) result.phones.add(this.formatPhoneNumber(data.telephone));
if (data.address) {
const fullAddress = this.formatAddress(data.address);
if (fullAddress) result.addresses.add(fullAddress);
}
}
} catch (e) {
console.error('Error parsing JSON-LD:', e);
}
});
// Only scrape additional info if it's a contact page
if (isContactPage) {
// Extract clearly marked contact information
$('[itemprop="email"], .contact-email, .email').each((_, element) => {
const email = $(element).text().trim();
if (this.isValidEmail(email)) {
result.emails.add(email.toLowerCase());
}
});
$('[itemprop="telephone"], .phone, .contact-phone').each((_, element) => {
const phone = $(element).text().trim();
const formatted = this.formatPhoneNumber(phone);
if (formatted) result.phones.add(formatted);
});
}
const finalResult = {
...result,
emails: Array.from(result.emails),
phones: Array.from(result.phones),
addresses: Array.from(result.addresses),
socialLinks: Array.from(result.socialLinks)
};
this.cache.set(url, finalResult);
return finalResult;
} catch (error) {
console.error(`Failed to scrape ${url}:`, error);
return {
emails: [],
phones: [],
addresses: [],
socialLinks: [],
source: url,
timestamp: new Date(),
attribution: 'Error accessing page'
};
}
}
private isValidEmail(email: string): boolean {
return /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/.test(email);
}
private formatPhoneNumber(phone: string): string {
const digits = phone.replace(/\D/g, '');
if (digits.length === 10) {
return `(${digits.slice(0,3)}) ${digits.slice(3,6)}-${digits.slice(6)}`;
}
return phone;
}
private formatAddress(address: any): string | null {
if (typeof address === 'string') return address;
if (typeof address === 'object') {
const parts = [
address.streetAddress,
address.addressLocality,
address.addressRegion,
address.postalCode
].filter(Boolean);
if (parts.length > 0) return parts.join(', ');
}
return null;
}
}
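
Example use, assuming the defaults above (cache, rate limit, robots.txt check) are acceptable:

```typescript
import { EmailScraper } from './lib/emailScraper';

const scraper = new EmailScraper();

scraper.scrapeEmails('https://example.com/contact').then(result => {
  // Empty arrays plus an explanatory attribution mean the page was skipped
  // (robots.txt disallow, noindex, or a fetch error) rather than scraped.
  console.log(result.emails, result.phones, result.attribution);
});
```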

@@ -0,0 +1,19 @@
import { Business, SearchParams } from '../../../types/business';
import { WebScraperProvider } from './webScraper';
export class BusinessProvider {
private scraper: WebScraperProvider;
constructor() {
this.scraper = new WebScraperProvider();
}
async search(params: SearchParams): Promise<Business[]> {
return this.scraper.search(params);
}
async getDetails(businessId: string): Promise<Business | null> {
// Implement detailed business lookup using stored data or additional scraping
return null;
}
}

@@ -0,0 +1,111 @@
import { Business, SearchParams } from '../../../types/business';
import { searchWeb } from '../search'; // This is Perplexica's existing search function
import { parseHTML } from '../utils/parser';
export class WebScraperProvider {
async search(params: SearchParams): Promise<Business[]> {
const searchQueries = this.generateQueries(params);
const businesses: Business[] = [];
for (const query of searchQueries) {
// Use Perplexica's existing search functionality
const results = await searchWeb(query, {
maxResults: 20,
type: 'general' // or 'news' depending on what we want
});
for (const result of results) {
try {
const html = await fetch(result.url).then(res => res.text());
const businessData = await this.extractBusinessData(html, result.url);
if (businessData) {
businesses.push(businessData);
}
} catch (error) {
console.error(`Failed to extract data from ${result.url}:`, error);
}
}
}
return this.deduplicateBusinesses(businesses);
}
private generateQueries(params: SearchParams): string[] {
const { location, category } = params;
return [
`${category} in ${location}`,
`${category} business ${location}`,
`best ${category} near ${location}`,
`${category} services ${location} reviews`
];
}
private async extractBusinessData(html: string, sourceUrl: string): Promise<Business | null> {
const $ = parseHTML(html);
// Different extraction logic based on source
if (sourceUrl.includes('yelp.com')) {
return this.extractYelpData($);
} else if (sourceUrl.includes('yellowpages.com')) {
return this.extractYellowPagesData($);
}
// ... other source-specific extractors
return null;
}
private extractYelpData($: any): Business | null {
try {
return {
id: crypto.randomUUID(),
name: $('.business-name').text().trim(),
phone: $('.phone-number').text().trim(),
address: $('.address').text().trim(),
city: $('.city').text().trim(),
state: $('.state').text().trim(),
zip: $('.zip').text().trim(),
category: $('.category-str-list').text().split(',').map(s => s.trim()),
rating: parseFloat($('.rating').text()),
reviewCount: parseInt($('.review-count').text()),
services: $('.services-list').text().split(',').map(s => s.trim()),
hours: this.extractHours($),
website: $('.website-link').attr('href'),
verified: false,
lastUpdated: new Date()
};
} catch (error) {
return null;
}
}
private deduplicateBusinesses(businesses: Business[]): Business[] {
// Group by phone number and address to identify duplicates
const uniqueBusinesses = new Map<string, Business>();
for (const business of businesses) {
const key = `${business.phone}-${business.address}`.toLowerCase();
if (!uniqueBusinesses.has(key)) {
uniqueBusinesses.set(key, business);
} else {
// Merge data if we have additional information
const existing = uniqueBusinesses.get(key)!;
uniqueBusinesses.set(key, this.mergeBusinessData(existing, business));
}
}
return Array.from(uniqueBusinesses.values());
}
private mergeBusinessData(existing: Business, newData: Business): Business {
return {
...existing,
services: [...new Set([...existing.services, ...newData.services])],
rating: (existing.rating + newData.rating) / 2,
reviewCount: existing.reviewCount + newData.reviewCount,
// Keep the most complete data for other fields
website: existing.website || newData.website,
email: existing.email || newData.email,
hours: existing.hours || newData.hours
};
}
}

src/lib/search.ts Normal file

@@ -0,0 +1,54 @@
import axios from 'axios';
import { config } from '../config';
interface SearchOptions {
maxResults?: number;
type?: 'general' | 'news';
engines?: string[];
}
interface SearchResult {
url: string;
title: string;
content: string;
score?: number;
}
export async function searchWeb(
query: string,
options: SearchOptions = {}
): Promise<SearchResult[]> {
const {
maxResults = 20,
type = 'general',
engines = ['google', 'bing', 'duckduckgo']
} = options;
try {
const response = await axios.get(`${config.search.searxngUrl || process.env.SEARXNG_URL}/search`, {
params: {
q: query,
format: 'json',
categories: type,
engines: engines.join(','),
limit: maxResults
}
});
if (!response.data || !response.data.results) {
console.error('Invalid response from SearxNG:', response.data);
return [];
}
return response.data.results.map((result: any) => ({
url: result.url,
title: result.title,
content: result.content || result.snippet || '',
score: result.score
}));
} catch (error) {
console.error('Search failed:', error);
throw error;
}
}

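A quick usage sketch for `searchWeb`, assuming a SearxNG instance is reachable at the configured `SEARXNG_URL` (the query string here is illustrative):

```typescript
import { searchWeb } from './lib/search';

async function demo() {
  const results = await searchWeb('coffee roasters in Portland', {
    maxResults: 10,
    type: 'general',
  });
  for (const r of results) {
    console.log(`${r.title} -> ${r.url}`);
  }
}

demo().catch(console.error);
```
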
View file

@ -0,0 +1,111 @@
import axios from 'axios';
import * as cheerio from 'cheerio';
import { Cache } from '../utils/cache';
import { RateLimiter } from '../utils/rateLimiter';
interface CrawlResult {
mainContent: string;
contactInfo: string;
aboutInfo: string;
structuredData: any;
}
export class BusinessCrawler {
private cache: Cache<CrawlResult>;
private rateLimiter: RateLimiter;
constructor() {
this.cache = new Cache<CrawlResult>(60); // 1 hour cache
this.rateLimiter = new RateLimiter();
}
async crawlBusinessSite(url: string): Promise<CrawlResult> {
// Check cache first
const cached = this.cache.get(url);
if (cached) return cached;
await this.rateLimiter.waitForSlot();
try {
const mainPage = await this.fetchPage(url);
const $ = cheerio.load(mainPage);
// Get all important URLs
const contactUrl = this.findContactPage($, url);
const aboutUrl = this.findAboutPage($, url);
// Crawl additional pages
const [contactPage, aboutPage] = await Promise.all([
contactUrl ? this.fetchPage(contactUrl) : '',
aboutUrl ? this.fetchPage(aboutUrl) : ''
]);
// Extract structured data
const structuredData = this.extractStructuredData($);
const result = {
mainContent: $('body').text(),
contactInfo: contactPage,
aboutInfo: aboutPage,
structuredData
};
this.cache.set(url, result);
return result;
} catch (error) {
console.error(`Failed to crawl ${url}:`, error);
return {
mainContent: '',
contactInfo: '',
aboutInfo: '',
structuredData: {}
};
}
}
private async fetchPage(url: string): Promise<string> {
try {
const response = await axios.get(url, {
timeout: 10000,
headers: {
'User-Agent': 'Mozilla/5.0 (compatible; BizSearch/1.0; +http://localhost:3000/about)',
}
});
return response.data;
} catch (error) {
console.error(`Failed to fetch ${url}:`, error);
return '';
}
}
private findContactPage($: cheerio.CheerioAPI, baseUrl: string): string | null {
const contactLinks = $('a[href*="contact"], a:contains("Contact")');
if (contactLinks.length > 0) {
const href = contactLinks.first().attr('href');
return href ? new URL(href, baseUrl).toString() : null;
}
return null;
}
private findAboutPage($: cheerio.CheerioAPI, baseUrl: string): string | null {
const aboutLinks = $('a[href*="about"], a:contains("About")');
if (aboutLinks.length > 0) {
const href = aboutLinks.first().attr('href');
return href ? new URL(href, baseUrl).toString() : null;
}
return null;
}
private extractStructuredData($: cheerio.CheerioAPI): any {
const structuredData: any[] = [];
$('script[type="application/ld+json"]').each((_, element) => {
try {
const data = JSON.parse($(element).html() || '{}');
structuredData.push(data);
} catch (error) {
console.error('Failed to parse structured data:', error);
}
});
return structuredData;
}
}

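A minimal sketch of driving the crawler; the import path and target URL are assumptions:

```typescript
import { BusinessCrawler } from './lib/businessCrawler'; // path assumed

const crawler = new BusinessCrawler();

async function inspect(url: string) {
  // Calls are rate limited and cached for an hour, so retries are cheap
  const result = await crawler.crawlBusinessSite(url);
  console.log('JSON-LD blocks:', result.structuredData.length);
  console.log('Found a contact page:', result.contactInfo.length > 0);
}

inspect('https://example.com').catch(console.error);
```
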
View file

@ -0,0 +1,71 @@
import { supabase } from '../supabase';
import { BusinessData } from '../searxng';
export class CacheService {
static async getCachedResults(category: string, location: string): Promise<BusinessData[] | null> {
try {
const { data, error } = await supabase
.from('search_cache')
.select('results')
.eq('category', category.toLowerCase())
.eq('location', location.toLowerCase())
.gt('expires_at', new Date().toISOString())
.order('created_at', { ascending: false })
.limit(1)
.single();
if (error) throw error;
return data ? data.results : null;
} catch (error) {
console.error('Cache lookup failed:', error);
return null;
}
}
static async cacheResults(
category: string,
location: string,
results: BusinessData[],
expiresInDays: number = 7
): Promise<void> {
try {
const expiresAt = new Date();
expiresAt.setDate(expiresAt.getDate() + expiresInDays);
const { error } = await supabase
.from('search_cache')
.insert({
query: `${category} in ${location}`,
category: category.toLowerCase(),
location: location.toLowerCase(),
results,
expires_at: expiresAt.toISOString()
});
if (error) throw error;
} catch (error) {
console.error('Failed to cache results:', error);
}
}
static async updateCache(
category: string,
location: string,
newResults: BusinessData[]
): Promise<void> {
try {
const { error } = await supabase
.from('search_cache')
.update({
results: newResults,
updated_at: new Date().toISOString()
})
.eq('category', category.toLowerCase())
.eq('location', location.toLowerCase());
if (error) throw error;
} catch (error) {
console.error('Failed to update cache:', error);
}
}
}

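The intended cache-aside flow, sketched under assumptions: the `BusinessData` import path and the `runLiveSearch` stand-in are hypothetical, only there to make the sketch self-contained:

```typescript
import { CacheService } from './lib/services/cacheService'; // path assumed
import type { BusinessData } from './lib/types';

// Stand-in for the real SearxNG-backed search, declared only for the sketch
declare function runLiveSearch(category: string, location: string): Promise<BusinessData[]>;

async function getBusinesses(category: string, location: string): Promise<BusinessData[]> {
  // Serve fresh cached rows when present...
  const cached = await CacheService.getCachedResults(category, location);
  if (cached) return cached;

  // ...otherwise search live and backfill the cache for a week
  const fresh = await runLiveSearch(category, location);
  await CacheService.cacheResults(category, location, fresh, 7);
  return fresh;
}
```
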
View file

@ -0,0 +1,107 @@
import { OllamaService } from './ollamaService';
interface ValidatedBusinessData {
name: string;
phone: string;
email: string;
address: string;
description: string;
hours?: string;
isValid: boolean;
}
export class DataValidationService {
private ollama: OllamaService;
constructor() {
this.ollama = new OllamaService();
}
async validateAndCleanData(rawText: string): Promise<ValidatedBusinessData> {
try {
const prompt = `
You are a business data validation expert. Extract and validate business information from the following text.
Return ONLY a JSON object with the following format, nothing else:
{
"name": "verified business name",
"phone": "formatted phone number or N/A",
"email": "verified email address or N/A",
"address": "verified physical address or N/A",
"description": "short business description",
"hours": "business hours if available",
"isValid": boolean
}
Rules:
1. Phone numbers should be in (XXX) XXX-XXXX format
2. Addresses should be properly formatted with street, city, state, zip
3. Remove any irrelevant text from descriptions
4. Set isValid to true only if name and at least one contact method is found
5. Clean up any obvious formatting issues
6. Validate email addresses for proper format
Text to analyze:
${rawText}
`;
const response = await this.ollama.complete(prompt);
try {
// Find the JSON object in the response
const jsonMatch = response.match(/\{[\s\S]*\}/);
if (!jsonMatch) {
throw new Error('No JSON found in response');
}
const result = JSON.parse(jsonMatch[0]);
return this.validateResult(result);
} catch (parseError) {
console.error('Failed to parse Ollama response:', parseError);
throw parseError;
}
} catch (error) {
console.error('Data validation failed:', error);
return {
name: 'Unknown',
phone: 'N/A',
email: 'N/A',
address: 'N/A',
description: '',
hours: '',
isValid: false
};
}
}
private validateResult(result: any): ValidatedBusinessData {
// Ensure all required fields are present
const validated: ValidatedBusinessData = {
name: this.cleanField(result.name) || 'Unknown',
phone: this.formatPhone(result.phone) || 'N/A',
email: this.cleanField(result.email) || 'N/A',
address: this.cleanField(result.address) || 'N/A',
description: this.cleanField(result.description) || '',
hours: this.cleanField(result.hours),
isValid: Boolean(result.isValid)
};
return validated;
}
private cleanField(value: any): string {
if (!value || typeof value !== 'string') return '';
return value.trim().replace(/\s+/g, ' ');
}
private formatPhone(phone: string): string {
if (!phone || phone === 'N/A') return 'N/A';
// Extract digits
const digits = phone.replace(/\D/g, '');
if (digits.length === 10) {
return `(${digits.slice(0,3)}) ${digits.slice(3,6)}-${digits.slice(6)}`;
}
return phone;
}
}

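A usage sketch, assuming the file lives at the path below and Ollama is running locally; the raw text is an invented example:

```typescript
import { DataValidationService } from './lib/services/dataValidationService'; // path assumed

async function demo() {
  const validator = new DataValidationService();
  const raw = `Joe's Plumbing, Denver's best! Call 3035551234
or email joe@joesplumbing.com. 1234 Main St, Denver, CO 80202.`;

  const clean = await validator.validateAndCleanData(raw);
  // Expect phone normalized to (303) 555-1234 and isValid true,
  // since a name and at least one contact method are present
  console.log(clean);
}

demo().catch(console.error);
```
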
View file

@ -0,0 +1,53 @@
import axios from 'axios';
import { env } from '../../config/env';
import { supabase } from '../supabase';
export class HealthCheckService {
static async checkOllama(): Promise<boolean> {
try {
const response = await axios.get(`${env.ollama.url}/api/tags`);
return response.status === 200;
} catch (error) {
console.error('Ollama health check failed:', error);
return false;
}
}
static async checkSearxNG(): Promise<boolean> {
try {
const response = await axios.get(`${env.searxng.currentUrl}/config`);
return response.status === 200;
} catch (error) {
try {
const response = await axios.get(`${env.searxng.instances[0]}/config`);
return response.status === 200;
} catch (fallbackError) {
console.error('SearxNG health check failed on primary and fallback:', error, fallbackError);
return false;
}
}
}
static async checkSupabase(): Promise<boolean> {
try {
console.log('Checking Supabase connection...');
console.log('URL:', env.supabase.url);
// Just check if we can connect and query, don't care about results
const { error } = await supabase
.from('businesses')
.select('count', { count: 'planned', head: true });
if (error) {
console.error('Supabase query error:', error);
return false;
}
console.log('Supabase connection successful');
return true;
} catch (error) {
console.error('Supabase connection failed:', error);
return false;
}
}
}

View file

@ -0,0 +1,36 @@
import axios from 'axios';
import { env } from '../../config/env';
interface OllamaResponse {
response: string;
context?: number[];
}
export class OllamaService {
private url: string;
private model: string;
constructor() {
this.url = env.ollama.url;
this.model = env.ollama.model;
}
async complete(prompt: string): Promise<string> {
try {
const response = await axios.post(`${this.url}/api/generate`, {
model: this.model,
prompt: prompt,
stream: false,
options: {
temperature: 0.7,
top_p: 0.9
}
});
return response.data.response;
} catch (error) {
console.error('Ollama completion failed:', error);
throw error;
}
}
}

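A one-off completion call, sketched for reference (the prompt is illustrative; assumes Ollama is serving the configured model at `OLLAMA_URL`):

```typescript
import { OllamaService } from './lib/services/ollamaService';

async function demo() {
  const ollama = new OllamaService();
  const summary = await ollama.complete(
    'Describe this business in one sentence: 24/7 emergency plumbing, licensed and insured.'
  );
  console.log(summary);
}

demo().catch(console.error);
```
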
View file

@ -0,0 +1,93 @@
import { createClient } from '@supabase/supabase-js';
import { env } from '../../config/env';
import { BusinessData } from '../searxng';
export class SupabaseService {
private supabase;
constructor() {
this.supabase = createClient(env.supabase.url, env.supabase.anonKey);
}
async upsertBusinesses(businesses: BusinessData[]): Promise<void> {
try {
console.log('Upserting businesses to Supabase:', businesses.length);
for (const business of businesses) {
try {
// Create a unique identifier based on multiple properties
const identifier = [
business.name.toLowerCase(),
business.phone?.replace(/\D/g, ''),
business.address?.toLowerCase(),
business.website?.toLowerCase()
]
.filter(Boolean) // Remove empty values
.join('_') // Join with underscore
.replace(/[^a-z0-9]/g, '_'); // Replace non-alphanumeric chars
// Log the data being inserted
console.log('Upserting business:', {
id: identifier,
name: business.name,
phone: business.phone,
email: business.email,
address: business.address,
rating: business.rating,
website: business.website,
location: business.location
});
// Check if business exists
const { data: existing, error: selectError } = await this.supabase
.from('businesses')
.select('rating, search_count')
.eq('id', identifier)
.single();
if (selectError && selectError.code !== 'PGRST116') {
console.error('Error checking existing business:', selectError);
}
// Prepare upsert data
const upsertData = {
id: identifier,
name: business.name,
phone: business.phone || null,
email: business.email || null,
address: business.address || null,
rating: existing ? Math.max(business.rating, existing.rating) : business.rating,
website: business.website || null,
logo: business.logo || null,
source: business.source || null,
description: business.description || null,
latitude: business.location?.lat || null,
longitude: business.location?.lng || null,
last_updated: new Date().toISOString(),
search_count: existing ? existing.search_count + 1 : 1
};
console.log('Upserting with data:', upsertData);
const { error: upsertError } = await this.supabase
.from('businesses')
.upsert(upsertData, {
onConflict: 'id'
});
if (upsertError) {
console.error('Error upserting business:', upsertError);
console.error('Failed business data:', upsertData);
} else {
console.log(`Successfully upserted business: ${business.name}`);
}
} catch (businessError) {
console.error('Error processing business:', business.name, businessError);
}
}
} catch (error) {
console.error('Error saving businesses to Supabase:', error);
throw error;
}
}
}

42
src/lib/supabase.ts Normal file
View file

@ -0,0 +1,42 @@
import { createClient } from '@supabase/supabase-js';
import { env } from '../config/env';
// Validate Supabase configuration
if (!env.supabase.url || !env.supabase.anonKey) {
throw new Error('Missing Supabase configuration');
}
// Create Supabase client
export const supabase = createClient(
env.supabase.url,
env.supabase.anonKey,
{
auth: {
autoRefreshToken: true,
persistSession: true
}
}
);
// Test the connection on startup
async function testConnection() {
try {
console.log('Checking Supabase connection...');
console.log('URL:', env.supabase.url);
const { error } = await supabase
.from('businesses')
.select('count', { count: 'planned', head: true });
if (error) {
console.error('❌ Supabase initialization error:', error);
} else {
console.log('✅ Supabase connection initialized successfully');
}
} catch (error) {
console.error('❌ Failed to initialize Supabase:', error);
}
}
// Run the test
testConnection().catch(console.error);

28
src/lib/types.ts Normal file
View file

@ -0,0 +1,28 @@
export interface BusinessData {
id?: string;
name: string;
phone?: string;
email?: string;
address?: string;
rating?: number;
website?: string;
logo?: string;
source?: string;
description?: string;
location?: {
lat: number;
lng: number;
};
latitude?: number;
longitude?: number;
place_id?: string;
photos?: string[];
openingHours?: string[];
distance?: {
value: number;
unit: string;
};
last_updated?: string;
search_count?: number;
created_at?: string;
}

36
src/lib/utils/cache.ts Normal file
View file

@ -0,0 +1,36 @@
interface CacheItem<T> {
data: T;
timestamp: number;
}
export class Cache<T> {
private store = new Map<string, CacheItem<T>>();
private ttl: number;
constructor(ttlMinutes: number = 60) {
this.ttl = ttlMinutes * 60 * 1000;
}
set(key: string, value: T): void {
this.store.set(key, {
data: value,
timestamp: Date.now()
});
}
get(key: string): T | null {
const item = this.store.get(key);
if (!item) return null;
if (Date.now() - item.timestamp > this.ttl) {
this.store.delete(key);
return null;
}
return item.data;
}
clear(): void {
this.store.clear();
}
}

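A quick demonstration of the TTL behavior, assuming the import path below:

```typescript
import { Cache } from './lib/utils/cache';

// A 5-minute cache: stale entries are evicted lazily on read
const pageCache = new Cache<string>(5);
pageCache.set('https://example.com', '<html>...</html>');

console.log(pageCache.get('https://example.com')); // hit within 5 minutes
// A get() after the TTL elapses returns null and deletes the stale entry
```
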
View file

@ -0,0 +1,30 @@
import { Business } from '../../types/business'; // needed by calculateReliabilityScore below

export function normalizePhoneNumber(phone: string): string {
return phone.replace(/[^\d]/g, '');
}
export function normalizeAddress(address: string): string {
// Remove common suffixes and standardize format
return address
.toLowerCase()
.replace(/(street|st\.?|avenue|ave\.?|road|rd\.?)/g, '')
.trim();
}
export function extractZipCode(text: string): string | null {
const match = text.match(/\b\d{5}(?:-\d{4})?\b/);
return match ? match[0] : null;
}
export function calculateReliabilityScore(business: Business): number {
let score = 0;
// More complete data = higher score
if (business.phone) score += 2;
if (business.website) score += 1;
if (business.email) score += 1;
if (business.hours) score += 2;
if (business.services.length > 0) score += 1;
if (business.reviewCount > 10) score += 2;
return score;
}

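To see how the normalizers support deduplication, a small sketch (the module path is assumed):

```typescript
import { normalizePhoneNumber, normalizeAddress, extractZipCode } from './lib/utils/normalizers'; // path assumed

// Two listings for the same shop collapse to one key once normalized
const keyA = `${normalizePhoneNumber('(303) 555-1234')}-${normalizeAddress('1234 Main Street')}`;
const keyB = `${normalizePhoneNumber('303.555.1234')}-${normalizeAddress('1234 Main St.')}`;
console.log(keyA === keyB); // true: both reduce to "3035551234-1234 main"

console.log(extractZipCode('Denver, CO 80202-1234')); // "80202-1234"
```
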
View file

@ -0,0 +1,23 @@
export class RateLimiter {
private timestamps: number[] = [];
private readonly windowMs: number;
private readonly maxRequests: number;
constructor(windowMs: number = 60000, maxRequests: number = 30) {
this.windowMs = windowMs;
this.maxRequests = maxRequests;
}
async waitForSlot(): Promise<void> {
const now = Date.now();
this.timestamps = this.timestamps.filter(time => now - time < this.windowMs);
if (this.timestamps.length >= this.maxRequests) {
const oldestRequest = this.timestamps[0];
const waitTime = this.windowMs - (now - oldestRequest);
await new Promise(resolve => setTimeout(resolve, waitTime));
}
this.timestamps.push(now);
}
}

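A usage sketch of the sliding-window limiter, with illustrative numbers (the defaults are 30 requests per minute):

```typescript
import { RateLimiter } from './lib/utils/rateLimiter';

// At most 5 requests per 10-second sliding window
const limiter = new RateLimiter(10_000, 5);

async function politeFetchAll(urls: string[]): Promise<void> {
  for (const url of urls) {
    await limiter.waitForSlot(); // sleeps until the oldest timestamp ages out
    await fetch(url);
  }
}
```
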
View file

@ -0,0 +1,119 @@
import * as cheerio from 'cheerio';
interface StructuredData {
name?: string;
email?: string;
phone?: string;
address?: string;
socialProfiles?: string[];
openingHours?: Record<string, string>;
description?: string;
}
export class StructuredDataParser {
static parse($: cheerio.CheerioAPI): StructuredData[] {
const results: StructuredData[] = [];
// Parse JSON-LD
$('script[type="application/ld+json"]').each((_, element) => {
try {
const data = JSON.parse($(element).html() || '{}');
if (Array.isArray(data)) {
data.forEach(item => this.parseStructuredItem(item, results));
} else {
this.parseStructuredItem(data, results);
}
} catch (e) {
console.error('Error parsing JSON-LD:', e);
}
});
// Parse microdata
$('[itemtype]').each((_, element) => {
const type = $(element).attr('itemtype');
if (type?.includes('Organization') || type?.includes('LocalBusiness')) {
const data: StructuredData = {
name: $('[itemprop="name"]', element).text(),
email: $('[itemprop="email"]', element).text(),
phone: $('[itemprop="telephone"]', element).text(),
address: this.extractMicrodataAddress($, element),
socialProfiles: this.extractSocialProfiles($, element)
};
results.push(data);
}
});
// Parse RDFa
$('[typeof="Organization"], [typeof="LocalBusiness"]').each((_, element) => {
const data: StructuredData = {
name: $('[property="name"]', element).text(),
email: $('[property="email"]', element).text(),
phone: $('[property="telephone"]', element).text(),
address: this.extractRdfaAddress($, element),
socialProfiles: this.extractSocialProfiles($, element)
};
results.push(data);
});
return results;
}
private static parseStructuredItem(data: any, results: StructuredData[]): void {
if (data['@type'] === 'Organization' || data['@type'] === 'LocalBusiness') {
results.push({
name: data.name,
email: data.email,
phone: data.telephone,
address: this.formatAddress(data.address),
socialProfiles: this.extractSocialUrls(data),
openingHours: this.parseOpeningHours(data.openingHours),
description: data.description
});
}
}
private static formatAddress(address: any): string | undefined {
if (typeof address === 'string') return address;
if (typeof address === 'object') {
const parts = [
address.streetAddress,
address.addressLocality,
address.addressRegion,
address.postalCode,
address.addressCountry
].filter(Boolean);
return parts.join(', ');
}
return undefined;
}
private static extractSocialUrls(data: any): string[] {
const urls: string[] = [];
if (data.sameAs) {
if (Array.isArray(data.sameAs)) {
urls.push(...data.sameAs);
} else if (typeof data.sameAs === 'string') {
urls.push(data.sameAs);
}
}
return urls;
}
private static parseOpeningHours(hours: any): Record<string, string> | undefined {
if (!hours) return undefined;
if (Array.isArray(hours)) {
const schedule: Record<string, string> = {};
hours.forEach(spec => {
const match = spec.match(/^(\w+)(-\w+)?\s+(\d\d:\d\d)-(\d\d:\d\d)$/);
if (match) {
schedule[match[1]] = `${match[3]}-${match[4]}`;
}
});
return schedule;
}
return undefined;
}
// ... helper methods for microdata and RDFa parsing ...
}

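A minimal sketch of parsing a JSON-LD block (the module path and HTML snippet are invented for illustration):

```typescript
import * as cheerio from 'cheerio';
import { StructuredDataParser } from './lib/utils/structuredDataParser'; // path assumed

const html = `
  <script type="application/ld+json">
    {"@type": "LocalBusiness", "name": "Joe's Plumbing",
     "telephone": "(303) 555-1234", "sameAs": ["https://facebook.com/joesplumbing"]}
  </script>`;

const $ = cheerio.load(html);
const [business] = StructuredDataParser.parse($);
console.log(business.name, business.phone, business.socialProfiles);
```
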
88
src/routes/api.ts Normal file
View file

@ -0,0 +1,88 @@
import { Router } from 'express';
import { searchBusinesses } from '../lib/searxng';
import { categories } from '../lib/categories';
import { supabase } from '../lib/supabase';
import { BusinessData } from '../lib/types';
const router = Router();
// Categories endpoint
router.get('/categories', (req, res) => {
res.json(categories);
});
// Search endpoint
router.get('/search', async (req, res) => {
try {
const query = req.query.q as string;
if (!query) {
return res.status(400).json({ error: 'Search query is required' });
}
// Queries are expected as "<term> in <location>"; guard before splitting so a missing q can't throw
const [searchTerm, location = ''] = query.split(' in ');
// Set headers for streaming response
res.setHeader('Content-Type', 'application/json');
res.setHeader('Transfer-Encoding', 'chunked');
// First, search in Supabase
const { data: existingResults, error: dbError } = await supabase
.from('businesses')
.select('*')
.or(`name.ilike.%${searchTerm}%, description.ilike.%${searchTerm}%`)
.ilike('address', `%${location}%`);
if (dbError) {
console.error('Supabase search error:', dbError);
}
// Send existing results immediately if there are any
if (existingResults && existingResults.length > 0) {
const chunk = JSON.stringify({
source: 'database',
results: existingResults
}) + '\n';
res.write(chunk);
}
// Start background search
const searchPromise = searchBusinesses(query, {
onProgress: (status, progress) => {
const chunk = JSON.stringify({
source: 'search',
status,
progress,
}) + '\n';
res.write(chunk);
}
});
const results = await searchPromise;
// Send final results
const finalChunk = JSON.stringify({
source: 'search',
results,
complete: true
}) + '\n';
res.write(finalChunk);
res.end();
} catch (error: unknown) {
console.error('Search error:', error);
const errorResponse = {
error: 'An error occurred while searching',
details: error instanceof Error ? error.message : 'Unknown error'
};
// Only send error response if headers haven't been sent
if (!res.headersSent) {
res.status(500).json(errorResponse);
} else {
res.write(JSON.stringify(errorResponse));
res.end();
}
}
});
export default router;

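Since the endpoint streams newline-delimited JSON chunks, a client has to read the body incrementally. A hypothetical consumer, assuming the server from this commit on port 3000:

```typescript
// Each line is a JSON object tagged with its source:
// "database" (instant cached rows) or "search" (progress updates, then final results)
async function streamSearch(query: string): Promise<void> {
  const res = await fetch(`http://localhost:3000/api/search?q=${encodeURIComponent(query)}`);
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    let nl;
    while ((nl = buffer.indexOf('\n')) >= 0) {
      const line = buffer.slice(0, nl).trim();
      buffer = buffer.slice(nl + 1);
      if (line) console.log(JSON.parse(line));
    }
  }
}

streamSearch('plumbers in Denver').catch(console.error);
```
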
102
src/test-supabase.ts Normal file
View file

@ -0,0 +1,102 @@
import { createClient } from '@supabase/supabase-js';
import dotenv from 'dotenv';
// Load environment variables
dotenv.config();
async function testSupabaseConnection() {
console.log('Testing Supabase connection...');
console.log('URL:', process.env.SUPABASE_URL);
console.log('Key length:', process.env.SUPABASE_KEY?.length || 0);
try {
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_KEY!,
{
auth: {
autoRefreshToken: true,
persistSession: true
}
}
);
// Test businesses table
console.log('\nTesting businesses table:');
const testBusiness = {
id: 'test_' + Date.now(),
name: 'Test Business',
phone: '123-456-7890',
email: 'test@example.com',
address: '123 Test St',
rating: 5,
website: 'https://test.com',
source: 'test',
description: 'Test description',
latitude: 39.7392,
longitude: -104.9903,
search_count: 1,
created_at: new Date().toISOString()
};
const { error: insertBusinessError } = await supabase
.from('businesses')
.insert([testBusiness])
.select();
if (insertBusinessError) {
console.error('❌ INSERT business error:', insertBusinessError);
} else {
console.log('✅ INSERT business OK');
// Clean up
await supabase.from('businesses').delete().eq('id', testBusiness.id);
}
// Test searches table
console.log('\nTesting searches table:');
const testSearch = {
query: 'test query',
location: 'test location',
results_count: 0,
timestamp: new Date().toISOString()
};
const { error: insertSearchError } = await supabase
.from('searches')
.insert([testSearch])
.select();
if (insertSearchError) {
console.error('❌ INSERT search error:', insertSearchError);
} else {
console.log('✅ INSERT search OK');
}
// Test cache table
console.log('\nTesting cache table:');
const testCache = {
key: 'test_key_' + Date.now(),
value: { test: true },
created_at: new Date().toISOString(),
expires_at: new Date(Date.now() + 3600000).toISOString()
};
const { error: insertCacheError } = await supabase
.from('cache')
.insert([testCache])
.select();
if (insertCacheError) {
console.error('❌ INSERT cache error:', insertCacheError);
} else {
console.log('✅ INSERT cache OK');
// Clean up
await supabase.from('cache').delete().eq('key', testCache.key);
}
} catch (error: any) {
console.error('❌ Unexpected error:', error);
}
}
testSupabaseConnection().catch(console.error);

94
src/tests/supabaseTest.ts Normal file
View file

@ -0,0 +1,94 @@
import '../config/env'; // Load env vars first
import { CacheService } from '../lib/services/cacheService';
import type { PostgrestError } from '@supabase/supabase-js';
import { env } from '../config/env';
async function testSupabaseConnection() {
console.log('\n🔍 Testing Supabase Connection...');
console.log('Using Supabase URL:', env.supabase.url);
try {
// Test data
const testData = {
category: 'test_category',
location: 'test_location',
results: [{
name: 'Test Business',
phone: '123-456-7890',
email: 'test@example.com',
address: '123 Test St, Test City, TS 12345',
rating: 95,
website: 'https://test.com',
logo: '',
source: 'test',
description: 'Test business description'
}]
};
console.log('\n1️⃣ Testing write operation...');
await CacheService.cacheResults(
testData.category,
testData.location,
testData.results,
env.cache.durationDays
);
console.log('✅ Write successful');
console.log('\n2️⃣ Testing read operation...');
const cachedResults = await CacheService.getCachedResults(
testData.category,
testData.location
);
if (cachedResults && cachedResults.length > 0) {
console.log('✅ Read successful');
console.log('\nCached data:', JSON.stringify(cachedResults[0], null, 2));
} else {
throw new Error('No results found in cache');
}
console.log('\n3️⃣ Testing update operation...');
const updatedResults = [...testData.results];
updatedResults[0].rating = 98;
await CacheService.updateCache(
testData.category,
testData.location,
updatedResults
);
console.log('✅ Update successful');
console.log('\n✨ All tests passed! Supabase connection is working properly.\n');
} catch (error: unknown) {
console.error('\n❌ Test failed:');
if (error instanceof Error) {
console.error('Error message:', error.message);
// Check if it's a Supabase error by looking at the shape of the error object
const isSupabaseError = (err: any): err is PostgrestError =>
'code' in err && 'details' in err && 'hint' in err && 'message' in err;
if (error.message.includes('connection') || isSupabaseError(error)) {
console.log('\n📋 Troubleshooting steps:');
console.log('1. Check if your SUPABASE_URL and SUPABASE_ANON_KEY are correct in .env');
console.log('2. Verify that the search_cache table exists in your Supabase project');
console.log('3. Check if RLS policies are properly configured');
if (isSupabaseError(error)) {
console.log('\nSupabase error details:');
console.log('Code:', error.code);
console.log('Details:', error.details);
console.log('Hint:', error.hint);
}
}
} else {
console.error('Unknown error:', error);
}
process.exit(1);
}
}
// Run the test
testSupabaseConnection();

26
src/tests/testSearch.ts Normal file
View file

@ -0,0 +1,26 @@
import { searchSearxng } from '../lib/searxng';
async function testSearchEngine() {
try {
console.log('Testing SearxNG connection...');
const results = await searchSearxng('plumbers in Denver', {
engines: ['google', 'bing', 'duckduckgo'],
pageno: 1
});
if (results && results.results && results.results.length > 0) {
console.log('✅ Search successful!');
console.log('Number of results:', results.results.length);
console.log('First result:', results.results[0]);
} else {
console.log('❌ No results found');
}
} catch (error) {
console.error('❌ Search test failed:', error);
console.error('Make sure SearxNG is running on http://localhost:4000');
}
}
testSearchEngine();

28
src/types/business.ts Normal file
View file

@ -0,0 +1,28 @@
export interface Business {
id: string;
name: string;
phone: string;
address: string;
city: string;
state: string;
zip: string;
category: string[];
rating: number;
reviewCount: number;
license?: string;
services: string[];
hours: Record<string, string>;
website?: string;
email?: string;
verified: boolean;
lastUpdated: Date;
}
export interface SearchParams {
location: string;
category?: string;
radius?: number;
minRating?: number;
sortBy?: 'rating' | 'distance' | 'reviewCount';
verified?: boolean;
}

View file

@ -1,18 +1,17 @@
 {
   "compilerOptions": {
-    "lib": ["ESNext"],
-    "module": "Node16",
-    "moduleResolution": "Node16",
-    "target": "ESNext",
-    "outDir": "dist",
-    "sourceMap": false,
+    "target": "ES2020",
+    "module": "commonjs",
+    "lib": ["es2020", "DOM"],
+    "outDir": "./dist",
+    "rootDir": "./src",
+    "strict": true,
     "esModuleInterop": true,
-    "experimentalDecorators": true,
-    "emitDecoratorMetadata": true,
-    "allowSyntheticDefaultImports": true,
     "skipLibCheck": true,
-    "skipDefaultLibCheck": true
+    "forceConsistentCasingInFileNames": true,
+    "moduleResolution": "node",
+    "resolveJsonModule": true
   },
-  "include": ["src"],
-  "exclude": ["node_modules", "**/*.spec.ts"]
+  "include": ["src/**/*"],
+  "exclude": ["node_modules", "dist"]
 }