# Ethical Web Scraping Guidelines

## Core Principles

1. **Respect robots.txt**
   - Always check and honor `robots.txt` directives (a checking sketch follows this list)
   - Cache `robots.txt` to reduce server load
   - Default to conservative behavior when uncertain
2. **Proper Identification**
   - Use a clear, identifiable User-Agent string
   - Provide contact information
   - Be transparent about your purpose
3. **Rate Limiting**
   - Implement conservative rate limits
   - Use exponential backoff for errors
   - Distribute requests over time
4. **Data Usage**
   - Only collect publicly available business information
   - Respect privacy and data protection laws
   - Provide clear opt-out mechanisms
   - Keep data accurate and up to date
5. **Technical Considerations**
   - Cache results to minimize requests
   - Handle errors gracefully
   - Monitor and log access patterns
   - Use structured data when available
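
A minimal sketch of the first principle, assuming Node 18+ (for the global `fetch`) and the `robots-parser` npm package; the `isAllowed` helper and in-memory `robotsCache` are illustrative names, not an existing API:

```js
const robotsParser = require('robots-parser');

const robotsCache = new Map(); // origin -> parsed robots.txt

async function isAllowed(targetUrl, userAgent) {
  const { origin } = new URL(targetUrl);
  let robots = robotsCache.get(origin);
  if (!robots) {
    const robotsUrl = `${origin}/robots.txt`;
    try {
      const res = await fetch(robotsUrl);
      robots = robotsParser(robotsUrl, res.ok ? await res.text() : '');
    } catch {
      return false; // conservative default: skip the URL when robots.txt is unreachable
    }
    robotsCache.set(origin, robots);
  }
  return robots.isAllowed(targetUrl, userAgent) !== false;
}
```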

## Implementation

1. **Request Headers**

   ```js
   const headers = {
     'User-Agent': 'BizSearch/1.0 (+https://bizsearch.com/about)',
     'Accept': 'text/html,application/xhtml+xml',
     'From': 'contact@bizsearch.com'
   };
   ```

2. **Rate Limiting** (an enforcement sketch follows this list)

   ```js
   const rateLimits = {
     requestsPerMinute: 10,
     requestsPerHour: 100,
     requestsPerDomain: 20
   };
   ```

3. **Caching** (a cache sketch follows this list)

   ```js
   const cacheSettings = {
     ttl: 24 * 60 * 60, // 24 hours, in seconds
     maxSize: 1000 // maximum number of cached entries
   };
   ```
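
As a sketch of how the `rateLimits` and `headers` values above might be enforced, here is a hypothetical per-domain sliding-window limiter; `requestLog`, `waitForSlot`, and `politeFetch` are illustrative names, not an existing API:

```js
const requestLog = new Map(); // domain -> timestamps (ms) of recent requests

// Wait until the per-minute budget for this domain frees up, then record the request.
async function waitForSlot(domain) {
  const windowMs = 60 * 1000;
  const now = Date.now();
  const recent = (requestLog.get(domain) ?? []).filter((t) => now - t < windowMs);
  if (recent.length >= rateLimits.requestsPerMinute) {
    // Sleep until the oldest request in the window expires.
    await new Promise((resolve) => setTimeout(resolve, windowMs - (now - recent[0])));
  }
  recent.push(Date.now());
  requestLog.set(domain, recent);
}

// Identify ourselves on every request, and respect the rate limit first.
async function politeFetch(url) {
  await waitForSlot(new URL(url).hostname);
  return fetch(url, { headers });
}
```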
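
And a minimal in-memory cache honoring `cacheSettings`; a production deployment would more likely use Redis or similar, so treat this as an illustration only:

```js
const cache = new Map(); // url -> { value, expiresAt }

function cacheGet(url) {
  const entry = cache.get(url);
  if (!entry || Date.now() > entry.expiresAt) {
    cache.delete(url); // missing or expired
    return null;
  }
  return entry.value;
}

function cacheSet(url, value) {
  if (cache.size >= cacheSettings.maxSize) {
    // Maps iterate in insertion order, so the first key is the oldest entry.
    cache.delete(cache.keys().next().value);
  }
  cache.set(url, { value, expiresAt: Date.now() + cacheSettings.ttl * 1000 });
}
```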

## Opt-Out Process

1. Business owners can opt out by:
   - Submitting a form on our website
   - Emailing opt-out@bizsearch.com
   - Adding a meta tag: `<meta name="bizsearch" content="noindex">` (a detection sketch follows this list)
2. We honor opt-outs within:
   - 24 hours for direct requests
   - 72 hours for cached data
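
A sketch of how the meta-tag opt-out could be detected on our side, assuming the `cheerio` npm package; `hasOptedOut` is an illustrative helper name:

```js
const cheerio = require('cheerio');

// True if the fetched page carries the BizSearch opt-out meta tag.
function hasOptedOut(html) {
  const $ = cheerio.load(html);
  const content = $('meta[name="bizsearch"]').attr('content');
  return typeof content === 'string' && content.toLowerCase().includes('noindex');
}
```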

## Compliance

1. **Data Protection**
   - GDPR compliance for EU businesses
   - CCPA compliance for California businesses
   - Regular data audits and cleanup
2. **Attribution**
   - Clear source attribution
   - Last-updated timestamps
   - Data accuracy disclaimers

## Best Practices

1. **Before Scraping**
   - Check robots.txt
   - Verify site status
   - Review the site's terms of service
   - Look for API alternatives
2. **During Scraping**
   - Monitor response codes
   - Respect server hints such as `Retry-After`
   - Implement backoff strategies (a backoff sketch follows this list)
   - Log access patterns
3. **After Scraping**
   - Verify data accuracy
   - Update cache entries
   - Clean up old data
   - Monitor opt-out requests
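
A sketch combining several of these practices, assuming Node 18+'s global `fetch`; `fetchWithBackoff` is an illustrative helper, and the retry policy shown is one reasonable choice rather than a prescribed one:

```js
// Retry on 429 and 5xx responses, honoring Retry-After when the server sends it
// and falling back to exponential backoff (1s, 2s, 4s, ...) otherwise.
async function fetchWithBackoff(url, options = {}, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url, options);
    if (res.status !== 429 && res.status < 500) return res;
    const retryAfter = Number(res.headers.get('retry-after')); // NaN or 0 when absent
    const delayMs = retryAfter > 0 ? retryAfter * 1000 : 2 ** attempt * 1000;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Giving up on ${url} after ${maxRetries} attempts`);
}
```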

## Contact

For questions or concerns about our scraping practices, contact us at contact@bizsearch.com.